01.02.02 Simple Patterns
To explain regular expressions and to understand the examples given in this tutorial, we first have to define how a regex will appear. I will enclose each regular expression in quotation marks („…“). If you want to test a regex, copy only the part between the quotation marks.
Testing Regex
Testing regular expressions? Yes, sure, this is possible.
Go to the home of Regex-Coach, download the software and install it. Please follow the instructions on that site!
Please, I really recommend that you download this utility. It will make it so much easier to follow the tutorial.
Simple Known Characters
Ok, let’s start with simple search patterns: „give or take“
Yes, you won’t believe it, this is already a regular expression: it
matches the string ‚give or take‘ in a text. Exactly these characters!
And no, this does not mean that this pattern matches either ‚give‘ or
‚take‘. The regular expression only matches if the characters in
quotation marks appear somewhere in the text!
Regular expressions are stubborn and stupid: they will look for exactly
what they are told to search for. They are case sensitive and they are
not interested in word boundaries unless told to be so. For example, our
first regex will find the characters in the following string: ‚You have
to forgive or take the consequences!‘
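If you would rather check this in code than in Regex-Coach, here is a minimal sketch. It assumes Python and its standard `re` module, which the tutorial itself does not use; the variable names are only for illustration.

```python
import re

pattern = "give or take"
text = "You have to forgive or take the consequences!"

# The regex ignores word boundaries: it is found inside 'forgive or take'.
match = re.search(pattern, text)
found = match.group(0)   # 'give or take'
start = match.start()    # 15, the position of the 'g' in 'forgive'

# Case sensitive by default: a capital 'G' does not match ...
strict = re.search(pattern, "Give or take a few")
# ... unless the regex is told otherwise, here with Python's re.IGNORECASE flag.
relaxed = re.search(pattern, "Give or take a few", re.IGNORECASE)

print(found, start, strict, relaxed is not None)
```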
Search Patterns for Metacharacters
Regular expressions can search for any character – alphanumeric, hexadecimal, binary numbers, etc.
A small but important exception is the group of characters that have a special meaning in regular expressions: the metacharacters.
Metacharacters are:
* + ? . ( ) [ ] { } \ / | ^ $
(Hi experts: Yes, you are right! I stretched the truth!! These are not
actually all metacharacters. But trust me, just assume that I am right
for now. We will see later why I prefer to define the above as
metacharacters).
I will explain these metacharacters later in the tutorial, step by step,
as many of them as necessary. Just one thing for now. If you want to
search for those characters as they stand you have to tell the regex
that you want to do so. The regular expression has to be told that you
don’t mean to use a metacharacter but want to search for it literally.
So you have to „escape“ or „mask“ the character with another character
(which of course is a metacharacter in itself <g>): the
backslash „\“.
If you want to match a question mark, the regex has to be „\?“. If it is a
slash you’re after, you have to enter „\/“. And, although it looks
odd, if you want to find a backslash you need to type two of them: „\\“.
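The escaping rule can be tried out with a short sketch, again assuming Python's `re` module. The sample text is made up. Raw strings (`r"..."`) are used so that Python itself leaves the backslashes alone and passes them on to the regex.

```python
import re

# Made-up sample text containing a backslash, a slash and a question mark.
text = r"Is C:\temp/ok? Yes."

question = re.findall(r"\?", text)    # ['?']
slash = re.findall(r"\/", text)       # ['/']
backslash = re.findall(r"\\", text)   # one literal backslash

# re.escape() adds the backslashes for you when a whole string is meant literally.
escaped = re.escape("1+1=2?")
literal_hit = re.fullmatch(escaped, "1+1=2?") is not None

print(question, slash, backslash, literal_hit)
```

In Python the slash has no special meaning, so „/“ alone would also work there; escaping it does no harm.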
Simple Unknown Characters
The first metacharacter we are going to learn is a dot: „.“. It represents
exactly one unknown character we want to match, no matter what this
character might be (Hello experts: let’s come to exceptions later. Ok?)
„M.ller“ will match ‚Miller‘, ‚Meller‘ or ‚Miller‘ within the word
‚Millerton‘ but not ‚Milton Keynes‘. „h..s“ matches ‚hips‘ or ‚hers‘.
And within the word ‚house‘, the same regex will match ‚hous‘.
Later we will learn about some more metacharacters; ones that will allow
us to look for more than one unknown character without repeating the
dot over and over.
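The dot examples above can be reproduced with a small sketch, assuming Python's `re` module. The last line shows one of the exceptions mentioned: by default, Python's dot does not match a newline character.

```python
import re

pattern = re.compile("M.ller")
words = ["Miller", "Meller", "Millerton", "Milton Keynes"]

# True where the regex is found somewhere in the word.
results = {word: pattern.search(word) is not None for word in words}

# Each dot stands for exactly one character: 'h', two unknowns, then 's'.
hous = re.search("h..s", "house").group(0)   # 'hous'

# One exception in Python: the dot skips newlines unless re.DOTALL is set.
newline_hit = re.search("a.b", "a\nb")       # None

print(results, hous, newline_hit)
```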
Groups of Characters and Character-Classes
Some metacharacters define groups of characters, making a very powerful
tool. There is a wide variety of these groups. Let’s start with the easy
ones:
„\d“ symbolizes a digit. „\d\d“ searches for any sequence of two digits.
„\w“ stands for any letter or digit or the underscore character (word). This group is called ‚alphanumeric characters‘.
With what we already know we can create our first more complicated looking regex:
„Re \[\d\]:“ searches for the string ‚Re‘ followed by a space, an
opening square bracket, any digit, a closing square bracket and finally a
colon in a text. Ooops, that looks like a Subject-line which was
created by someone who forgot the %SINGLERE in his reply template 😉
There are, of course, metacharacters which have the opposite meaning: „\W“ and „\D“ (non-word and non-digit).
\W stands for any non-alphanumeric character and \D means any character that is not a digit.
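Here is a short sketch of these groups in action, assuming Python's `re` module and made-up sample strings. One caveat: for ordinary Python strings, „\d“ and „\w“ also accept digits and letters from other scripts unless the `re.ASCII` flag is set. The samples below stay within plain ASCII.

```python
import re

# Made-up subject line with two reply markers.
subject = "Re [2]: Re [1]: meeting_notes 42"

replies = re.findall(r"Re \[\d\]:", subject)   # ['Re [2]:', 'Re [1]:']
two_digits = re.findall(r"\d\d", subject)      # ['42']

# The underscore counts as a word character, so \W skips it.
non_word = re.findall(r"\W", "a_b-c d")        # ['-', ' ']
non_digit = re.findall(r"\D", "4a2")           # ['a']

print(replies, two_digits, non_word, non_digit)
```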
Another elegant method to define your own group of characters is to use
square brackets [ ], which define ‚character classes‘. With
square brackets, the regex will search for exactly one character, no
matter how many characters are in between these brackets: „[AEX]“. This
class matches exactly one character, which must be A, E or
X.
You may even define ranges of characters. You don’t have to type in
every character of the range, no; regexian makes it easy for you: just
enter the first character of the range, a hyphen „-“ and the last
letter: „[e-z]“ means that all letters from e to z should be matched.
„[AEXe-z]“ is a combination of both: a one-letter string with one of
A,E,X or any letter within the range e to z.
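The three classes just described can be compared side by side in a minimal sketch, assuming Python's `re` module; the test words are invented for illustration.

```python
import re

# Each class matches exactly one character per hit.
one_of = re.findall("[AEX]", "AXE and EXAM")   # capital A, E or X only
in_range = re.findall("[e-z]", "Feed")         # lower-case e to z only
combined = re.findall("[AEXe-z]", "Axed")      # either of the two sets

print(one_of)     # ['A', 'X', 'E', 'E', 'X', 'A']
print(in_range)   # ['e', 'e']
print(combined)   # ['A', 'x', 'e']
```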
This is a powerful tool in regexian: „[0-1][0-9]\/[0-3][0-9]\/“ will
match only a MM/DD/ formatted date. Other combinations which are not a
date (e.g. 35/47/) won’t be found. (Yeah, you’re right! My regex will
match 19/39/ which isn’t a terrestrial date at all. We will get this one
later once we have learned some more elements….)
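The date pattern and its known weakness can be checked with a quick sketch, assuming Python's `re` module. The three test strings are the ones mentioned in the text plus one valid sample chosen for illustration.

```python
import re

date_prefix = re.compile(r"[0-1][0-9]\/[0-3][0-9]\/")

samples = ["02/19/", "35/47/", "19/39/"]
results = {s: date_prefix.search(s) is not None for s in samples}

# '19/39/' slips through: each digit is checked on its own, not as a number.
print(results)   # {'02/19/': True, '35/47/': False, '19/39/': True}
```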
You can negate character classes with one keystroke. Just add a „^“
after the opening square bracket and that’s it. ‚Find any character as
long as it isn’t 1,2,3 or 4!‘ in regexian is: „[^1-4]“. Oh, we should
remember this one for later. This funny ‚^‘ character has a totally
different meaning when not in square brackets!
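A negated class can be tried the same way. This is a minimal sketch assuming Python's `re` module and a made-up input string.

```python
import re

# Everything except the digits 1 to 4 is allowed, letters included.
not_1_to_4 = re.findall("[^1-4]", "12345a")

print(not_1_to_4)   # ['5', 'a']
```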
Overview and Summary
What did we learn in this chapter?
- Regular expressions search for any character. „er“ looks for exactly these letters in that order. All regexes are case sensitive unless told otherwise.
- Regexes use characters with a special meaning, the metacharacters: * + ? . ( ) [ ] { } \ / | ^ $ To find them literally they must be escaped with a preceding backslash.
- A dot „.“ is used to match a single unknown character. It is a metacharacter.
- There are metacharacters which symbolize groups of characters, like
\d for digits ([0-9])
\D for non-digits ([^0-9])
\w for alphanumeric characters ([a-zA-Z0-9_])
\W for non-alphanumeric characters ([^a-zA-Z0-9_])
- It is possible to define your own character classes by using square brackets, e.g. „[A-Z]“. A ^ as the first character inside the square brackets negates the class.
Exercises
What does each regex match?
1. "\d\d\.\d\d\.\d\d\d\d"
2. "\w\w\w, \d\d \w\w\w \d\d\d\d"
3. ".. \[[0-9]\]:"
4. "[a-zA-Z]"
1. Example:
First it matches two digits. Next come a backslash and a dot. That means the dot is escaped and is not a metacharacter, so the two digits have to be followed by a literal dot. Then come two more digits and another dot, and finally four digits! This is the European date format DD.MM.YYYY.
2. Example
In the second example the regex searches for three alphanumeric characters
followed by a comma, a space, two digits, another space. Next come
three alphanumeric characters, a space and finally four digits. Phew,
what could this be?
Well, it looks like a format for dates again, but this time in an
Anglo-American format: Tue, 19 Feb 2002. Well, like the first example,
this regex is not perfect. It only matches dates with two-digit days. We
will see later how we can modify the regex to find one- or two-digit
days.
3. Example:
The regex looks for any two characters and a space. The next character is a literal opening square bracket, because it is escaped. Then a square bracket follows which isn’t escaped by a preceding backslash: this opens a character class! Any digit in the range 0 to 9 matches here. Then come an escaped closing square bracket and a colon. This combination would match ‚Re [2]:‘
4. Example:
In the last example the regex looks for only one character. Any letter is allowed, even capital letters. Why isn’t „\w“ used? Well, that would include the underscore and perhaps the author doesn’t want to match that character 😉
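To check your answers, the four exercise patterns can be run against sample strings. This is a sketch assuming Python's `re` module. The samples ‚19.02.2002‘ and ‚q‘ are made up for illustration; the other two come from the explanations above. `re.fullmatch` requires the whole string to fit the pattern.

```python
import re

# Exercise regex -> a sample string it is expected to match completely.
exercises = {
    r"\d\d\.\d\d\.\d\d\d\d": "19.02.2002",
    r"\w\w\w, \d\d \w\w\w \d\d\d\d": "Tue, 19 Feb 2002",
    r".. \[[0-9]\]:": "Re [2]:",
    r"[a-zA-Z]": "q",
}
results = [re.fullmatch(p, s) is not None for p, s in exercises.items()]

# Known limits: a one-digit day fails, and '_' is not in [a-zA-Z].
one_digit_day = re.fullmatch(r"\w\w\w, \d\d \w\w\w \d\d\d\d", "Tue, 9 Feb 2002")
underscore = re.fullmatch(r"[a-zA-Z]", "_")

print(results, one_digit_day, underscore)
```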