01.02.03 Complex Patterns
Ok, that was an easy start! But it wasn’t very interesting, was it? If simple search patterns were all that „Regular Expressions“ had to offer, they wouldn’t be worth a tutorial.
So, there has to be more! Okay, let’s get going with the more complicated stuff:
Line Boundaries
Instead of having a regex look for text anywhere in the string, we can force it to search in specific parts of the string. These „anchored“ patterns have their own metacharacters: ^ and $
The circumflex ^ anchors the search pattern to the start of the line; the dollar $ makes the regex look for the pattern at the end of a line. (Yes, dear experts, for now, let’s take a string as one line. Ok?)
Example: „^give or take“ This pattern will only be matched if ‚give‘ is at the beginning of a line and is followed by ‚or take‘.
Or: „This is the end$“ is only matched if it appears at the end of the line. It doesn’t matter what comes before it: ‚This is the end‘ has to be the end of the line!
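If you would like to try the two anchors outside the regex tester, here is a minimal sketch. It uses Python's `re` module, which is an assumption on my part (the tutorial itself works with the TB! tester), and the sample strings are made up for illustration.

```python
import re

# ^ anchors the pattern to the start of the line
start_hit = re.search(r"^give or take", "give or take a few")
start_miss = re.search(r"^give or take", "you must give or take")

# $ anchors the pattern to the end of the line
end_hit = re.search(r"This is the end$", "My friend. This is the end")
end_miss = re.search(r"This is the end$", "This is the end, my friend")

print(start_hit is not None, start_miss is not None)  # True False
print(end_hit is not None, end_miss is not None)      # True False
```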
You can use these two metacharacters to speed up the regex. I admit, it is not all that important when you use regex in TB! because you won’t be working with large amounts of data. But on the other hand: it can’t hurt anyone 😉
Why does the regex work faster if you use the circumflex or the dollar, you ask? Ok, let’s use our example regex „^give or take“ on the string ‚Once upon a time‘: the regex machine first checks whether it is at the beginning of the line. This returns TRUE. Next it checks whether the following character is a ‚g‘. This returns FALSE, so the search is cancelled at once! Now what would have happened without the circumflex? The regex machine would have tried again from the second, third, fourth etc. character, only to find out that the search pattern doesn’t exist in that string. The longer the string, the more time the regex machine takes to fail 😉
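To make the reasoning above a bit more tangible, here is a deliberately naive model of a matcher that simply counts how many start positions it tries. This is only a sketch: `count_attempts` is a hypothetical helper written for this illustration, it only handles literal text, and real regex engines are far smarter about skipping hopeless positions.

```python
def count_attempts(needle, haystack, anchored):
    """Count the start positions a naive matcher tries for a literal needle."""
    attempts = 0
    # An anchored pattern may only start at position 0;
    # an unanchored one is tried at every position in turn.
    last_start = 0 if anchored else len(haystack)
    for start in range(last_start + 1):
        attempts += 1
        if haystack.startswith(needle, start):
            break  # found it, stop searching
    return attempts


text = "Once upon a time"
anchored_tries = count_attempts("give or take", text, anchored=True)
free_tries = count_attempts("give or take", text, anchored=False)

print(anchored_tries)  # 1
print(free_tries)      # 17, one try per position including the very end
```

In this toy model the unanchored search grows with the length of the string, while the anchored one gives up after a single try.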
Word Boundaries
But there is more that regexian offers. Word boundaries! Some people forget about this because they think there is another way to define word boundaries. Believe me, there is, but it’s nowhere near as easy as this!
„\b“ makes the regex search for the pattern at word boundaries: „\bgive or take“.
Hey, we know this one, don’t we? That is our first example again! Without „\b“ the pattern was found in ‚You have to forgive or take the consequences!‘, but thanks to the word boundary metacharacter it no longer is.
I remember a discussion in one of the German TB-lists where someone asked why this metacharacter is necessary, because a word could be recognized by surrounding spaces. This is not a good idea: words can also end at a question mark, an exclamation mark, a full stop... A regex like „ain “ would indeed match ‚Again a good idea‘ but wouldn’t find ‚Oh no, not again.‘ You can avoid that when you use „\b“ instead of the space.
Of course, this metacharacter can be negated, as can the others: „\B“ means that the regex should match everywhere in a string except at word boundaries.
Another example should explain this: „Re\B.“ The regex has to match the characters ‚Re‘ as long as they are not followed by a word boundary, and then any other character (the dot). Now, we have the string: ‚Re: or Reply:‘. Try it in the regex tester. What happens? The result is ‚Rep‘. Replace \B by \b and the regex matches ‚Re:‘. Everything clear now?
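If you have no regex tester at hand, the same experiment can be sketched in Python's `re` module (an assumption; the tutorial uses the TB! tester):

```python
import re

subject = "Re: or Reply:"

# \B: 'Re' must NOT be followed by a word boundary, so 'Re:' is skipped
not_boundary = re.search(r"Re\B.", subject).group()

# \b: 'Re' must be followed by a word boundary, so 'Reply' is never reached
boundary = re.search(r"Re\b.", subject).group()

print(not_boundary)  # Rep
print(boundary)      # Re:
```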
Alternatives
You remember the first example in this tutorial, „give or take“? When I introduced it I made the redundant remark that this regex wouldn’t match ‚give‘ OR ‚take‘. Well, this remark wasn’t really redundant: I needed something to start this chapter, some kind of transition <bg>. Because this is the chapter that explains how we can use the OR; how alternative patterns are defined.
To search for alternative patterns, regexian offers a special metacharacter: the vertical bar, perhaps better known as the pipe symbol „|“. So, what would have been necessary to search for ‚give‘ or ‚take‘? „give|take“. At each position in the string the regex checks whether it matches ‚give‘. If not, it checks for ‚take‘ at that position.
What happens if the string contains both alternatives? Well, to be
honest, when I started with regex I was convinced that the first
alternative in the regular expression would be matched. But no! The
regex will match the alternative that comes first in the string! Let’s
get into details with an example:
Given the regex „this|the|that“ and the string ‚the hand that signed
this paper‘ (Ok, ok. You didn’t really expect sample strings from
Shakespeare or Yeats, did you?) What does the regex return? ‚the‘ is the
answer! Try it in the regex-tester!
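Here is the same test as a small sketch in Python's `re` module (my choice of tool, not the tutorial's). It also lists every match in order, which shows that the position in the string decides, not the order of the alternatives.

```python
import re

line = "the hand that signed this paper"

first = re.search(r"this|the|that", line)
print(first.group(), first.start())  # the 0

# All matches, in the order they appear in the string
all_hits = re.findall(r"this|the|that", line)
print(all_hits)  # ['the', 'that', 'this']
```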
You may combine alternatives as you have seen in the last example. Just
have a look at the following „^re:|^aw:|^fwd:“. This means that in all
three alternatives the regex has to match the beginning of the line
first. Some characters follow and each alternative ends with a colon.
Yep, you are right: there must be a way to simplify this one. And as in mathematics, you can use brackets to factor out the common parts and make the regex shorter: „^(re|aw|fwd):“.
Well, those simplifications do not necessarily make it easier to read:
„th(is|e|at)“ would be a correct and simple alternative to the first
example in this chapter but it is not exactly an easy-to-read example.
😉
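To convince yourself that the long and the short form really behave the same, you could compare them on a few subject lines. A sketch in Python's `re` module follows; the sample subjects are invented, and they are all lower case because the patterns are written in lower case.

```python
import re

long_form = re.compile(r"^re:|^aw:|^fwd:")
short_form = re.compile(r"^(re|aw|fwd):")

subjects = ["re: hello", "aw: hallo", "fwd: news", "hello re: again"]
long_results = [long_form.search(s) is not None for s in subjects]
short_results = [short_form.search(s) is not None for s in subjects]

print(long_results == short_results)  # True
print(short_results)                  # [True, True, True, False]

# The shortened "this|the|that" still finds 'the' first
shortened = re.search(r"th(is|e|at)", "the hand that signed this paper")
print(shortened.group())  # the
```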
Special Character Groups and Classes
We have already introduced some of the special search patterns for groups and classes of characters. I would like to present some others of varying significance.
In almost every real regex you find the character class „\s“. It represents so-called whitespace characters, that is, any character which produces white space on the screen: space, tab, carriage return, line feed (newline). It’s ok if you just remember that any void space in a string will be matched. And, of course, you may negate this pattern: „\S“ matches any character that does not appear as white space in the string.
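A quick sketch of both classes, assuming Python's `re` module and a made-up sample string that contains a space, a tab and a line feed:

```python
import re

sample = "one two\tthree\nfour"

whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', '\n']

# Splitting at every whitespace character leaves the words
words = re.split(r"\s", sample)
print(words)  # ['one', 'two', 'three', 'four']

# \S matches each character that is not whitespace
non_space_count = len(re.findall(r"\S", sample))
print(non_space_count)  # 15
```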
„\A“ is a seldom-used search pattern: it matches the beginning of the string. This is not the beginning of the line; no, to search for that we would have used ^. Later, when we talk about options like multiline, you will see where you can use this one. „\Z“ is related to „\A“: \Z matches the end of the string, and again I can only say: „This is not the end of the line“, because that would have been $. You will see the difference when we talk about options. Sorry, but you have to be patient 🙂
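For the impatient, here is a small preview of that difference. It is only a sketch in Python's `re` module, where the multiline option is spelled `re.MULTILINE`; the two-line sample string is invented. Note that in Python \Z matches only at the very end of the string, while other flavours (PCRE, for instance) treat a final line break slightly differently, so check your own tool.

```python
import re

text = "first line\nsecond line"

# Without the multiline option, ^ only matches at the start of the string
caret_plain = re.search(r"^second", text)

# With the multiline option, ^ matches at the start of every line ...
caret_multi = re.search(r"^second", text, re.MULTILINE)

# ... but \A still means the start of the whole string
string_start = re.search(r"\Asecond", text, re.MULTILINE)

# $ may match at the end of the first line, \Z only at the end of the string
dollar_pos = re.search(r"line$", text, re.MULTILINE).start()
string_end_pos = re.search(r"line\Z", text, re.MULTILINE).start()

print(caret_plain is not None, caret_multi is not None, string_start is not None)
# False True False
print(dollar_pos, string_end_pos)  # 6 18
```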
Overview and Summary
This chapter explained some more possibilities in defining search patterns:
- Line boundaries are matched by circumflex ^ (beginning of a line) and dollar $ (end of a line).
- Word boundaries are matched by „\b“. It matches the position at the beginning or the end of a word, not a character. „\B“ matches every position that is not at the beginning or the end of a word.
- Search patterns can contain alternatives to match. The alternatives are separated by a vertical bar „|“. Characters that appear at the same place in each alternative can be placed before or after brackets that enclose the alternatives: „^(Re|Aw|Fwd)“. Here every alternative must appear at the beginning of the line to be matched.
- Spaces, tabs and line breaks are so-called whitespace characters, for which a special search pattern exists: \s. The negation is \S.
- Beginning and end of a string: \A and \Z
Exercises
1. Given the regular expression „(R.:$|^R.:)“ and the string ‚Ra: or Re:‘. What does the regex match?
2. I want to match ‚Re:‘ at the beginning of a line even if it comes
with a reply counter e.g. ‚Re[2]:‘. With what we have learned about
regular expressions so far: what is the regex for doing that?
3. Let’s try to DIY a regex that matches ‚Re‘ at the beginning of a line or ‚)‘ at the end of a line.
4. What do these search patterns mean?
a) „^“
b) „^x$“
c) „^$“
Solutions
1. Example
The regex matches ‚Ra:‘. We expected that, didn’t we? The regex matches the alternative which comes first in the string.
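If you want to double-check this one without the regex tester, a sketch in Python's `re` module (an assumption, as the tutorial uses the TB! tester) could look like this:

```python
import re

ex1 = re.search(r"(R.:$|^R.:)", "Ra: or Re:")
print(ex1.group(), ex1.span())  # Ra: (0, 3)

# Looking for every match shows that the first alternative works as well,
# it just comes later in the string.
ex1_all = re.findall(r"(R.:$|^R.:)", "Ra: or Re:")
print(ex1_all)  # ['Ra:', 'Re:']
```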
2. Example
Ooops, the solution of the second exercise already looks quite professional, doesn’t it: „^(Re|Re\[\d\]):“ Ok, maybe you have something different; something that looks a bit simplified like: „^Re(|\[\d\]):“. It is a good example because the simplified version shows an absolute void as the first alternative in the brackets: the ‚|‘ symbol has nothing to the left of it other than the open bracket that starts the „sub-string“.
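Both variants can be compared side by side. The following is a sketch in Python's `re` module with invented subject lines; the last subject is there to show that a single \d only covers one-digit counters.

```python
import re

variant_a = r"^(Re|Re\[\d\]):"
variant_b = r"^Re(|\[\d\]):"

subjects = ["Re: hi", "Re[2]: hi", "Fwd: Re: hi", "Re[12]: hi"]
results_a = [re.search(variant_a, s) is not None for s in subjects]
results_b = [re.search(variant_b, s) is not None for s in subjects]

print(results_a)  # [True, True, False, False]
print(results_b)  # [True, True, False, False]
```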
3. Example
„(^Re|\)$)“ is one solution. You didn’t forget to escape the bracket, did you? Fine, well done *g*. Now, if you can, try this one in the regex-tester with the following string: ‚Re[2]: bladibla (was: more bla)‘. You will see that the regex matches exactly ‚Re‘, because at this point the regex machine returned TRUE for the match. Only if the beginning of the string is changed to something else will the regex match the bracket.
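The same experiment, sketched in Python's `re` module instead of the regex-tester. The second subject, starting with ‚Aw‘, is my own variation to show the other alternative at work.

```python
import re

pattern = r"(^Re|\)$)"

with_re = re.search(pattern, "Re[2]: bladibla (was: more bla)").group()
without_re = re.search(pattern, "Aw[2]: bladibla (was: more bla)").group()

print(with_re)     # Re
print(without_re)  # )
```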
4. Example
The first pattern searches for any text that has a beginning of a line, that is, any line at all. This would include any text; even a void line would be matched.
The second pattern just looks for a single x character that is alone in a line.
Last, but not least, the third pattern: it searches for lines that have a beginning and an end, but nothing else: these are void lines!
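To see all three patterns at work, you could run them over a handful of sample lines. This is a sketch in Python's `re` module; the sample lines are invented and contain no line breaks.

```python
import re

lines = ["", "x", "xx", "text"]

# For each pattern, collect the sample lines it matches
matched = {}
for pattern in [r"^", r"^x$", r"^$"]:
    matched[pattern] = [line for line in lines if re.search(pattern, line)]

print(matched[r"^"])    # ['', 'x', 'xx', 'text']
print(matched[r"^x$"])  # ['x']
print(matched[r"^$"])   # ['']
```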