Quantifier, Subpattern, Grouping of Elements
Everything we’ve had so far hasn’t been too difficult. But this chapter is heavy stuff. Please do me a favour: read this chapter carefully. Be patient! Try everything with the regex tester and get familiar with the elements in this chapter: they are essential for creating proper regexes. Although this may be a bit more complicated than the chapters before, it is certainly more interesting 😉
Quantifier
We already know how to define patterns for matching single characters, groups
of characters, character classes or ranges of characters. We can use
alternatives in our search patterns. But something of absolutely vital
interest is missing – the ability to define repetitions.
You remember the example that was a regex to search for the European formatted date:
„\d\d\.\d\d\.\d\d\d\d“
For every single digit we wrote „\d“. Isn’t there another way, much
simpler than repeating the metacharacter as often as the regex wants to
find the character? Yes, there is! There are quantifiers!
+ * ? are the most important quantifiers.
The „+“-character means that the character preceding the plus-sign has
to appear at least once at the specific point of the string. „fo+l“
matches ‚fool‘, ‚fol‘ and ‚foooool‘.
„Re:\s+“, for example, means that at least one whitespace has to follow ‚Re:‘ to be matched.
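If you would like to check the plus-sign outside the regex tester, here is a minimal sketch in Python. It assumes Python's built-in re module, which is not the regex machine TB uses but treats these simple patterns the same way; the word list is made up for illustration.

```python
import re

# "fo+l": an 'f', at least one 'o', then an 'l'
plus = re.compile(r"fo+l")
words = ["fol", "fool", "foooool", "fl"]
plus_hits = [w for w in words if plus.fullmatch(w)]
print(plus_hits)  # ['fol', 'fool', 'foooool']

# "Re:\s+": at least one whitespace character has to follow 'Re:'
reply = re.compile(r"Re:\s+")
with_space = reply.match("Re: hello") is not None
without_space = reply.match("Re:hello") is not None
print(with_space, without_space)  # True False
```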
I hear some of you experts: yes, the usage of quantifiers is not only
restricted to characters. You can use them to repeat metacharacters,
character classes and some other elements we are yet to learn. 😉
The star „*“ represents any number of occurrences of the preceding
character at the specific point in the string. ‚Any‘ really means ‚any‘,
even if the character doesn’t appear at all. Ooops, what’s the use of
that?
Well, let’s have a look at the following example: „Re:\s*\w+“
Huh, that already looks as cryptic as those regex the experts use <g>. What does this regex mean?
Search for a ‚Re‘ followed by a colon. Then any number of whitespace
characters may appear – even no spaces at all. What for? In proper
subject lines there should be a space. But imagine we would like to
match any subject string even if someone modified it manually and
deleted the space. We have to tell the regex that there might or might
not be a space. Anyway, both possibilities should be found. This can be
done with the star as quantifier. Well, finally, there has to be at
least one alphanumeric character.
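The same regex can be tried in a few lines of Python. This is only a sketch using Python's re module as a stand-in for the regex tester; the sample subjects are invented.

```python
import re

# 'Re:', then any number of whitespace characters (even none),
# then at least one word character
subject = re.compile(r"Re:\s*\w+")
samples = ["Re: Hello", "Re:Hello", "Re:   "]
subject_hits = [s for s in samples if subject.match(s)]
print(subject_hits)  # ['Re: Hello', 'Re:Hello']
```

The last sample fails because nothing alphanumeric follows the spaces.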
Caution: the meaning of this quantifier is sometimes misinterpreted.
Look at the following task: a regex has to be defined that matches only
those lines of a string that consist entirely of digits. One solution I saw was:
„^[0-9]*$“
But this regex matches empty lines as well; the star stands for ‚no
digit‘ as well as for ‚any number of digits‘. So the regex machine returns TRUE
when a line is empty. If you want to make sure that there is at
least one digit in a line you have to use the plus-sign: „^[0-9]+$“.
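The difference between the two patterns shows up as soon as an empty line is tested. A small sketch, again assuming Python's re module and a made-up list of lines:

```python
import re

lines = ["12345", "", "12a45", "7"]

# The star accepts "no digit at all", so the empty line slips through
star_lines = [ln for ln in lines if re.search(r"^[0-9]*$", ln)]
# The plus-sign insists on at least one digit
plus_lines = [ln for ln in lines if re.search(r"^[0-9]+$", ln)]

print(star_lines)  # ['12345', '', '7']
print(plus_lines)  # ['12345', '7']
```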
The question mark means that the preceding character may appear once or
not at all at the specific point of the string. It is a bit like the star, except
that the maximum number of occurrences is 1. „h..?s“ matches
‚hers‘, ‚hips‘, ‚his‘ and ‚has‘. Within ‚house‘ it matches ‚hous‘;
within ‚hose‘ it matches ‚hos‘.
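The examples for the question mark can be reproduced with a short Python sketch (Python's re module is an assumption here, standing in for the regex tester):

```python
import re

# 'h', one arbitrary character, an optional second one, then 's'
optional = re.compile(r"h..?s")
words = ["hers", "hips", "his", "has", "house", "hose"]
found = {w: optional.search(w).group() for w in words}

for word, hit in found.items():
    print(word, "->", hit)
```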
There is another way to define repetitions: „{x,y}“. This lets you
explicitly define how many repetitions of the preceding character you
want. In this formula ‚x‘ denotes the minimum number and ‚y‘ the maximum
number of repetitions of the preceding character. „\d{2,4}“ means that
two to four digits in a row are matched.
If you omit the second number ‚y‘ but leave the comma in the curly
brackets „{x,}“, then there is no upper limit and the minimum is x-times
the preceding character. „\w{3,}“ matches any string with at least
three word-characters.
If you omit not only the second number but the comma as well „{x}“, then
this means the exact number of appearances of the preceding character.
„\d{6}“ matches exactly six digits. This quantifier gives us a new way
to write our regex that matches European formatted dates :
„\d{2}\.\d{2}\.\d{4}“
The three quantifiers I introduced at the beginning of this chapter are simply special ways to write one of the following regex:
{0,1} = ?
{1,} = +
{0,} = *
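These equivalences can be checked mechanically. The following sketch assumes Python's re module and uses the letter ‚a‘ and a handful of test strings chosen purely for illustration; the helper function accepted is hypothetical.

```python
import re

samples = ["", "a", "aa", "aaa"]
pairs = [("a?", "a{0,1}"), ("a+", "a{1,}"), ("a*", "a{0,}")]

def accepted(pattern):
    """Return the samples that the pattern matches completely."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Each short form accepts exactly the same samples as its brace form
equivalent = all(accepted(short) == accepted(braces) for short, braces in pairs)
print(equivalent)  # True

# The date regex rewritten with braces
date = re.compile(r"\d{2}\.\d{2}\.\d{4}")
date_ok = date.fullmatch("23.05.2002") is not None
print(date_ok)  # True
```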
Before I can tell you more about quantifiers and what has to be kept in
mind when using them, I have to introduce parentheses (round brackets)
as a grouping device.
Grouping of Elements, Subpattern and Quantifiers again
Grouping of Elements
In the chapter about alternatives, the parentheses crossed our way for the
first time. They were used as they are in maths: common parts of the
pattern are written outside the round brackets.
Now we will learn something new: we can use the parentheses to group
parts of the regex to be dealt with as a single element of the pattern. A
following quantifier is applied to the grouped part of the regex. E.g.:
„foo(bar)?“ matches ‚foo‘ and ‚foobar‘
Another example:
„Re\s*(\[\d+\])?:“ There it is again, the reply counter in a subject
line. This time it looks already quite professional. First of all we
look for ‚Re‘. After any number of whitespaces (or none at all) digits
in square brackets may follow. This part is grouped. Finally there has
to be a colon.
Let’s have a closer look at the regex: why is it defined in that way?
First the whitespaces: we don’t know whether the author of the subject
line inadvertently added one or more spaces after the ‚Re‘. Even if he
did nothing and left the string untouched we want the Regex to match the
string. Well, I agree, there shouldn’t be any space, but you never know
… 😉 That’s why we use „\s*“ at this point.
Then the digits in square brackets: we allow any number of digits in the
square brackets by using the plus-sign as quantifier. But there has to
be at least one digit! Because there is no upper limit for this
character, the way to infinity is free <vbg>.
Finally the counter ‚[#]‘ itself: this part is grouped. This element
need not appear in the string to result in a successful match. That is
why we use the question mark.
The regex therefore will match:
‚Re:‘
‚Re [1]:‘
‚Re[123]:‘
It will not match ‚Re[]:‘. Something to think about and to try on your
own: what has to be changed so that the regex matches this one?
Ok, here is the solution: replace the ‚+‘-sign in the square bracket with a star: „Re\s*(\[\d*\])?:“
Within ‚Re [1]: [3]:‘ it matches ‚Re [1]:‘. It does not match the second
reply counter. Ok, if we want to find such awful subject lines we have
to work on our regex a bit more: it should match any number of counters
that may have colons and -you never know – that may or may not be
followed by spaces. Finally the last character has to be a colon or
none: „Re\s*(\[\d+\]:*\s*)*:?“
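Here is a sketch that runs the three reply-counter patterns side by side. It assumes Python's re module; the variable names are mine and the subject strings are the ones discussed above.

```python
import re

simple = re.compile(r"Re\s*(\[\d+\])?:")          # optional counter with digits
empty_ok = re.compile(r"Re\s*(\[\d*\])?:")        # counter may be empty: '[]'
multi = re.compile(r"Re\s*(\[\d+\]:*\s*)*:?")     # any number of counters

subjects = ["Re:", "Re [1]:", "Re[123]:", "Re[]:"]
simple_hits = [s for s in subjects if simple.match(s)]
print(simple_hits)  # ['Re:', 'Re [1]:', 'Re[123]:']

empty_counter = empty_ok.match("Re[]:") is not None
print(empty_counter)  # True

awful = "Re [1]: [3]:"
first_only = simple.match(awful).group()
all_counters = multi.match(awful).group()
print(first_only)    # Re [1]:
print(all_counters)  # Re [1]: [3]:
```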
Well, it is possible for a subject to begin like that although there is
only a small probability that it will really happen. I can envisage many
combinations of reply counters. The regex does not match all of
them. If you want to have the regex match other combinations, go ahead,
try it! Test it with a regex of your own making, but: there is one major
point you should keep in mind. There is no perfect Regex. The more you
try to improve the regex to match even more possibilities and
combinations of characters, the more complicated the result will be. You
will have to pay for this kind of perfectionism: either you won’t be
able to read your regex anymore or the Regex will become buggy whenever
you make even the smallest change to it. It is easier to live with some
erroneous matches and to sort them out manually than to create the
perfect Regex. Jeffrey Friedl published a regex to match email-addresses
in „Mastering Regular Expressions“: it is more than 6000 bytes. It was a
good example of being too perfect, as he stated.
Ok, back to the job-in-hand: let’s have another example of how to group
elements. We had a pattern to match European formatted dates:
„\d{2}\.\d{2}\.\d{4}“ As you can see, the beginning „\d{2}\.“ is
repeated. Right, so this can be simplified: „(\d{2}\.){2}\d{4}“ The
first part, now grouped in parentheses, has to appear twice. This is for
example ‚01.02.‘. This is not an optimal version of the search pattern:
day and month numbers still have to be two digit numbers and silly
values for both are still allowed. But wait; you will get your chance.
Let us learn some more elements before you are given the job of
optimising the pattern in an exercise <g>.
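A quick sketch shows both the simplification and its weakness. Python's re module is assumed, and the test strings are invented.

```python
import re

# The group '\d{2}\.' has to appear exactly twice, then four digits
date = re.compile(r"(\d{2}\.){2}\d{4}")

two_digit = date.fullmatch("01.02.2003") is not None
one_digit = date.fullmatch("1.2.2003") is not None
silly = date.fullmatch("99.99.9999") is not None

print(two_digit)  # True
print(one_digit)  # False: day and month must have two digits
print(silly)      # True: silly values are still accepted
```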
Subpattern
Grouping with parentheses has another effect in regexian that is widely used in a
lot of regular expressions in TB. Characters that were found due to a
grouped pattern or element are stored in a temporary variable for
further use. These variables are known as subpatterns (SubPatt in TB).
We should have a look at an example to help us understand that:
‚bill.doors@macrohard.com‘
We use the regex „(\w+)\.(\w+)@.*“. The first pair of parentheses matches
‚bill‘, the second one ‚doors‘. These two are now stored
in subpattern 1 and subpattern 2 respectively.
Or:
„(\d+\.)(\d+\.)“ When the string is ‚22.05.‘ then ‚22.‘ is stored in subpattern 1 and ‚05.‘ in subpattern 2.
How do I find out which is the first subpattern? Well, in our simple
examples it is obvious: everything that is matched by the first pair of
round brackets goes to subpattern 1, the second pair returns subpattern
2, etc
But what if the regex looks like: „Re\s*(\[(\d+)\])*:“ The part that is
enclosed by the first opening bracket and its corresponding closing
bracket is stored in subpattern 1. The part that is enclosed by the
second pair starting at the second opening bracket is stored in
subpattern 2. With ‚Re [4]:‘ our example would result in:
Subpattern 1 = ‚[4]‘
Subpattern 2 = ‚4‘
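The numbering rule can be watched at work in a short sketch. It assumes Python's re module, where subpatterns are called groups and group(0) is the whole match.

```python
import re

# Simple case: two pairs of parentheses side by side
name = re.match(r"(\w+)\.(\w+)@.*", "bill.doors@macrohard.com").groups()
print(name)  # ('bill', 'doors')

# Nested case: subpatterns are numbered by their opening bracket
m = re.match(r"Re\s*(\[(\d+)\])*:", "Re [4]:")
nested = (m.group(0), m.group(1), m.group(2))
print(nested)  # ('Re [4]:', '[4]', '4')
```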
Important: each opening bracket creates a new variable or subpattern. If
you want to avoid this, you have to insert „?:“ just behind the opening
bracket: „(?:…)“ is a grouping which does not store the matched string
in a subpattern.
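As a sketch of the effect, here is the reply-counter regex with the outer parentheses turned into a non-storing group. Python's re module is assumed; it supports the same „(?:…)“ syntax.

```python
import re

# The outer group only groups; the inner one is now subpattern 1
m = re.match(r"Re\s*(?:\[(\d+)\])*:", "Re [4]:")
stored = m.groups()
print(stored)  # ('4',)
```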
What does the regex-machine store in a subpattern when a quantifier is
applied on a grouped element? Example: „(\d{2}\.){2}\d{4}“
If the string is ‚23.05.2002‘ the first pattern is matched at ‚23.‘. Now
the regex machine goes on to find the same pattern in the string a
second time. If successful the matched characters are stored in the same
subpattern. In other words: the second match overwrites the first one.
In our example the subpattern will show ‚05.‘.
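The overwriting is easy to observe. In this sketch (Python's re module assumed) the second pattern, which writes the group out twice, is my own addition to show how both parts could be kept.

```python
import re

# One group repeated twice: only the last repetition is kept
repeated = re.fullmatch(r"(\d{2}\.){2}\d{4}", "23.05.2002")
last_only = repeated.group(1)
print(last_only)  # 05.

# Two separate groups: each repetition gets its own subpattern
separate = re.fullmatch(r"(\d{2}\.)(\d{2}\.)\d{4}", "23.05.2002")
both = separate.groups()
print(both)  # ('23.', '05.')
```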
The regex-tester shows the contents of each subpattern: with every
subpattern it will offer another tab panel. That one with ‚0‘ on it
shows the whole match, while that with ‚1‘ on it shows the match of the
first subpattern, etc.
And Quantifier again
Ok, now let’s move on to some special behaviour relating to quantifiers.
Some of them have a ‚human‘ peculiarity: they are greedy! You don’t
believe that? Well, look at the following string <g>:
„The abbreviation 'ISP' stands for 'Internet Service Provider'.“
We want a regex that finds the text that is enclosed by inverted commas and stores it in a subpattern:
„(.*)'(.*)'.*“
Nothing difficult really: find everything that comes before an inverted
comma, then everything in between and finally everything that follows…
And? Did you try it on the regex-tester? What is in subpattern 2?
‚Internet Service Provider‘. Ooops, I expected ‚ISP‘ because it comes
first in the string. 😮 It is quite obvious that the first group (.*)
greedily matched as much of the string as it could and left only the
last pair of inverted commas for the rest of the regex. Furthermore, the
last element „.*“ in the regex allows ‚nothing‘ to follow:
this part leads to a successful match even if
nothing is left to be matched. The star stands for as many appearances as
there are or none at all!
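If you have no regex-tester at hand, the greedy behaviour can be reproduced with a few lines of Python. This is a sketch assuming Python's re module and assuming that the inverted commas in the string are plain apostrophes.

```python
import re

text = "The abbreviation 'ISP' stands for 'Internet Service Provider'."
m = re.match(r"(.*)'(.*)'.*", text)

greedy_first = m.group(1)
greedy_second = m.group(2)
print(repr(greedy_first))   # "The abbreviation 'ISP' stands for "
print(repr(greedy_second))  # 'Internet Service Provider'
```

The first group has swallowed ‚ISP‘ together with its inverted commas.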
Ok, here’s another example:
We want to extract as many parts of an email-address as possible. We’ve
already got a solution for the first part, the name; but that wasn’t a
good one because it only allowed word characters. We have to make this
more generic. Let’s take (.*) for the first part. The second part is
some text delimited by a dot. But this may appear more than once before
the @-sign ends the name section. The Regex should therefore find the
following examples of addresses:
‚1234abc@mail.com‘
‚1234.abc@mail.com‘
‚12-34.abc.def@mail.com‘
So, the regex starts with „(.*)\.?(.*)*@“. After that any text may
follow, possibly delimited by more dots. We will ignore this for the
example and go for extracting only that text that comes last after the
last dot, so that the regex does not get too complicated. This should be
done with „(.*)\.(.*)“
„(.*)\.(.*)*@(.*)\.(.*)“
What do we expect in the subpatterns for ‚12-34.abc.def@mail.com‘?
Subpattern 1 = ‚12-34‘ ?
Subpattern 2 = ‚.abc‘ or ‚.def‘ or ‚abc.def‘ ?
Subpattern 3 = ‚mail‘ ?
Subpattern 4 = ‚com‘ ?
Ask the regex-tester:
Subpattern 1 = ‚12-34.abc‘
Subpattern 2 = ‚def‘
Subpattern 3 = ‚mail‘
Subpattern 4 = ‚com‘
Subpattern 1 contains almost all of the first part, subpattern 2
only the last three characters before the @. Of course, we expected
that, didn’t we? We already know that the star is greedy: it stored as
many characters as it could in the first subpattern.
Caution: not only stars, I mean star-signs are mean and greedy <vbg>, the plus-sign is as well! Don’t forget that!
Let’s take another string to test the regex:
‚12-34.abc.def@mail.test.com‘. Now the star in the third pair of parentheses
„(.*)“ is greedy and ‚eats‘ almost everything after the @ up to the last
dot, storing ‚mail.test‘ and not ‚mail‘.
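Both test strings can be run through a short sketch. It assumes Python's re module, and it uses „(.*)“ instead of „(.*)*“ for the second group: that is a simplification of mine, because the extra star is not needed for these strings and a star on a group that can match nothing is not handled identically by every regex machine.

```python
import re

# Simplified variant of the email regex (second group without the extra star)
email = re.compile(r"(.*)\.(.*)@(.*)\.(.*)")

short_host = email.match("12-34.abc.def@mail.com").groups()
long_host = email.match("12-34.abc.def@mail.test.com").groups()

print(short_host)  # ('12-34.abc', 'def', 'mail', 'com')
print(long_host)   # ('12-34.abc', 'def', 'mail.test', 'com')
```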
How can we avoid that? We are going to learn another meaning of the
question mark (Calm down, this is only the second one. There are many
more to come and you will eventually come to understand why a regex is
full of these funny question marks *g*): just add a question mark to the
greedy pattern and you make the pattern less greedy.
Let’s do that. We add a ?-sign to the first pattern:
„(.*?)\.(.*)@(.*)\.(.*)“
Subpattern 1 = ‚12-34‘
Subpattern 2 = ‚abc.def‘
Subpattern 3 = ‚mail.test‘
Subpattern 4 = ‚com‘
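The result can be checked with a small sketch, assuming Python's re module. The second pattern, with a question mark in the third group as well, is not part of the text above; I added it only to show what happens to the host part.

```python
import re

address = "12-34.abc.def@mail.test.com"

# Only the first group is ungreedy
first_lazy = re.match(r"(.*?)\.(.*)@(.*)\.(.*)", address).groups()
print(first_lazy)  # ('12-34', 'abc.def', 'mail.test', 'com')

# Assumption for illustration: the third group made ungreedy too
third_lazy = re.match(r"(.*?)\.(.*)@(.*?)\.(.*)", address).groups()
print(third_lazy)  # ('12-34', 'abc.def', 'mail', 'test.com')
```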
For a better understanding I shall try to explain what the regex-machine
does. A greedy (.*) first takes as much as it can and then steps back one
character at a time until the rest of the regex matches. The ungreedy
(.*?) works the other way round: it starts with as little as possible,
namely nothing at all, and takes one more character only when the rest
of the regex cannot match from the current position.
I’m going to explain it using our example regex „(.*?)\.(.*)*“ and the
string ‚12-34.abc.def‘. The greedy version (.*) would store ‚12-34.abc‘
in the first subpattern. This is the maximum that the Regex allows because a
dot and some text have to follow. With the question mark the machine
starts with an empty first subpattern and checks whether the next
element, the dot, matches. No, it does not: the next character is ‚1‘.
So it adds the ‚1‘ to the subpattern and checks again. Still no hit. This
goes on, one character at a time, until ‚12-34‘ is stored and the next
character is the first dot. Now the dot matches, the remaining „(.*)*“
takes the rest of the string, and the match is successful. The machine
stops right there, at the first position that works, so the first
subpattern holds the minimum of characters needed for a successful
match. I reckon we’ve looked deep enough into the way it works for
now.
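To make the minimum and the maximum visible, the following sketch lists every possible content of the first subpattern by hand and compares it with what the two quantifiers choose. It assumes Python's re module and, as a simplification of mine, writes the second group as „(.*)“ without the extra star.

```python
import re

text = "12-34.abc.def"

# Every prefix after which a dot plus arbitrary text can match the rest
candidates = [text[:i] for i in range(len(text) + 1)
              if re.fullmatch(r"\..*", text[i:])]
print(candidates)  # ['12-34', '12-34.abc']

lazy_choice = re.fullmatch(r"(.*?)\.(.*)", text).group(1)
greedy_choice = re.fullmatch(r"(.*)\.(.*)", text).group(1)

print(lazy_choice)    # 12-34      (the shortest candidate)
print(greedy_choice)  # 12-34.abc  (the longest candidate)
```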
Back to our first example where we wanted to match text between inverted
commas. The regex was „(.*)'(.*)'.*“ and the text „The abbreviation
'ISP' stands for 'Internet Service Provider'.“ Let’s alter the Regex to
„(.*?)'(.*?)'.*“
Both grouped elements need a question mark, otherwise „ISP' stands for
'Internet Service Provider“ would be stored in the second subpattern. Adding
a question mark to the second element alone wouldn’t help very much
because the first (.*) remains greedy.
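All three combinations of question marks can be compared in one go. A sketch, assuming Python's re module and plain apostrophes as inverted commas; the variable names are mine.

```python
import re

text = "The abbreviation 'ISP' stands for 'Internet Service Provider'."

both_lazy = re.match(r"(.*?)'(.*?)'.*", text).group(2)
first_lazy_only = re.match(r"(.*?)'(.*)'.*", text).group(2)
second_lazy_only = re.match(r"(.*)'(.*?)'.*", text).group(2)

print(both_lazy)         # ISP
print(first_lazy_only)   # ISP' stands for 'Internet Service Provider
print(second_lazy_only)  # Internet Service Provider
```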
Overview and Summary
This was quite a difficult section. Not only for you to read and understand.
No, it was even difficult to write, and I
hope you got some idea from it. This section covers one of the basic elements of
regexian that you will need in every Regex.
The following elements were presented:
- Characters that repeat preceding characters are called quantifiers:
+ the preceding character must appear at least once
? the preceding character may appear once or never
* the preceding character may appear any number of times or never
- There are quantifiers that allow you to define exact ranges for the frequency of the preceding character:
{x,y} the preceding character has to appear at least x-times but not more than y-times. One may omit parts of the range: {x,} stands for at least x-times with no maximum. {x} means exactly x-times.
- Parentheses are used to group multiple character sequences into patterns so that we can apply quantifiers to them. „(ab)+“ means that the combination of ‚ab‘ has to appear at least once to be matched.
- Patterns in parentheses are stored in variables for further use. These variables are called subpatterns in TB. In the case of multiple parentheses where groups are grouped, the outer subpattern contains all inner subpatterns. Furthermore, the first opening round bracket creates the first subpattern, the second defines the second subpattern and so on.
- Quantifiers with no upper limit may be greedy in some search patterns. + and * after a dot make the regex take in as much as it can to lead to a successful match. (.*)(.*) will include the whole match in subpattern 1 and nothing in subpattern 2.
- A greedy pattern can be made ungreedy by adding a question mark to it: (.*?). It then starts with as few characters as possible and takes one more character at a time until the minimum number of characters that constitute a successful match is reached.
Exercises
1. The last regex we created for searching European format dates was:
„(\d{2}\.){2}\d{4}“ It wasn’t perfect because it didn’t allow single
digit days or months nor two digit years to be matched (D.M.YY or any
other combination). That’s worth making into an exercise, isn’t it?
2. You’ve got the solution for question 1? Ok, that solution is quite
interesting but now we can try to write an improved Regex for matching
European formatted dates.
If possible we would like to allow only combinations of digits that look
like a terrestrial date. Well, we do not want to exaggerate: it’s ok if
the Regex matches February, 29th (29.02.) even if it isn’t a leap year
;-). The only important points are: it should be in the format DD.MM.YYYY
or D.M.YY or any combination and it should be restricted to dates that
exist.
3. Imagine you receive bug-reports via an on-line system. The reports
are standardized and all have the same format (more or less). We need a
regex that extracts the more important information. The reports look
like:
Sender: firstname.lastname@agency.com
Date: DD.MM.YYYY
Report-no.: xyz123
Please try to define a regex that extracts the following parts into subpatterns: first name, last name, agency, date, report-no.
4. Write a regex that matches the time in the form hh:mm:ss. Make sure that only valid combinations are returned.
1. Example
„\d{1,2}\.\d{1,2}\.(\d{4}|\d{2})“
You created something else? Doesn’t matter, it may be a correct solution: there is often more than one way to do it!
„(\d?\d\.){2}(\d{4}|\d{2})“ is in my opinion an elegant solution. A not
so good idea is something like „\d{2,4}“ for matching the year: it
allows three digit years.
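The two solutions and the not-so-good idea can be compared with a short sketch. Python's re module is assumed; the variable names and the test dates are made up.

```python
import re

plain = re.compile(r"\d{1,2}\.\d{1,2}\.(\d{4}|\d{2})")
elegant = re.compile(r"(\d?\d\.){2}(\d{4}|\d{2})")
sloppy = re.compile(r"\d{1,2}\.\d{1,2}\.\d{2,4}")  # year written as \d{2,4}

dates = ["1.2.03", "01.02.2003", "1.12.2003", "1.2.200"]
plain_hits = [d for d in dates if plain.fullmatch(d)]
elegant_hits = [d for d in dates if elegant.fullmatch(d)]
sloppy_hits = [d for d in dates if sloppy.fullmatch(d)]

print(plain_hits)    # ['1.2.03', '01.02.2003', '1.12.2003']
print(elegant_hits)  # ['1.2.03', '01.02.2003', '1.12.2003']
print(sloppy_hits)   # all four, including the three digit year
```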
2. Example
This is a bit tricky. In these cases I like to divide the problem into smaller chunks. Which days are possible:
a) 01-09, the leading zero could be missing.
b) 10-29, all months of a year have at least 29 days. Ok, there is one
error we are allowed to make: February only has 29 days in leap years.
We will assume this is ok, otherwise it might be almost impossible to
create the Regex.
c) 30, all months except February
d) 31, only January, March, May, July, August, October, December.
Possible numbers for months are 01-09 (the leading zero might be
missing) and 10, 11, 12. We want to allow two or four digit years. In case
of four digit years we only accept those that start with 19 or 20.
Ok, now we have what we need. Let’s start:
Case a) and b) combined with the allowed months gives us:
„(0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\.“
Case c) with all possible months:
„30\.((0?[13-9])|(1[0-2]))\.“
And finally case d) with possible months:
„31\.(0?[13578]|1[02])\.“
Now the years:
„(\d{2}|(19|20)\d{2})“
The first three parts have to be alternatives whereas the pattern for
years is mandatory. To prevent the Regex from matching within a longer
sequence of digits and finding something that only looks like a date, we
enclose the whole Regex in \b metacharacters. That should give
„\b(((0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\.)|
(30\.((0?[13-9])|(1[0-2]))\.)|(31\.(0?[13578]|1[02])\.))
(\d{2}|(19|20)\d{2})\b“
[Note: the regex is wrapped due to layout reasons. All must be used as a single long line!]
Incredible: that’s a cracker! You found something different? Even
something better? Well, I think that is ‚normal‘. You can always write a
regex in another way to give the same result. And of course: you can
improve almost every Regex. My Regex only shows one way to approach the
problem: the way I like to do it. I hope you were able to follow my
thinking.
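Because such a long regex is hard to read, it can help to assemble it from the chunks. The following is a sketch of that idea, assuming Python's re module; the variable names and the helper is_date are hypothetical, and the test dates are invented.

```python
import re

# The chunks from the solution above
days_1_to_29 = r"(0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\."
day_30 = r"30\.((0?[13-9])|(1[0-2]))\."
day_31 = r"31\.(0?[13578]|1[02])\."
year = r"(\d{2}|(19|20)\d{2})"

# Three alternatives for day and month, then the mandatory year
date_re = re.compile(
    r"\b((" + days_1_to_29 + ")|(" + day_30 + ")|(" + day_31 + "))" + year + r"\b"
)

def is_date(text):
    """True if the whole text is accepted as a date."""
    return date_re.fullmatch(text) is not None

accepted = ["1.1.2000", "29.02.2001", "31.12.99", "30.11.1999"]
rejected = ["30.02.2001", "31.04.2020", "32.01.2000", "01.13.2000", "01.01.1800"]

print(all(is_date(d) for d in accepted))      # True
print(any(is_date(d) for d in rejected))      # False
```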
3. Example
This is not very difficult. Again, we divide the whole problem into chunks:
First name and last name can be extracted from the mail-address. „Sender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*“ should be sufficient.
The question mark in the second subpattern might be redundant because
the @-character follows anyway. But it won’t hurt anyone, will it?
Date: phew, we are in luck. The format is mandatory. We don’t have to use the killer regex of problem 2 ;-):
„Date:\s*((\d{1,2}\.){2}\d{4})\s*“
And now the report number:
„Report-no\.:\s*(.*)“
The dot in ‚Report-no.‘ is escaped so that it matches only a literal dot and not any character.
To make sure that the regex checks the whole string we add \A at the beginning and \Z at the end.
„\ASender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*Date:\s*
((\d{1,2}\.){2}\d{4})\s*Report-no\.:\s*(.*)\Z“
[Note: the regex is wrapped due to layout reasons. All must be used as a single long line!]
Subpattern 1,2,3,4 and 6 will contain the information we wanted.
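Here is a sketch of the extraction, assuming Python's re module. The report text is an invented sample in the format of the exercise, and the dot in ‚Report-no.‘ is escaped in the code so that it only matches a literal dot.

```python
import re

report = (
    "Sender: firstname.lastname@agency.com\n"
    "Date: 23.05.2002\n"
    "Report-no.: xyz123"
)

pattern = re.compile(
    r"\ASender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*"
    r"Date:\s*((\d{1,2}\.){2}\d{4})\s*"
    r"Report-no\.:\s*(.*)\Z"
)

m = pattern.match(report)
# Subpattern 5 is the repeated inner group of the date and is not needed
wanted = (m.group(1), m.group(2), m.group(3), m.group(4), m.group(6))
print(wanted)
# ('firstname', 'lastname', 'agency', '23.05.2002', 'xyz123')
```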
4. Example
I think we have already had some practice at dividing bigger problems
into smaller ones. The time-problem is another one. It should be mere
routine now. And it is much easier than it looks at first sight,
because the format is fixed!
Hours are from 00 to 19 and 20 to 23 (24 equals 00!!):
„([01][0-9]|2[0-3]):“
Minutes and seconds have the same format and the same combinations of digits, 00 to 59:
„[0-5][0-9]:[0-5][0-9]“
Altogether, enclosed by word boundary (\b) metacharacters:
„\b([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\b“
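A last sketch to try the time regex, assuming Python's re module; the list of candidates and the sample sentence are made up.

```python
import re

time_re = re.compile(r"\b([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\b")

candidates = ["00:00:00", "23:59:59", "24:00:00", "12:60:00", "7:05:09"]
valid_times = [t for t in candidates if time_re.fullmatch(t)]
print(valid_times)  # ['00:00:00', '23:59:59']

# The word boundaries let the regex pick a time out of running text
in_text = time_re.search("Backup started at 07:15:30 sharp.").group()
print(in_text)  # 07:15:30
```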