Options, Assertion, conditional Regex, Backreference
Did you think chapter 6 would be more complicated than chapter 5? No, calm down: chapter 6 will deal with some elements that may be a bit more complex but on the whole these elements are not too difficult to learn.
Assertion
When I wrote the German version of this chapter I had to start with a
question: what does assertion mean? The question came up because there
is no German word for this piece of regular expression terminology: it
is simply called ‚assertion‘. Even a look into Friedl’s book didn’t
help: he doesn’t use this expression. He calls this element of regexian
„lookahead“. Ok then, let’s look at what an assertion can do for us in
regular expressions.
An assertion can be used to find out whether characters precede or
follow the matched part of a string. Well, that’s nothing new. We could
do that without an assertion. But: the assertion checks the characters
without including them in the match: it does not „eat“ the characters.
Let us explain this with an example:
We want to match the string ‚foo‘ within a string only if it is followed
by ‚bar‘. But we do not want ‚bar‘ to be part of the match! Without an
assertion we would have used „foobar“, but this regex matches ‚foobar‘.
If we use the assertion „(?=“ the regex looks like „foo(?=bar)“. The match
now is ‚foo‘.
Of course there are negative look-aheads or assertions „(?!“ too.
Ok, let’s try this one on our more or less senseless example:
„foo(?!bar)“ means that ‚foo‘ will be matched except when ‚bar‘ follows
but only ‚foo‘ is the resulting matched string. So it matches ‚foo‘
within ‚foolish‘ while ‚foobar‘ doesn’t match at all.
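If you want to try the two look-aheads outside the regex tester, here is a minimal sketch. It assumes Python and its `re` module, which is not what TB uses, but which treats these two patterns the same way. The variable names are made up for the illustration.

```python
import re

# Positive look-ahead: 'foo' only if 'bar' follows; 'bar' is not consumed.
pos = re.search(r"foo(?=bar)", "foobar")

# Negative look-ahead: 'foo' only if 'bar' does NOT follow.
neg_hit = re.search(r"foo(?!bar)", "foolish")
neg_miss = re.search(r"foo(?!bar)", "foobar")

print(pos.group(), pos.end())  # foo 3
print(neg_hit.group())         # foo
print(neg_miss)                # None
```

The end position 3 shows that the match stops right after ‚foo‘: the assertion has looked at ‚bar‘ without eating it.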
But be careful, there is a trap: one might think that „(?!foo)bar“ only
matches ‚bar‘ if ‚foo‘ does not precede it. Wrong!! This regex matches
every ‚bar‘ without exception. The assertion is a look-ahead: at the
point where ‚bar‘ is about to be matched it checks that the following
characters are not ‚foo‘. But the following characters are ‚bar‘, so
they can never be ‚foo‘, and the assertion always succeeds. Ok? Did you
understand this?
What we need is a look-behind assertion! And, as if by magic, here is
one: „(?<=“ is a positive look-behind assertion and „(?<!“ is the
negative version.
Example:
„(?<!foo)bar“ matches ‚bar‘ if ‚foo‘ does not precede it
„(?<=foo)bar“ matches ‚bar‘ only if ‚foo‘ precedes it.
You can use assertions in alternatives, e.g.: „(?<=proba|possi)bility“
So, ‚bility‘ is matched if either ‚proba‘ or ‚possi‘ precedes this. But
it would match ‚bility‘ as well if ‚impossi‘ preceded the string. But I
think you expected that ;-))
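The trap and the look-behind versions can be compared side by side. This is only a sketch under the assumption that Python's `re` module is used instead of TB's regex machine; the test words are my own.

```python
import re

# The trap: a look-ahead placed in front of 'bar' never sees what precedes 'bar'.
trap = re.search(r"(?!foo)bar", "foobar")

# Look-behind assertions check the characters in front of the match.
not_after_foo = re.search(r"(?<!foo)bar", "foobar")
after_foo = re.search(r"(?<=foo)bar", "foobar")

# Alternatives inside a look-behind (both branches are five characters long).
words = "probability possibility impossibility ability"
alt = re.findall(r"(?<=proba|possi)bility", words)

print(trap.group())     # bar
print(not_after_foo)    # None
print(after_foo.group())  # bar
print(len(alt))         # 3
```

The count of 3 confirms that ‚impossibility‘ is matched as well, while ‚ability‘ is not.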
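There is a restriction to the search patterns allowed in look-behind
assertions: the length of what they match must be fixed, which means you
can’t use quantifiers of variable length such as „+“ or „*“. A fixed
count like „\d{2}“ is fine. „(?<=\d+,)\d\d“ would produce a syntax error.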
The branches of an alternative in a look-behind assertion may have different lengths, but each length has to be fixed!
Furthermore, branches of different (but still fixed) lengths are only
permitted at the top level of the assertion. „What’s that?“ Have a look at the example:
„(?<=ab(c|de))“ is not permitted, „(?<=abc|abde)“ is permitted.
The second opening parenthesis makes us leave the top level, causing the
assertion to become unpredictable for the regex machine.
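The quantifier restriction can be observed in other regex machines too. The sketch below assumes Python's `re` module, which rejects the pattern when it is compiled. Be aware that Python is stricter than what is described here: it wants every branch of a look-behind to have the same length, so the branch examples above are not reproduced.

```python
import re

try:
    # A quantifier of unknown length inside a look-behind is rejected.
    re.compile(r"(?<=\d+,)\d\d")
    error = None
except re.error as exc:
    error = str(exc)

# A fixed repeat count has a known length and is accepted.
fixed = re.search(r"(?<=\d{3},)\d\d", "123,45")

print(error is not None)  # True
print(fixed.group())      # 45
```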
You may use assertions in a sequence and you may nest them:
Example: „(?<=\d{2}\.\d{2})(?<!00\.00)\s+Payment“ matches
‚Payment‘ (together with the whitespace in front of it) if it is preceded by an amount, as long as that amount does not end in ‚00.00‘.
Or: „(?<=(?<!im)possi)bility“ matches ‚bility‘ only if preceded by ‚possi‘. But it won’t match if preceded by ‚impossi‘.
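Both patterns can be tried as they are in Python's `re` module (an assumption of this sketch; TB's regex tester should behave the same for these two). The sample strings are invented.

```python
import re

# Two assertions in a sequence: an amount must precede, but not 00.00.
seq = r"(?<=\d{2}\.\d{2})(?<!00\.00)\s+Payment"
ok = re.search(seq, "12.50 Payment")
zero = re.search(seq, "00.00 Payment")

# A nested assertion: 'possi' must precede, but 'possi' must not follow 'im'.
nested = r"(?<=(?<!im)possi)bility"
plain = re.search(nested, "possibility")
negated = re.search(nested, "impossibility")

print(repr(ok.group()))  # ' Payment'
print(zero)              # None
print(plain.group())     # bility
print(negated)           # None
```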
Backreference
In an earlier chapter we learned something about subpatterns. These were
characters that were stored in some kind of variable. I wrote „… are
stored in a temporary variable for further use…“. Well, now we will see
what ‚further use‘ means:
Let’s take a regex to explain what I mean: „(sens|respons)e and \1ibility“
This regex first looks for either ‚sense‘ or ‚response‘. Whatever the
alternative matches (‚sens‘ or ‚respons‘) is stored in the first
subpattern (you remember – the first opening parenthesis?). Then it has
to be followed by ‚ and ‘. Next comes „\1“. This means: use the content
of the first subpattern as part of the search pattern. Because this is
followed by ‚ibility‘ the regex then matches either ‚sensibility‘ or
‚responsibility‘, depending on what was found at the beginning of the
string. So ‚sense and sensibility‘ or ‚response and responsibility‘
would have a successful match but never ‚sense and responsibility‘.
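Here is the same regex as a small Python sketch. Python's `re` module is assumed only for the sake of a runnable example; the pattern itself is the one from the text.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

first = pattern.search("sense and sensibility")
second = pattern.search("response and responsibility")
mixed = pattern.search("sense and responsibility")

print(first.group(1))   # sens
print(second.group(1))  # respons
print(mixed)            # None
```

`group(1)` shows what the first subpattern has stored, and that is exactly what „\1“ demands a second time.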
Some restrictions:
A backreference must not appear inside the subpattern it refers to:
„(a\1)“ would never give a positive match. On the other hand, if the
subpattern is repeated by a quantifier, it is allowed, because each
repetition can refer to what the previous one matched: „(da|de\1)+“ This
matches ‚dadadada‘ or ‚dadeda‘ or ‚dadedadadada‘.
Conditional Regular Expressions
This is an element that is not very common: conditional regular expressions.
The principle is: „If pattern A is found, look for pattern B; if not
then look for pattern C“.
The correct syntax is „(?(condition)Yes-Pattern|No-Pattern)“ or „(?(condition)Yes-Pattern)“
But there is a restriction to the condition pattern: it has to be either a sequence of digits or an assertion.
What is it good for? Let’s assume we have to extract a date from some
text. But for whatever reason it could be a European DD.MM.YYYY or
English DD, MMM YYYY formatted date. We only know that at the beginning
of the line there is either ‚Datum‘ for the European (German) version or
‚Date‘ for the English one and the date terminates the line.
What we want is a regex that matches the English formatted date if it is
preceded by ‚Date‘, otherwise it should match the European formatted
one:
„(?(?=^Date)Date:\s(\d+),\s([A-Za-z]{3})\s(\d{4})$|
Datum:\s(\d{2}\.)(\d{2}\.)(\d{4})$)“
[Note: the regex is wrapped due to layout reasons. All must be used as a single long line!]
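Not every regex machine accepts an assertion as a condition. Python's `re` module, for instance, only takes a subpattern number or name there. The sketch below therefore does not use the conditional syntax at all. It emulates it with the general rewrite „(?=condition)Yes-Pattern|(?!condition)No-Pattern“, which is my substitution and not the syntax shown above. The sample dates are invented.

```python
import re

# Emulation of (?(?=^Date)english|german) without the conditional syntax:
# (?=cond)yes-pattern | (?!cond)no-pattern
pattern = re.compile(
    r"(?=^Date)Date:\s(\d+),\s([A-Za-z]{3})\s(\d{4})$"
    r"|(?!^Date)Datum:\s(\d{2}\.)(\d{2}\.)(\d{4})$"
)

english = pattern.search("Date: 24, Dec 2004")
german = pattern.search("Datum: 24.12.2004")

print(english.groups())  # ('24', 'Dec', '2004', None, None, None)
print(german.groups())   # (None, None, None, '24.', '12.', '2004')
```

Whichever branch was taken fills its own subpatterns; the subpatterns of the other branch stay empty.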
At the beginning of this section I told you that the condition has to be
either an assertion or a sequence of digits. These digits are the number
of a subpattern, just like in a backreference. The condition doesn’t
literally search for digits; it is true if the subpattern with that
number has already matched. So what does it all mean?
Example: we receive mails with ‚Name‘ in the first line followed by a
value that is that name. The name may change between mails, but we know,
that the name will occur again within the mail, and when it does it is
related to an attribute we want to extract; let’s call it the shoe size.
„Name:\s*(.*)?$“ will find the name. This is followed by something we
are not interested in. But then the name appears again followed by a
colon and the shoe size, which are digits:
„.*?(?(1):\s*(\d+))“ Both parts combined:
„Name:\s*(.*)?$.*?(?(1):\s*(\d+))“
Have a try with the following text:
‚Name: James Herriot
Bladibla
James Herriot: 9‘
(Note: if you use the regex tester you have to switch on the Singleline option. We will learn about that in a while)
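To see a numbered condition in isolation, here is a deliberately tiny sketch. It is a hypothetical example of my own (a number that may be wrapped in parentheses), not the shoe size regex, and it assumes Python's `re` module, which supports the „(?(1)…)“ form.

```python
import re

# Subpattern 1 is an optional opening parenthesis. The condition (1) asks
# whether it took part in the match; only then a closing one is required.
pattern = re.compile(r"(\()?\d+(?(1)\))")

def accepted(text):
    """Return True if the whole text fits the pattern."""
    return pattern.fullmatch(text) is not None

print(accepted("(12)"))  # True
print(accepted("12"))    # True
print(accepted("(12"))   # False
print(accepted("12)"))   # False
```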
Options, Modifier
We’ve had quite a lot of new vocabulary to learn in regexian. Now let’s learn
something about modifiers. What do they do? Well, they modify
something. But what? Elements of regexian are modified by modifiers. I
can hear you shout: „Oh no; after I’ve learned all that about regex and
so many different elements“. Don’t worry: I am not going to explain all
the possible modifiers or options; we will restrict ourselves to those
that are essential and most important.
Look at the following options:
[Note: there are more options and modifiers. Please read the German
version of the tutorial for non-TB-Users for more information. Sorry, no
translation is available at the moment, but I’m sure I will do this if
requested ;-)]
i for Caseless
The regex machine is forced to ignore the case of letters. It will
search for the pattern without distinguishing between upper and lower case.
m for Multi-line
Without this option the regex machine takes the string as a single line, no matter
whether there are any newline characters (\n) in it. The circumflex „^“ that
indicates the beginning of a line then only matches at the beginning of the
string, and the dollar „$“ only matches at the end of the string (or just
before a terminating newline at the very end), not at the end of every line.
Go ahead, test it with the regex tester: uncheck the Multi-line option. Then enter the following wrapped text:
‚This is a test that
has several
lines with a test
at the end of a line‘
and the regex „test$“. Nothing will be matched. Switch multi-line on and ‚test‘ will be matched.
When this option is active the regex machine will indeed recognize each
newline character; the text now consists of multiple lines. This is
important when we are going to check the entire text of a message in one
hit.
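The test from above also works without the regex tester. This sketch assumes Python's `re` module, where the options are switched on with the same letters at the start of the pattern.

```python
import re

text = "This is a test that\nhas several\nlines with a test\nat the end of a line"

single = re.search(r"test$", text)       # '$' only at the end of the string
multi = re.search(r"(?m)test$", text)    # '$' also at the end of every line
both = re.search(r"(?im)TEST$", text)    # caseless and multi-line combined

print(single)         # None
print(multi.group())  # test
print(both.group())   # test
```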
s for DotAll
As we learned in one of the first chapters, the dot matches any
character other than the newline character. Once this option is set, the
dot matches newlines as well.
But this is not actually the whole truth: regardless of this option, the
newline is always matched by negated character classes that do not
include the newline, e.g.: „[^x]“ matches everything except the
character ‚x‘: that includes any newline.
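A short sketch of the dot and the negated class, assuming Python's `re` module (which handles both the same way as described here):

```python
import re

no_s = re.search(r"a.b", "a\nb")         # the dot stops at the newline
with_s = re.search(r"(?s)a.b", "a\nb")   # DotAll: the dot matches it
negated = re.search(r"a[^x]b", "a\nb")   # a negated class matches it anyway

print(no_s)                      # None
print(with_s is not None)        # True
print(negated is not None)       # True
```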
x for Extended
When this option is enabled the regex machine ignores any whitespace
character in a search pattern, except inside a character class. Thus you are now able to include remarks
in the regex: a remark starts with an unescaped #-character and runs to the end of the line.
To search for whitespaces in this mode you have to escape them „\ “ or you use „\s“.
Furthermore you can define your own character class that searches whitespaces
e.g.: „[ ]“
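The extended mode is easiest to understand with a pattern spread over several lines. The following is a sketch in Python's `re` module (my assumption; TB's machine is a different one) with an invented test string.

```python
import re

pattern = re.compile(r"""(?x)
    \d+      # one or more digits
    \        # an escaped blank is a literal blank
    [a-z]+   # a word in lower case
""")

match = pattern.fullmatch("42 apples")
spaced = re.fullmatch(r"(?x) a [ ] b", "a b")   # blank inside a class counts
squeezed = re.fullmatch(r"(?x) a b", "ab")      # blanks outside are ignored

print(match is not None)     # True
print(spaced is not None)    # True
print(squeezed is not None)  # True
```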
If you have a look in the regex tester’s options menu you will find some
more options or modifiers. I don’t want to explain them all. There are
some special options that are explained in books or other tutorials.
They are not really necessary for a basic understanding of regular
expressions. The four I explained above will be useful to you and
sufficient for most purposes.
How do we switch them on? That is easy: you just enter the letter that
indicates the option in parentheses in which a question mark precedes
the letter: „(?“ and „)“. E.g.: „(?i)“ switches on ‚ignore case‘. You
may combine several options, for example: „(?im)“ means ‚Caseless,
Multi-line‘. Furthermore you may switch options on or off:
„(?im-sx)“ switches caseless and multi-line on and DotAll and extended off.
Letters in front of the „-“-character switch their option on, letters
after it switch their option off. You may use the options anywhere in the regex. They may
appear at the beginning as well as in the middle.
„(?i)Test“ is the same as „Te(?i)st“. In the case where an option is
switched on more than once in the top-level part of the regex, then the
machine will use the setting that comes last in the search pattern.
Although you may enter the options at any point in the regex I recommend
that you do it at the beginning.
Well, there is no rule without an exception: if an option appears within
a subpattern it will only apply to the subpattern: „(a(?i)b)c“ matches
‚abc‘ as well as ‚aBc‘.
Something to think about:
„(a(?i)b|c)“ is the regex. Does it match a ‚C‘ or only a ‚c‘? It is obvious that it matches ‚aB’….
Specials
Ok, let’s finish this chapter with some special elements and rules that
will cross our path in TB only every now and then. I don’t want to go
into details; this chapter is more like a glossary to look up if a regex
„behaves“ oddly.
Meta-Characters
In the first chapter of this tutorial I mentioned the meta characters
and listed ] and }. These two aren’t actually meta characters. You may
recall that I asked you to assume they were. If you searched for them as
literals you wouldn’t need to escape them. But, I always escape them to
keep my regex easy to understand. I avoid errors caused by
misunderstandings, which I will elaborate here:
Within a character class defined by „[“ at the beginning only the following characters are meta characters:
\ to escape
^ to negate a character class but only if the circumflex is the first character to appear within the class
- to indicate a range
] to terminate the character class definition
Ok, now let’s have a closer look at the following more or less senseless regex:
„[Y-]345]“ I wanted to define a range within the class that includes ‚Y‘
to ‚]‘ and the digits 3, 4 and 5. But what happens? Does the regex match
‚Z34‘ or parts of it? No!
Instead try ‚Y345]‘ or ‚-345]‘. And here we are, it is matched. The only problem is… that is not what we wanted.
Ok, I am going to explain what happened: the first closing square bracket
is interpreted as the end of the character class. The regex matches
strings beginning with ‚Y‘ or ‚-‘ followed by ‚345]‘. Yes, „[Y-\]345]“
is the correct solution. What have we learnt? Although „]“ is not a
metacharacter outside a character class it’s a good idea to escape it
every time one searches for it as a literal.
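The difference between the two versions is easy to check. The sketch assumes Python's `re` module, which reads both patterns the way described above.

```python
import re

unescaped = r"[Y-]345]"   # class [Y-] followed by the literal text '345]'
escaped = r"[Y-\]345]"    # one class: the range Y to ] plus 3, 4 and 5

a = re.fullmatch(unescaped, "Y345]")
b = re.fullmatch(unescaped, "-345]")
c = re.search(unescaped, "Z34")
d = re.findall(escaped, "Z34")

print(a is not None, b is not None)  # True True
print(c)                             # None
print(d)                             # ['Z', '3', '4']
```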
Square brackets
Let’s have a closer look at square brackets and special cases. Assuming
that caseless is switched on then „[aeiou]“ will match ‚A‘ as well as
‚a‘. But „[^aeiou]“ will match ‚A‘ only when caseless is switched off.
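A quick check of this special case, sketched with Python's `re` module (assumed here to behave like TB's machine in this respect):

```python
import re

vowel_i = re.search(r"(?i)[aeiou]", "A")       # caseless: 'A' counts as 'a'
not_vowel_i = re.search(r"(?i)[^aeiou]", "A")  # caseless: 'A' is excluded too
not_vowel = re.search(r"[^aeiou]", "A")        # case matters: 'A' is no 'a'

print(vowel_i is not None)    # True
print(not_vowel_i)            # None
print(not_vowel is not None)  # True
```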
Numbers and Digits
We use \d to try to find decimal digits. Of course it is possible to
look for characters using hexadecimal or octal character codes. The
regex „\x09“ matches the character with the hexadecimal code 09.
Octal numbers are a bit more difficult. The syntax itself is quite easy:
„\ddd“, where each d is an octal digit. The regex searches for the character
with an octal code of ‚ddd‘.
Or, and now it gets a bit tricky, for a backreference. Outside a
character class the regex machine takes any number lower than 10 as a
backreference. A larger number is taken as a backreference only if
there are enough parentheses to define the corresponding subpattern.
Otherwise, and always inside a character class, the number is taken as
an octal code.
Examples:
\040 is octal ‚space‘
\40 is octal unless there are enough (at least 40) parentheses defining subpatterns
\6 is always a backreference
\11 could either be a backreference or a ‚tab‘
\011 is always a ‚tab‘
\113 is always octal, because there are no more than 99 backreferences allowed
And what is \0113??
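The unambiguous cases can be tried in a few lines. This sketch assumes Python's `re` module; its rules for the ambiguous numbers may differ from the list above, so only hex codes, octal codes with a leading zero and a single-digit backreference are shown.

```python
import re

tab_hex = re.fullmatch(r"\x09", "\t")     # hexadecimal code 09 is the tab
tab_oct = re.fullmatch(r"\011", "\t")     # octal 011 is the tab as well
space_oct = re.fullmatch(r"\040", " ")    # octal 040 is the space
backref = re.fullmatch(r"(a)\1", "aa")    # a single digit: backreference

print(tab_hex is not None)    # True
print(tab_oct is not None)    # True
print(space_oct is not None)  # True
print(backref is not None)    # True
```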
Restrictions when using Regex
This is just to inform you about restrictions that we have to bear in
mind when using regular expressions. You and I, as ‚normal‘ users, won’t
reach these limits of regexian, but a Regex must not exceed 65535 bytes.
There are no more than 99 subpatterns allowed. The total number of
elements – like groups, assertions, options and conditionals – must not
exceed 200. Furthermore the length of the entire text that is checked
for the pattern is restricted as well, but we won’t reach this limit in
TB. It is restricted to the value of the system’s largest positive
integer. Because the Regex machine needs to reserve storage for
subpatterns and for quantifiers with undefined length due to recursive
processing the maximum length available will be reduced. But, to be
honest, this is not really of interest to TB-Users 😉
Overview and Summary
This chapter explained some of the special features of regular expressions.
We should know enough about regex by now to be able to use them in TB’s
macros.
Short summary:
- an assertion can check whether characters appear before or after a
search pattern without including these characters in the match. But
remember: you are not allowed to use quantifiers of variable length in look-behind assertions – that would
make them unpredictable to the regex machine. The assertions are:
- positive lookahead-assertion (?=
- negative lookahead-assertion (?!
- positive lookbehind-assertion (?<=
- negative lookbehind-assertion (?<!
- Strings that are matched as subpatterns are available for further use within the same regex. These backreferences can be addressed by „\#“ where # is a positional number indicating the parenthesis pair that defines the subpattern
- Assertions or Backreferences may be used to create conditional regex „(?(condition pattern)yes-pattern|no-pattern)“
- The vocabulary of regexian can be modified using modifiers or
options. They may precede the Regex in parentheses „(?modifier)“. We
discussed a few of them here:
- i for Caseless
- m for Multi-line
- s for DotAll
- x for Extended
- We learned that some characters become metacharacters in character classes and behave differently
- A regex can search for hexadecimal or octal character codes. Remember though that there is a possible conflict with backreference numbers.
- The length of a regex is limited because the result has a limited length. Even the text to which a Regex is applied must not exceed a certain length. This should only be of minor interest when using regex in TB
Exercises
1. Try to define a regex that recognizes doubled words (e.g.: ‚the the‘).
Don’t forget that the second word may appear at the beginning of the
next line. The Regex should only match words and not parts of words (not
‚the theme‘) and it should ignore case.
2. Ok, now let’s try to write a simple version of a subject cleaning
regex. ‚Re‘ can be followed by anything and a colon. Then the original
subject appears. After that a space could follow and a former subject
enveloped in parentheses introduced with ‚was:‘ follows. We would like
to extract the original subject. This is a very simple version of a
cleaner.
3. We receive mail with order amounts of some product. We need the
integer of these amounts (without the decimals). The amounts may be
mailed in EUR or $. Well, the problem is that the symbol for the
decimals is different in both systems. Furthermore, the sign indicating
Thousands is different as well: #,###.##$ or #.###,##EUR. The regex
should know which of the versions it has to match.
1. Example
The most important hints were the word wrapping (multiple lines) and caseless. These are options which we switch on: „(?im)“.
Finding words should be easy: „[a-z]+“ We only look for words without
digits. We do not have to define capital letters because we already
switched „caseless“ on. But we only want to search for whole words: so
we have to use \b in front of the pattern. We can’t use \b at the end of
the word, because it is okay to end a sentence with a word and start
the next sentence with the same word. So we need to allow at least one
whitespace to follow. „(?im)\b[a-z]+\s+“ This is the beginning: we still
need something to find the second appearance. We have to store the
result of the first match in a variable to have a backreference. Ok,
second try:
„(?im)(\b[a-z]+)\s+\1“
But, this matches ‚the theme‘, which we didn’t want to have matched. But
this is easy now: after the second appearance nothing but a word
boundary may follow and that’s it: „(?im)(\b[a-z]+)\s+\1\b“
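To check the final regex against a few sentences, here is a sketch that assumes Python's `re` module instead of the regex tester. The sentences are invented.

```python
import re

doubled = re.compile(r"(?im)(\b[a-z]+)\s+\1\b")

same_line = doubled.search("Paris in the the spring")
next_line = doubled.search("It is\nis it?")
mixed_case = doubled.search("The the end")
part_of_word = doubled.search("the theme")

print(same_line.group())        # the the
print(repr(next_line.group()))  # 'is\nis'
print(mixed_case.group())       # The the
print(part_of_word)             # None
```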
2. Example
This is a very simple subject cleaner.
What we defined above as a possible subject should look like ‚Re[2]: proper subject we want ;- (was: old subject)‘
The beginning of the subject can be matched by „^Re(.*?):“ – ‚Re‘ at the
beginning of the line, followed by anything or nothing if there is no
counter. We could have done this with „.*“. But unfortunately this is
greedy and might match more than we want.
The original subject can be found after the colon; to make sure that we
don’t extract redundant whitespace we match them first.
„^Re(.*?):\s*(.*?)“ The original subject is stored in subpattern 2.
Again there is a question mark to avoid greediness. What is missing? Ah,
something that matches the old subject: otherwise it would be included
in subpattern 2.
A whitespace and then an opening round bracket follows. Then, as we
said, it is introduced with ‚was‘. But this old subject could be missing
from time to time so we can’t insist on its appearance:
„\s*(\(was:.*\))*$“
Or as a whole:
„^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$“
We use the „$“-character to make sure the regex reads the whole line. We
can do that because we may expect that the subject is a single line.
Those of you who aren’t sure should use „\Z“ for end of string instead.
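The cleaner can be tried on the sample subject from above. The sketch assumes Python's `re` module; the second subject without a counter and without an old subject is my own addition.

```python
import re

cleaner = re.compile(r"^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$")

full = cleaner.search("Re[2]: proper subject we want ;- (was: old subject)")
plain = cleaner.search("Re: proper subject")

print(full.group(1))   # [2]
print(full.group(2))   # proper subject we want ;-
print(full.group(3))   # (was: old subject)
print(plain.group(2))  # proper subject
print(plain.group(3))  # None
```

Subpattern 2 carries the original subject in both cases; subpattern 3 simply stays empty when there is no old subject.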
3. Example
Ok, this is an exercise that looks quite artificial. But sometimes one
needs stupid examples to clarify something (I remember some exercises in
physics when I studied that assumed „one-dimensional cattle“ or
„weightless Christmas bulbs“. These weren’t much cleverer than my
example ;-))
I agree that this could be done using a different regex but I wanted a conditional one:
First of all we need an assertion that looks for something, two digits
and a dollar sign: „(?=.*\d{2}\$)“ If this exists, it should be the
$-version: „([\d,]+)\.\d{2}\$“ I simplified the problem and defined a
character class that allows only digits and commas. (Well, here a
mismatch is possible when there is an arbitrary string with digits,
several commas then a dot and two digits followed by a dollar sign.
Hmmm, well, ok, you’re so clever? You improve it! *g*)
If there is no $-version the Regex should match the EUR-version:
„([\d\.]+),\d{2}EUR“, which uses the same simplification as above.
The full regex should look like:
„(?(?=.*\d{2}\$)([\d,]+)\.\d{2}\$|([\d\.]+),\d{2}EUR)“. Did you notice
that the EUR-result is stored in the second subpattern while the
$-version is stored in the first one?
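For readers whose regex machine has no assertion conditions (Python's `re` module is one of them), the same decision can be written as two alternatives guarded by a positive and a negative look-ahead. This is a substitute of mine for the conditional, not the regex from the solution. The helper `integer_part` and the sample amounts are hypothetical.

```python
import re

amount = re.compile(
    r"(?=.*\d{2}\$)([\d,]+)\.\d{2}\$"        # $-version, stored in subpattern 1
    r"|(?!.*\d{2}\$)([\d\.]+),\d{2}EUR"      # EUR-version, stored in subpattern 2
)

def integer_part(text):
    """Return the amount without decimals and thousands signs, or None."""
    match = amount.search(text)
    if match is None:
        return None
    if match.group(1) is not None:
        return int(match.group(1).replace(",", ""))
    return int(match.group(2).replace(".", ""))

print(integer_part("1,234.56$"))    # 1234
print(integer_part("1.234,56EUR"))  # 1234
print(integer_part("no amount"))    # None
```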