01.02.07 Final Conclusion

Now let’s try to explain the example that was given in chapter 1.

%QUOTES="%SETPATTREGEXP=""(?is)
(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?
(.*?)(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)""
%REGEXPBLINDMATCH=""%text""%SUBPATT=""3"""

It starts with %QUOTES=. Whatever text the following regex extracts is used as the quoted text.

%SETPATTREGEXP="" defines the regex:

(?is)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?
(.*?)(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)
[Note: the regex is wrapped here for layout reasons only. It must be entered as one single long line!]

You already know why the " characters are doubled: this escapes them so that they are not mistakenly taken as part of another macro (although you also know that there are better ways of writing this).

"(?is)" is the options setting. The i means ignore case; the s means treat the whole text as one single line, that is, let the dot match newline characters as well.
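If you want to see the two options at work outside TB, here is a minimal sketch in Python. Python's re module is only a stand-in for TB's regex engine, and the two-line sample text is made up for the illustration.

```python
import re

text = "begin pgp\nsecond line"

# No flags: "BEGIN" is case-sensitive and the dot stops at the newline.
no_flags = re.search(r"BEGIN.*line", text)

# (?i) alone ignores case, but the dot still cannot cross the newline.
only_i = re.search(r"(?i)BEGIN.*line", text)

# (?is) ignores case and lets the dot match the newline as well.
both = re.search(r"(?is)BEGIN.*line", text)

print(no_flags)               # None
print(only_i)                 # None
print(both.group(0) == text)  # True: the match spans both lines
```

Only the combination of both options finds anything here, which is why the example regex sets them together.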

"(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?"
This opens the first subpattern. The regex says: find five hyphens followed by the string BEGIN PGP SIGNED. This may be followed by any character sequence or by none at all (.*?). Because .* on its own is greedy, it is restrained by a question mark. Next comes a newline (\n).

The next line starts with the string 'Hash:', followed by any character sequence, and ends with a newline again. This is the second subpattern, and it may appear once or not at all. Any amount of whitespace may follow the second subpattern. The closing parenthesis then completes the first subpattern. It too is followed by a question mark, which means that the first subpattern may appear once or not at all.
These lines are created by PGP or GnuPG when a message is clear-signed. The text is standard, and therefore the regex is easy to define. But the author of this macro combination did not want to use it only on PGP-signed messages: he or she also wanted it to work on text that has not been touched by PGP and therefore does not contain these lines.
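The effect of the two question marks can be tried out with a small Python sketch. It uses only the first part of the regex, Python's re module stands in for TB's engine, and both sample messages (including the hash name) are invented.

```python
import re

# First subpattern of the tutorial's regex, including the nested second one.
HEADER = re.compile(r"(?is)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?")

signed = "-----BEGIN PGP SIGNED MESSAGE-----\nHash: SHA1\n\nHello,\n"
plain = "Hello,\n"

signed_match = HEADER.match(signed)
plain_match = HEADER.match(plain)

# Signed mail: subpattern 1 swallows the armour line, the Hash line
# and the empty line after it; subpattern 2 is just the Hash line.
print(repr(signed_match.group(1)))
print(repr(signed_match.group(2)))  # 'Hash: SHA1\n'

# Plain mail: the optional subpattern simply takes no part in the match.
print(plain_match.group(1))  # None
```

In both cases the match succeeds, so whatever comes next in the regex starts right at the first line of the real message.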

"(.*?)"

This is the important third subpattern: the unmodified message text itself. The preceding regex was necessary to locate and isolate this subpattern. The regex just says: "Find anything, no matter what, but don’t be greedy."
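Why "don't be greedy" matters shows up as soon as a cut mark appears twice. The following Python sketch is only an illustration: the sample text is invented and the terminator is reduced to a plain "-- " line.

```python
import re

text = "body line\n-- \nfirst sig\n-- \nsecond sig\n"

# Lazy: stop at the first place where the rest of the pattern fits.
lazy = re.match(r"(?s)(.*?)-- \n", text).group(1)

# Greedy: grab as much as possible and give back only what is needed.
greedy = re.match(r"(?s)(.*)-- \n", text).group(1)

print(repr(lazy))    # 'body line\n'
print(repr(greedy))  # 'body line\n-- \nfirst sig\n'
```

The lazy version ends the body at the first separator, which is what you want for quoting; the greedy version would drag the first signature along.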

Now the alternation starts:

"(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)"

Subpattern 4 opens with "(^", which looks for the beginning of a line: whatever this first alternative goes on to define has to sit at the beginning of a line. Then subpattern 5 follows:
"(- --|--\n|-----BEGIN PGP SIGNATURE)"

It consists of three alternatives:
"- --", "--\n" or "-----BEGIN PGP SIGNATURE"

The first alternative is well known once you have seen a clear-signed PGP message: it is the modified signature separator, which PGP marks with an extra hyphen and space to show where it inserted its own lines. Quite unfortunate really, but we won’t discuss it here. Let’s just take it as it is.

The second alternative is the original signature separator. It is the one that will be found if the text has had no contact with PGP. Actually, it is not quite right as written, because the proper cut mark is dash-dash-space-newline, so this regex should be:
"(^(- --|--\s\n|-----BEGIN PGP SIGNATURE)|\z)"
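The difference between the two spellings can be checked with a short Python sketch. Two things are assumptions on my side: Python's re module stands in for TB's engine, and Python needs the extra m option before "^" matches at the start of every line. The sample mail is invented.

```python
import re

# The line-start alternatives, as written in the example and as corrected.
ORIGINAL = re.compile(r"(?ism)^(- --|--\n|-----BEGIN PGP SIGNATURE)")
CORRECTED = re.compile(r"(?ism)^(- --|--\s\n|-----BEGIN PGP SIGNATURE)")

# A plain mail with the proper cut mark: dash, dash, space, newline.
mail = "Hello,\n\nsee you tomorrow.\n-- \nJane\n"

original_hit = ORIGINAL.search(mail)
corrected_hit = CORRECTED.search(mail)

print(original_hit)                    # None: "--\n" does not allow the space
print(repr(corrected_hit.group(1)))    # '-- \n'
print(mail[:corrected_hit.start()] == "Hello,\n\nsee you tomorrow.\n")  # True
```

With the original spelling a correct separator slips through and the signature would end up in the quoted text.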

The third alternative is necessary to look for lines that contain the PGP-created hash (OK, OK, it is only a part of the hash, but this is a regex tutorial and not a PGP tutorial). This is the end of subpattern 5’s definition.

The second alternative of subpattern 4, "\z", searches for the end of the string, as a counterpart to the first alternative’s search for the beginning of a line. Therefore there doesn’t have to be a signature separator or a PGP hash at all: the mail just has to end somewhere…
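Here is a minimal Python sketch of that fallback, using only the body capture and the terminator. It rests on a few assumptions: Python's re module stands in for TB's engine, the m option is added so that "^" matches at line starts, and "\Z" is used because that is how Python spells the end-of-string anchor. Note that the group numbers are shifted: in this shortened regex the body is group 1, not 3.

```python
import re

# Body capture plus the terminator from the example regex.
# Group 1 = body, group 2 = terminator, group 3 = line-start alternatives.
BODY = re.compile(r"(?ism)(.*?)(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\Z)")

no_signature = "Just a short note.\nNo signature here.\n"
match = BODY.match(no_signature)

print(match.group(1) == no_signature)  # True: the body runs to the very end
print(repr(match.group(2)))            # '': only the end of the string matched
print(match.group(3))                  # None: no separator was found
```

Without the end-of-string alternative the regex would not match such a mail at all, and nothing would be quoted.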

To be honest, the author looks for these odd endings of the mail only so that the actual text of the mail can be located and extracted easily. There is no further interest in these parts.

Now the next macro follows: %REGEXPBLINDMATCH=""%text"", which lets the machine apply the regex to the text.

The %SUBPATT=""3"" macro returns the relevant part of the mail, subpattern 3, to the %QUOTES variable.
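To see the whole chain in one piece, here is a rough Python equivalent. It is a sketch under assumptions, not what TB does internally: Python's re module stands in for TB's engine, the m option is added so that "^" matches at line starts, "\Z" replaces "\z", the corrected "--\s\n" separator is used, and the function name and both sample mails are made up.

```python
import re

# The example regex as one single line (split here only across string literals).
QUOTE_RE = re.compile(
    r"(?ism)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?"
    r"(.*?)(^(- --|--\s\n|-----BEGIN PGP SIGNATURE)|\Z)"
)


def quotable_text(text: str) -> str:
    """Apply the regex to the text and return subpattern 3, the message body."""
    return QUOTE_RE.match(text).group(3)


signed = (
    "-----BEGIN PGP SIGNED MESSAGE-----\n"
    "Hash: SHA1\n"
    "\n"
    "Hello,\n"
    "thanks for the corrections.\n"
    "- -- \n"
    "Jane\n"
    "-----BEGIN PGP SIGNATURE-----\n"
    "(signature data omitted)\n"
    "-----END PGP SIGNATURE-----\n"
)
plain = "Hello,\nthanks for the corrections.\n-- \nJane\n"

print(quotable_text(signed) == "Hello,\nthanks for the corrections.\n")  # True
print(quotable_text(signed) == quotable_text(plain))                     # True
```

Both mails yield the same two body lines: the PGP lines in front and everything from the signature separator onward are left out.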

That’s it.

Writing a tutorial entirely without direct feedback was something new to me: you don’t notice when it gets too complicated or too academic. I tried to avoid both, and I tried to concentrate on those elements of regular expressions that are most useful. I really hope I was successful and that it wasn’t boring 😉

The tutorial isn’t a perfect and full description of regexian. If I had wanted to offer that, I could have copied J. Friedl’s book into TB’s help file. No, the tutorial was meant to give an idea, an initial help to get started. As with any other language, you will only learn the vocabulary by using it. If I was able to give you a hand in getting started, I’m content!

I would like to thank those who helped to convert my ideas into something readable and useful. My special thanks go to Marck who was very patient and who improved my translation. Thanks to (in alphabetical order):

Januk Aggarwal
Bert Bohla
Dirk Heiser
Hanja Nowicka
Peter Palmreuther
Marck D. Pearlstone
Stefan Peukert
Alfred Rübartsch
Andreas Rumpenhorst
Ingrid Spitzer
Carsten Thönges
Karin Uhlig
Arnd Wichmann