01.02.06 How to use Regular Expressions in TB!

Finally, we can try to use our new language in TB. First of all we have to know which tools are available to work with regular expressions. These tools are TB’s macros.

Macros

Not all of TB’s macros support regex directly. Most of them have nothing to do with regex, but you can apply a regex to the text they return in order to extract or modify information. That is one of the features that make TB so powerful.

The first macro we will look at is %REGEXPTEXT="regex". What does it do? It searches for the pattern „regex“ within the original text of a mail and returns the matched characters. The syntax is quite straightforward, as the following example shows: %REGEXPTEXT="[\d\.]+"

Used in a quick template and applied to a mail, this macro returns a run of digits and dots from the original text.
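If you want to experiment with the pattern outside TB, a few lines of Python will do. This is only a sketch: Python’s re module stands in for TB’s regex engine, the helper name regexp_text and the sample text are made up, and I assume that the macro returns the first match it finds.

```python
import re

def regexp_text(pattern: str, text: str) -> str:
    """Hypothetical stand-in for %REGEXPTEXT: return the first match or ''."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""

# Made-up mail text, for illustration only.
sample = "The Bat! v1.62 was released"
result = regexp_text(r"[\d\.]+", sample)
print(result)
```

Note that the character class matches any mix of digits and dots, so a lone full stop earlier in the text would be matched first.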

Let’s have a look at a fairly similar macro: %REGEXPQUOTES="regex". This macro does exactly the same as the first one, except that the returned text is inserted as quoted text rather than plain text.

That was nice and easy. But when it comes to the extraction of text from the header of a mail (kludges) or address book entries we need to combine some macros:

The first one we need is %SETPATTREGEXP. It defines the search pattern in the form %SETPATTREGEXP="regex", where „regex“ is the regular expression you created to match the text.

The second one is %REGEXPMATCH. Again, this is easily defined: %REGEXPMATCH="string", with „string“ being any text. It can itself be a template, so almost any TB macro can be used to provide the text here.

A regex defined through %SETPATTREGEXP remains valid until it is overwritten by a later %SETPATTREGEXP. This means you can apply the same pattern to several different texts in one go.

Before we have a look at another example I have to correct something. Did I say the syntax is quite easy earlier in this chapter? Well, that’s true as long as one only looks at one macro. But let’s see how this changes when we let the macro parse some text:

We already know the macro %REGEXPQUOTES. The same result can be produced in a different way. Let’s assume that we receive mails from a feedback form. Part of the content is „newsletter: yes“ or „newsletter: no“. We would like to create an autoresponder that uses exactly this information in a reply template, for example:
„Thank you for filling out our feedback form. You entered ‚newsletter: yes/no‘. Are you sure?“
You can create more sophisticated text and a better filter to use different templates for the reply, but for the moment let’s stick to this example;-).

The macro %QUOTES defines what text is to be used as quoted text in a reply. The only problem is that we have to tell %QUOTES which text should be used. After that we can copy it to the reply template, add our standard text and save it.

Ok, first the regex: „^newsletter:\s*(yes|no)“. It is defined with %SETPATTREGEXP="^newsletter:\s*(yes|no)".
We already know that %REGEXPMATCH applies the search pattern to any text we give it, so we need a macro that provides the original text of the mail, and that is %TEXT. Now we have to put it all together and create a template that uses the macros in the correct order.

The only thing that makes these macros difficult to use is the double-quote characters (") that delimit the definition part. In %SETPATTREGEXP they enclose the search pattern, and in %QUOTES they enclose the text that will be inserted as quoted text. Once you start to combine macros you have to tell TB which quote character belongs to which macro: the first macro must know whether the second quote character marks its own end or the beginning of the second macro’s definition. The same applies at the end of the second macro, and so on. This can be achieved by doubling the quote character (escaping) or by using different delimiters.

In schematic form this looks like:
%M1="%M2=""Def2""%M3=""Def3""". This is getting a bit confusing and hard to follow, so we could instead write:

%M1="%M2='Def2'%M3='Def3'". The example above would look like:

%QUOTES="%SETPATTREGEXP='^newsletter:\s*(yes|no)'%REGEXPMATCH='%TEXT'"

This example could be written more simply as
%REGEXPQUOTES="^newsletter:\s*(yes|no)", but only because the text we searched is the original message text, which %TEXT provides.
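To check the newsletter pattern against a sample mail before building the template, it can be tried in Python. This is a rough sketch, not TB itself: the function name and the mail body are made up, and I switch on re.MULTILINE so that ^ matches at the start of any line. Whether TB needs the equivalent (?m) option here is something you should test in your own installation.

```python
import re

PATTERN = r"^newsletter:\s*(yes|no)"

def find_newsletter_line(body: str) -> str:
    """Hypothetical helper: return the matched 'newsletter: ...' text or ''."""
    # MULTILINE is an assumption: it lets ^ match at every line start.
    match = re.search(PATTERN, body, re.MULTILINE)
    return match.group(0) if match else ""

# Made-up feedback mail.
body = "name: Jane Doe\nnewsletter: yes\ncomment: none\n"
print(find_newsletter_line(body))
```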

Next comes a macro combination that allows the extraction of several parts of the text. We know that we could define subpatterns in the regex by grouping sections with parentheses. We must now find a way to address them within TB.

TB provides a macro for this: %REGEXPBLINDMATCH="string". On its own it returns nothing; it only performs the match. Since we want to extract parts of the text rather than the whole match, we still need a macro that tells TB which of the subpatterns to return. This is %SUBPATT="n", where ‚n‘ denotes the n-th subpattern in the regex.

Now this combination will be quite difficult to read and understand. So I will explain it using an example and will generate the whole macro combination bit by bit. After that I will combine everything.

From the original date of a mail we want to extract the year, two digits only, and use it as quoted text. The date is provided by %ODATE. The regex is „\d{2}(\d{2})\b“. That means we capture two digits only if they are preceded by two digits and followed by a word boundary. Thus the first macro is: %SETPATTREGEXP="\d{2}(\d{2})\b".

The text in which the date is searched is defined using the macro %REGEXPBLINDMATCH="%ODATE". We are looking for the first subpattern, so %SUBPATT="1".

Now we put it all together, remembering to use the alternative '-characters:

%QUOTES="%SETPATTREGEXP='\d{2}(\d{2})\b'%-
%REGEXPBLINDMATCH='%ODATE'%SUBPATT='1'"

[Note: the template is split using the %- macro and can be entered as two lines!]
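The blind match followed by a subpattern lookup can be imitated in Python to see which part of the date is picked out. Again this is only a sketch: sub_patt is a made-up helper that combines what %REGEXPBLINDMATCH and %SUBPATT do in two steps, and the date string is just one possible format of %ODATE.

```python
import re

def sub_patt(pattern: str, text: str, n: int) -> str:
    """Hypothetical helper: match silently, then return the n-th subpattern."""
    match = re.search(pattern, text)
    return match.group(n) if match else ""

# Assumed sample of what %ODATE might contain.
date = "Sonntag, 7. April 2002"
year = sub_patt(r"\d{2}(\d{2})\b", date, 1)
print(year)
```

The whole match would be the four-digit year; only the group in parentheses, the last two digits, is returned.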

Another example? There is a regex for reply templates that modifies the name of the recipient. Instead of ‚Gerd Ewald‘ we would like to have ‚Gerd Ewald at TBUDL…..‘ Well, we could download this regex somewhere, but let us try to create it ourselves.

%OFROMNAME will give us the name.

The reply address is given by %OREPLYADDR. We will extract the list’s name with a regex. Usually the name of the list precedes the @-character: %SETPATTREGEXP="(.*?)\@"

This is used in combination with %REGEXPBLINDMATCH="%OREPLYADDR", of which we only want subpattern one: %SUBPATT="1"

The result then becomes the contents of the TO-field. Watch out: before you can enter text, this field has to be cleared. This is done with an initial empty assignment.

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_(.*?)\@_%-
%REGEXPBLINDMATCH=_%OREPLYADDR_%-
%SUBPATT=_1_" <%OREPLYADDR>'
[Note 1: the template is split using the %- macro and can be entered as seen! Note 2: the template makes use of a feature of recent versions of TB where any character may be used as a quoting delimiter, in this case the underscore and the single quote as well as the double quote. Users of earlier versions will have to resort to the clumsier doubled-delimiter syntax.]
The original reply address has to be added at the end, enclosed in „<>“-characters.
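To see what the finished TO-field should look like, the same steps can be sketched in Python. The function name and the address are invented for illustration; only the regex is taken from the template.

```python
import re

def list_reply_to(from_name: str, reply_addr: str) -> str:
    """Hypothetical helper that mimics the %TO template above."""
    # Subpattern 1 is everything in front of the @-character.
    match = re.search(r"(.*?)\@", reply_addr)
    list_name = match.group(1) if match else reply_addr
    return f'"{from_name} at {list_name}" <{reply_addr}>'

# Made-up list address.
to_field = list_reply_to("Gerd Ewald", "tbudl@example.com")
print(to_field)
```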

As you can see, the syntax is quite easy and stereotypical. The only difficult thing is to find out which macro provides the necessary information and how to extract it with the regex.

Here is another example, available on Marck’s FAQ page at www.silverstones.com.

%WRAPPED='Historians believe that on %ODATE%-
%SETPATTREGEXP="(?m-s)Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:%-
[\d]{0,2}\:[\d]{0,2})\s*?(.*))"%-
%REGEXPBLINDMATCH="%HEADERS" , at %SUBPATT="3"[GMT%SUBPATT="4"]%-
(which was %OTIME where I live) you wrote:'%-

Here, once again, the %- macro is used to make the whole combination easier to read. It has no meaning of its own; it only tells TB to treat the following line as a continuation of the current one. %WRAPPED means that the result of the macro combination will be word-wrapped at the column defined in TB.

What does the macro do?

The first part, „%WRAPPED='Historians believe that on %ODATE%-“, is just a lead-in: in every reply, the date of the original mail is appended to the text ‚Historians believe that on ‘.

The second part contains the regex that is much more interesting to us (I deleted the %- macro to show the regex in one line):

„(?m-s)Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:[\d]{0,2}\:[\d]{0,2})\s*?(.*))“

The multiline option is switched on and DotAll is switched off: (?m-s)
Then the regex looks for ‚Date:‘, which may be followed by any amount of whitespace. Because the star is greedy by default, a question mark follows it to make it lazy. The author escaped the colon with a backslash, which isn’t necessary. I don’t know why he did that, but it won’t cause problems, so we’ll leave it alone.

Now the first parenthesis follows. There is no need to group this part and I assume it is done for easier reading. You may delete it but then bear in mind that the total number of subpatterns has changed.

The second parenthesis matches any text up to and including four digits. We know that the regex will look in the kludges (%HEADERS) for the date, so we can guess that the author is looking for the year. This may be followed by whitespace.

Now we come to the third parenthesis. This is the one the author needs. He searches for three numbers of zero, one or two digits each, separated by colons. That is obviously the time. Whitespace may follow, and the fourth subpattern matches all the rest of the line: this is no more than the GMT information.

A closer look at the regex shows that it is applied to the header lines and that only subpatterns three and four are really needed.

The result could be:

‚Historians believe that on Sonntag, 7. April 2002 , at 11:22:59[GMT
+0200](which was 11:22 where I live) you wrote:‘

It works, although the layout would need a bit of DIY.
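If you would like to watch the four subpatterns at work, the regex can be run in Python against a sample header block. Take this as a sketch under a few assumptions: the header lines are made up, Python’s re does not accept the global „(?m-s)“ form and so only „(?m)“ is used (DotAll is off by default there), and TB’s own engine may differ in details.

```python
import re

# Regex from the template, with "(?m-s)" reduced to "(?m)" for Python.
DATE_PATTERN = (
    r"(?m)Date\:\s*?((.*?[\d]{4})\s*?"
    r"([\d]{0,2}\:[\d]{0,2}\:[\d]{0,2})\s*?(.*))"
)

# Made-up header block standing in for %HEADERS.
headers = (
    "From: Gerd Ewald <gerd@example.com>\n"
    "Date: Sun, 7 Apr 2002 11:22:59 +0200\n"
    "Subject: regex\n"
)

match = re.search(DATE_PATTERN, headers)
time_part = match.group(3)            # subpattern 3: the time
zone_part = match.group(4).strip()    # subpattern 4, whitespace trimmed here
print(f"at {time_part}[GMT{zone_part}]")
```

Subpattern 4 starts with the blank that precedes the time zone, which is one reason the layout of the real template needs some tidying.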

Other Possibilities to Use Regular Expressions in TB

There are other possibilities for using regex in TB than macros.

One example is the text search option in the mail editor. It is especially useful for finding strings in long mails with the special features that regex offers.

Find Text

This window may be opened with Ctrl-F or using the ‚Edit Find‘ menu entry in the mail editor. Just enter the regex in the text line. Don’t forget to check the ‚regular expressions‘ box in the Options section.

In almost the same way that I search for text within a single mail, I can use regex to search the mails stored in folders. Just press F7 while in folder view. This opens a search window that lets you search for text in mails. In the ‚Options‘ tab panel you can enter the regex in the ‚Search for‘ field.

Message Finder

Go to the ‚Advanced‘ tab panel and check „Regular Expressions“.

Advanced Tab of Message Finder

You can use regex in filter conditions to optimise the organisation of your inbox. This is a field where regex are as efficient as in macros. Go to the ‚Account, Sorting Office/Filters‘ menu item. Open the filter definition, go to the ‚Options‘ tab panel and check ‚Regular Expressions‘.

Sorting Office – Filters

Overview and Summary

What did we learn in this final chapter?
There are several ways to use regular expressions in TB, which are:

  • TB offers macros that can use regular expressions to find, extract and modify mail text:
    • %REGEXPTEXT="regex" returns the matched string within a mail as text
    • %REGEXPQUOTES="regex" returns the matched string within a mail as quoted text
    • %REGEXPMATCH="string" defines the text to which the regex is applied and returns the matched text. Any macro or text may be used for ‚string‘
    • %REGEXPBLINDMATCH="string" defines the text to which the regex is applied. It does not return any text; %SUBPATT is needed for that. Any macro may be used for ‚string‘
    • %SETPATTREGEXP="regex" defines the regex for %REGEXPMATCH and %REGEXPBLINDMATCH. The definition remains valid until it is overridden by a subsequent %SETPATTREGEXP
    • %SUBPATT="n" returns the n-th subpattern when used with %SETPATTREGEXP and %REGEXPBLINDMATCH
  • You can use regular expressions to look for specific messages as well as to search strings within mails. Furthermore you can use them for defining filters.

Exercises

1. You remember the regex we wrote to clean the subject line? „^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$“ . Try to improve this one: instead of ‚(was:xyz)‘ PGP-users will find ‚(PGP Decrypted)‘. The regex should find these kinds of subject as well. Furthermore the regex should be available within a reply template.

2. In the last chapter I described a macro that modifies the TO-address for mailing lists:

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_(.*?)\@_%-
%REGEXPBLINDMATCH=_%OREPLYADDR_%-
%SUBPATT=_1_" <%OREPLYADDR>'

Try to change it so that %REGEXPMATCH replaces %REGEXPBLINDMATCH and %SUBPATT. You will need to modify the regex. Hint wanted? Ok: the subpattern was created because otherwise the @-character would have been included in the match. The only thing you have to do is to find a regex that does not match the @-character and has no subpattern.

1. Example

Well, that is not too difficult. You only need to extend the last part of the regex with an alternative:

„^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$“

But now let’s look at the template. We would like to create a new subject, and the macro we need for that is %SUBJECT. Because we use it when replying to a message and want a proper subject line, it should start with:
%SUBJECT="Re

Then we add the regex:

%SUBJECT="Re: %SETPATTREGEXP=""^Re(.*?):\s*(.*?)\s*
(\(was:.*\)|\(PGP Decrypted\))*$""
[Note: the regex is wrapped for layout reasons. It must be entered as a single long line!]

We will apply it to the original subject, %OFULLSUBJ, and need the second subpattern: %SUBPATT="2"

%SUBJECT="Re: %SETPATTREGEXP=""
^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$""
%REGEXPBLINDMATCH=""%OFULLSUBJ""%SUBPATT=""2"""
[Note: the template is wrapped for layout reasons. It must be entered as a single long line!]

Ok, that’s it. Additional exercise: what if the subject does not have any ‚Re‘ but ‚AW‘, ‚FWD‘ or anything else? Go, try to add further alternatives at the start of the regex.
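Before you start on the additional exercise, it may help to have a quick way of testing candidate patterns. Here is a small Python sketch of the subject clean-up. The function name and the sample subjects are made up, and the regex is the one from this solution with the opening parenthesis of the ‚(was:‘ alternative escaped.

```python
import re

SUBJECT_PATTERN = r"^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$"

def clean_subject(original: str) -> str:
    """Hypothetical helper: rebuild the subject from subpattern 2."""
    match = re.search(SUBJECT_PATTERN, original)
    if match is None:
        # Subjects that do not start with 'Re' are left untouched.
        return original
    return "Re: " + match.group(2)

for subject in ("Re: Regex tutorial (PGP Decrypted)",
                "Re[2]: Hello (was: Old topic)",
                "AW: Hallo"):
    print(clean_subject(subject))
```

The last subject is returned unchanged, which is exactly the gap the additional exercise asks you to close.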

2. Example

A positive lookahead assertion will help: „.*?(?=\@)“. The assertion looks for the @-character but does not include it in the match. Therefore, the template is easier to write:

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_.*?(?=\@)_%-
%REGEXPMATCH=_%-
%OREPLYADDR_" <%OREPLYADDR>'
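The difference between the two approaches is easy to see when both patterns are run on the same address. This is a sketch in Python with a made-up address; lookahead assertions behave the same way there for this simple case.

```python
import re

address = "tbudl@example.com"  # made-up reply address

# Subpattern version: the whole match still contains the @-character.
with_group = re.search(r"(.*?)\@", address)
# Lookahead version: the @-character is checked but not consumed.
with_lookahead = re.search(r".*?(?=\@)", address)

print(with_group.group(0))      # whole match of the first pattern
print(with_group.group(1))      # its first subpattern
print(with_lookahead.group(0))  # whole match of the lookahead pattern
```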
