01.04 – Regex examples (English)

Examples of various regular expressions

Finally, here are some regular expressions that may be helpful in everyday use and that illustrate several of the syntax elements. They were taken from various Internet sources or written specifically for this purpose.

Check an email address

[\w-]+(?:\.[\w-]+)*@(?:[\w-]+\.)+[a-zA-Z]{2,7}

This regex matches about 99% of all common email addresses. A definitive regex that matches every permitted form is not practical: Jeffrey Friedl published one, but it is about 6 kB long and far too complicated for everyday use.
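The pattern can be tried out with a few lines of Python. This is only a sketch: Python's re module is an assumption here (the regex engine of your mail program may differ in details), and the helper name looks_like_email is made up for illustration.

```python
import re

# Pattern copied unchanged from the text above.
EMAIL = re.compile(r"[\w-]+(?:\.[\w-]+)*@(?:[\w-]+\.)+[a-zA-Z]{2,7}")

def looks_like_email(text: str) -> bool:
    """Return True if the whole string has the shape of an email address."""
    # fullmatch() requires the pattern to cover the entire string.
    return EMAIL.fullmatch(text) is not None

print(looks_like_email("john.doe@example.com"))      # True
print(looks_like_email("john.doe@example"))          # False: no top-level domain
print(looks_like_email("john doe@example.com"))      # False: space in the local part
print(looks_like_email("john+lists@example.com"))    # False: "+" is not covered
```

The last line shows one of the rarer forms that fall outside the "99%": a plus sign is legal in an address, but the character class [\w-] does not include it.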

Check for a valid IP address

(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])
\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)
\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)
\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])

[Note: the regex is wrapped here for layout reasons. It must be entered as a single long line!]

The form of this regex should be familiar from one of the problems solved in the tutorial: checking for a valid date. The whole problem is divided into smaller chunks, one per octet, each of which does part of the checking. Some combinations are not allowed (0.0.0.0, or any value greater than 255), and an IP address may be written in different ways (192.0.0.1 or 192.000.000.001). In both cases the regex, applied to the complete string, matches only valid addresses.
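The following Python sketch joins the four wrapped lines into one pattern and tests the cases mentioned above. Python's re module and the names FIRST, MIDDLE, LAST and is_valid_ip are assumptions chosen for illustration only.

```python
import re

# The four lines from the text, joined into one pattern.
FIRST = r"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])"
MIDDLE = r"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)"
LAST = r"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])"
IP = re.compile(FIRST + r"\." + MIDDLE + r"\." + MIDDLE + r"\." + LAST)

def is_valid_ip(text: str) -> bool:
    """Return True if the complete string is accepted by the pattern."""
    return IP.fullmatch(text) is not None

print(is_valid_ip("192.0.0.1"))          # True
print(is_valid_ip("192.000.000.001"))    # True
print(is_valid_ip("0.0.0.0"))            # False
print(is_valid_ip("256.1.1.1"))          # False

# Without anchoring, a substring search still finds "56.1.1.1" in here.
print(IP.search("256.1.1.1").group(0))   # 56.1.1.1
```

The last line is the reason for testing the complete string: the pattern itself carries no anchors, so a plain search can succeed on part of an invalid address.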

Check for multiple matching recipient names in an email header

((.*?)@.*?,\s*)(\2@.*?(,\s*)?){2,}

This check assumes that the recipients are listed in the header in the following way:

„person@web.com, person@gmx.ch, person@weissnicht.de“

All characters preceding the @ character are stored in subpattern 2 by the first part of the regex, ((.*?)@.*?,\s*). The second part, (\2@.*?(,\s*)?), reuses the result of that first match and checks the following addresses by means of the back-reference \2.
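A short Python sketch makes the role of subpattern 2 visible. Python's re module and the name SAME_RECIPIENT are assumptions for illustration; the header string is the example from above.

```python
import re

# Pattern copied unchanged from the text above.
SAME_RECIPIENT = re.compile(r"((.*?)@.*?,\s*)(\2@.*?(,\s*)?){2,}")

header = "person@web.com, person@gmx.ch, person@weissnicht.de"
match = SAME_RECIPIENT.search(header)
print(match.group(2))   # person

# Different names in front of the @ character: the back-reference fails.
mixed = "alice@web.com, bob@gmx.ch, carol@weissnicht.de"
print(SAME_RECIPIENT.search(mixed))   # None

# Only two addresses with the same name: {2,} is not satisfied.
short = "person@web.com, person@gmx.ch"
print(SAME_RECIPIENT.search(short))   # None
```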

This regex won’t help if the recipients are written in the following way:

„Person 1 <person@web.com>, Person 1 <person@gmx.ch>, Person 1 <person@weissnicht.de>“

In this case you should use the more generic regex:

 (((.*?)\s*<)?(.*?)@.*?>?,\s*)(((.*?)\s*<)?\4@.*?>?(,\s*)?){2,}

Beware: the back-reference has to be adjusted! The recipient name within the email address is now captured by the fourth pair of parentheses, so the back-reference is \4. In both regexes the number in the curly brackets sets how many more times the recipient must appear after the first occurrence: {2,} therefore requires at least three addresses with the same name in the header.
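The generic variant can be checked in the same way. Again this is only a sketch that assumes Python's re module; the name GENERIC is made up, and the third header with different people is an invented counter-example.

```python
import re

# Pattern copied from the text above (without the leading layout space).
GENERIC = re.compile(
    r"(((.*?)\s*<)?(.*?)@.*?>?,\s*)(((.*?)\s*<)?\4@.*?>?(,\s*)?){2,}"
)

named = ("Person 1 <person@web.com>, Person 1 <person@gmx.ch>, "
         "Person 1 <person@weissnicht.de>")
plain = "person@web.com, person@gmx.ch, person@weissnicht.de"
others = ("Alice <alice@web.com>, Bob <bob@gmx.ch>, "
          "Carol <carol@weissnicht.de>")

# Group 4 holds the part in front of the @ character in both notations.
print(GENERIC.search(named).group(4))   # person
print(GENERIC.search(plain).group(4))   # person
print(GENERIC.search(others))           # None
```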

Check for multiple matching domain names in email addresses

(.*?(@.*?),\s*)(.*?\2(,\s*)?){3,}

This is much more likely to happen than multiple matching names:

Albert.a@web.com, berta.b@web.com, charlie.c@web.com, dora.d@web.com

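The regex above matches this list, once again by means of a back-reference: \2 holds the domain of the first address, and {3,} requires that domain to appear at least three more times.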
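A minimal Python sketch of this check, assuming Python's re module; the name SAME_DOMAIN and the shorter three-address list are invented for illustration.

```python
import re

# Pattern copied unchanged from the text above.
SAME_DOMAIN = re.compile(r"(.*?(@.*?),\s*)(.*?\2(,\s*)?){3,}")

header = "Albert.a@web.com, berta.b@web.com, charlie.c@web.com, dora.d@web.com"
match = SAME_DOMAIN.search(header)
print(match.group(2))   # @web.com

# Only three addresses on the same domain: one repetition is missing.
three = "a@web.com, b@web.com, c@web.com"
print(SAME_DOMAIN.search(three))   # None
```

Lowering the number in the curly brackets makes the filter stricter, raising it makes it more tolerant.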

Conditional check for digits in the subject line

The idea is to match only those mails that have several digits somewhere in the subject line. This is quite often used to identify spam. However, an active eBay seller or buyer usually receives mails with a 10-digit article identification number in the subject line, which would then be recognized as spam. Therefore we need a regex that uses the condition: “find several digits in the subject line unless preceded by the word ‘article’”.

  .*?(?<!article) \d{5,}

This is a look-behind assertion: the regex matches a space followed by five or more digits unless that space is directly preceded by ‘article’.
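The effect of the look-behind can be seen in a small Python sketch. Python's re module, the name DIGITS and the sample subject lines are assumptions made for illustration.

```python
import re

# Pattern copied from the text above (without the leading layout spaces).
DIGITS = re.compile(r".*?(?<!article) \d{5,}")

# Digits after some other word: the subject is matched.
print(DIGITS.search("Your order 1234567") is not None)         # True

# Digits directly after "article ": the look-behind blocks the match.
print(DIGITS.search("eBay: article 1234567890") is not None)   # False

# The comparison is case-sensitive, so "Article" does not count.
print(DIGITS.search("Article 1234567890") is not None)         # True
```

If subjects with a capitalized "Article" should be exempt as well, the pattern needs a case-insensitive option, provided the regex engine in use supports one.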

Check for several consonants in a row

Some spammers use different mail addresses on different domains. The characteristic feature of such an address is a long series of consonants: fndghklxrrstU@example.com
If you don’t want to delete all mails from this domain on the server with the selective download filter, you need another approach to this problem. One way is to define a character class that allows all characters except vowels:

 (?i)[^aeiou]{5,}.*@

If we assume that a more or less genuine address has fewer than five consonants in a row, this regex should help to reduce spam. Attention: this may not work in all languages. Please check whether it is suitable for your own language.
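Before using this filter it is worth testing it against a few real addresses. The sketch below assumes Python's re module; the name CONSONANT_RUN and the two ordinary addresses are invented examples.

```python
import re

# Pattern copied from the text above (without the leading layout space).
CONSONANT_RUN = re.compile(r"(?i)[^aeiou]{5,}.*@")

# The spam-like address from the text is matched.
print(CONSONANT_RUN.search("fndghklxrrstU@example.com") is not None)   # True

# An ordinary address without long runs is not matched.
print(CONSONANT_RUN.search("anna.meier@example.com") is not None)      # False

# [^aeiou] also accepts dots and digits: "hn.sm" counts as a run of five.
print(CONSONANT_RUN.search("john.smith123@example.com") is not None)   # True
```

The third result shows that the class [^aeiou] is wider than "consonants": punctuation and digits count too, so a larger number in the curly brackets may be needed to avoid false positives.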