A brief tour of regular expressions

I have been using regular expressions for the better part of my entire programming career. It started out when I was very young, probably around 12, solving these programming "puzzles" called regular expressions. Over the years I have learned a lot about them, and today I'm going to share this knowledge with you.

Disclaimer

There are no shortcuts. I idle the regex channel on freenode and hundreds of people on a daily basis pop in and out asking for help with regular expressions. They typically don't understand the regex they receive, and worse yet - this is probably used in production code somewhere. The only way to become good at regular expressions is to actually learn and understand how they work. At the very least you could understand what the tokens represent. I do recommend you find yourself a playground, such as regex101, where you can play around and experiment.

Protip: Check out the quickref and explanation on regex101

With great power comes great responsibility

Before I dive into the exciting details about regular expressions, you need to know when they are suitable to use and when they are not. You typically hear that you should not parse HTML with regex, which you should not. The reason is because HTML is an irregular language and regular expressions are designed to match (you might have guessed it) regular languages. Or, if you prefer more formal terms: In the Chomsky Hierarchy, regular languages are considered type 3 languages, while context-free languages (such as HTML) are type 2. Type 3 languages are a subset of type 2.

This does not mean that you can never use regex to parse HTML. If you just want to grep something in a document or find something quick and easy they do the job just fine. Or if you can guarantee that the input looks in an expected fashion, then they also work great. But other than that there are better tools for the job (such as an HTML parser).

This restriction is not limited to just HTML though. There are many scenarios where regex is not the right tool for the job, such as parsing source code or an email.

Side note: some regex engines today are so powerful that they can in fact parse HTML with great success (such as Perl or PCRE, see this stackoverflow post), but you should still not do it.

The regex engine

Before you become a regex expert you need to understand how the regex engine works. This is what parses your regular expression and in turn tries to match it against a given string. There are hundreds of different engines out there, and I can not cover them all, so I will limit myself to what the vast majority of languages use: backtracking engines. This is what you'll find in Perl, ECMAScript, PHP, Ruby, Java, etc. These engines will parse your expression from left to right, and the leftmost alternative will have precedence over the rightmost. That means you can "weigh" your expressions. More on that later.

Some languages, such as golang, use non-backtracking engines. The benefit here is that they are guaranteed to finish in a linear time, but have limited functionality (for better or worse). You cannot, for example, use backreferences or lookarounds. But thats a topic for another time. Maybe.

Backtracking? Wha?

Backtracking simply means taking a step back. This is what the engine might do when it fails to match a certain regex token at a specific position. You typically want to minimise the amount of backtracking, since this can be very time consuming, and sometimes result in what's known as catastrophic backtracking.

You have probably written countless expressions that unintentionally utilise backtracking. Lets say you have a regular expression where you want to match any string ending in x, being 2 or more characters long including the x. You might write something like this:

my $regex = qr/^.+x$/;  

When the engine tries to match it against the string any string ending in x it will actually perform one step backtracking. You can see this behaviour live here (see step 4).

The reason this happens in this case has to do with greedy quantifiers (the + in the regex). Lazy quantifiers on the other hand will do the complete opposite - they will match as little as possible.

Protip 2: You can see how many steps your regular expression takes to match against a given string on regex101

How to minimise backtracking

It is always beneficial to limit the amount of backtracking your pattern can perform, and to know in which scenarios it can happen. Otherwise you might end up with cases where your expression matches something unexpected, or even worse, eats up all your CPU time.

First step is understanding what you actually want to match. You should almost never use the dot (.) in your regular expressions. Do you really mean to match anything except line terminators? Or can it be written in a better way? I typically recommend negated character classes, as they are more verbose and allow you to be more fine grained in what you want to match. The dot roughly translates to [^\n].

Instead of writing /^.+?,\n, you can write /^[^\n,]+,\n/, which will perform a lot better. I encourage you to try both patterns out and see for yourself.

Some patterns however can be especially dangerous, such as nested quantifiers. If you end up with a pattern where you see this, you need to be extra careful. What I mean by this is if you have a group that contains repeated tokens, and then you repeat the group. This creates a lot of combinations that might take very long for the regex engine to exhaust. If you have this, you need to make sure that the engine only has one way to match the same match, because it will try every possible combination.

The typical example, that results in catastrophic backtracking, can be found here. It might not look so bad, but when you open up the debugger and take a look at whats going on it becomes abundantly clear: the engine is trying to match any amount of as, repeated any amount of times, at every position in the string.

I'll go ahead and explain the first few steps the engine performs in the regex101 debugger:

  1. The engine begins matching.
  2. The engine asserts we are at the start of the string
  3. The engine enters the group
  4. The engine finds the token a+ and matches it accordingly
  5. The engine proceeds to the next token (of which there are none)
  6. The engine asserts all alternatives in the group have been exhausted
  7. It tries to assert we are at the end of the string, which we are not, since there is still a !
  8. The engine steps back one step from the end, and tries matching again.
  9. Go back to step 3

The amount of steps will increase exponentially with the length of the string.

In some languages you can remedy this with tokens that limit backtracking, such as possessive quantifiers or atomic groups. These tokens tell the engine to not retry a match at a different position if they fail to match, or to not give up anything they have consumed. This can be seen here. These tokens however are quite advanced and thus out of scope for this article.

What's the point?

So far I have only discussed the potential downsides of the backtracking engine. But trust me, there are plenty of benefits.

The greatest benefit is that you can capture arbitrary data from the input string and match it any where else in the pattern. This is done using backreferences (which as I mentioned is not available in non-backtracking engines). This functionality can be used to, for example, match repeated characters in a string:

# \1 references the data captured in the first group by the character class
my $regex = qr/([a-z])\1/;  

Some languages, such as perl and pcre, allow for even more advanced functionality such as recursion. This in turn can allow regular expressions to match and validate nested structures.

These engines also support what is known as lookarounds. These tokens allow you to assertively peek either forwards or backwards in the string and perform matches that do not consume the data they match. For example, remove all single digits that don't have a following exclamation mark.

(my $new_string = $my_string) =~ s/\d(?!!)//g;

But even here you might get bitten with backtracking. Lets say you want to extend the regex above to remove sequences of digits which don't have an explanation mark following them. You might be inclined to write something like

(my $new_string = $my_string) =~ s/\d+(?!!)//g;

But that doesn't quite work out like you might expect... I'll leave this one as an exercise for the reader.

Some useful design patterns

By now you should have gathered some insight into how the engine works, and some of the pros and cons of the backtracking engine. But before I leave you, I'm going to show you some useful design patterns that might help you with your regex riddles in the future.

Replace everything, unless...

I find this pattern to be quite useful. It can be quite tricky to replace X but not Y, when they are a subset of eachother, as there is no real way of negating an expression. Lets say you want to remove all the strings that aren't quoted digits. Your intuition probably tells you use negative lookarounds or maybe negated character classes. In this particular scenario it might work, but generally it never works out well though. So one way to think about is to match everything you don't want to match, and everything you do want to match - and then replace both of those matches with what you don't want to match (remember I mentioned something about weighing expressions earlier?). Let me show you with an example.

Using the example and the little trick above, your regex would look something like this:

(my $new_string = $my_string) =~ s/("\d+")|"[^"]+"/$1/g;

Because the leftmost alternative will be parsed first, it has higher precedence than the rightmost, which means the engine will first try to match the quoted digits, before matching everything else. You can see the result visually here. This works because the first capture group will be null in the cases where a quoted digit was not matched, and thus not affect the substitution result.

Match everything until...

Sometimes you want to match everything until you reach a certain character or string. If its just a character, you can easily get away with using a negated character class. If you want to match, say, everything except a comma, that would just be:

my $regex = qr/^[^,]+/;  

However, lets extend this example to match everything except the string foo. You can no longer use a negated character class, since that only takes individual characters and not strings.

"Aha! I should use a negative lookahead!" you say, and whip up something like this:

my $regex = /^.+(?!foo)/;  

But quickly realise that does not work. It still seems to consume foo?? If we open up the debugger on regex101 we can see why. The greedy quantifier is consuming everything, leaving nothing for the negative lookahead to check. If we make the quantifier lazy, it simply only matches the first character and exits.

What we need to do is somehow, at every position in the string, check if that character has the string foo following it. If not, continue until we do or we are at the end of the string. Exactly this can be accomplished with the following expression:

my $regex = qr/^(?:(?!foo).)+/;  

If you open up the debugger you can see how it works under the hood. What you also notice is that this is quite intensive, but typically not enough to yield any problems. A regex engine can work through tens of thousands of steps in milliseconds. Unless you notice a performance problem there is no need to worry.

Match string that don't contain...

Sometimes you want to do a filtered search, maybe look for all log entries that don't contain WARNING but do contain ERROR. This can easily be achieved by using a negative lookahead. Remember, everything inside a lookaround is just another regular expression, with the same rules applying like anywhere else. You can thus write something like this:

my $regex = qr/^(?!.*WARNING).*ERROR/;  

The negative lookahead will work through the entire string, looking for the string WARNING. If there is no match, the engine continues and looks for the string ERROR. If there is a match for WARNING, the execution will be halted there will not be any overall pattern match.

Match recursive structures

This is not as much a good to know design pattern as it is a showcase for the strength of regular expressions. Nevertheless, you might find this useful (if your language supports it).

Believe it or not, you can actually match balanced brackets in any given string with pure regex. This does however require some advanced knowledge of regular expressions. Languages such as perl and pcre allow for recursion in the regex.

The following expression will match an entire sequence of balanced (and potentially nested) brackets:

my $regex = qr/\((?>[^()]+|(?R))+\)/;  

Besides using nested quantifiers (which as you know by now, can be very dangerous), we also use an uncommon token (?R). This token recurses the entire pattern when encountered. You can open up the debugger and see how the engine works through it.

In some cases you might want to retrieve each individual string in the parenthesis. This can be achieved with using a positive lookahead, which as you remember, has the neat feature that it does not consume what it matches (see it here):

my $regex = qr/(?=(\((?>[^()]+|(?1))+\)))/;  

I will leave it up to you, the reader, to figure out exactly how these work and why its possible.

I hope you found this article useful. Feel free to leave a comment with your thoughts below, and ideas of what I should write about in the future.

Firas Dib

Read more posts by this author.