Parsing regular expressions: the tale of regex101.com

When I first began working on regex101.com, my initial challenge was to create a robust regex parser that would lay the foundation for almost all the other features on the site. This proved to be harder than I had expected, both to do well and to do in a maintainable fashion.

The regex parser is probably the most re-written part of the code, but with each iteration, it has only gotten better. While I don't have any local copies of the old versions, you can head on over to the web archive and travel back in time to see how both the design and the code have evolved.

Scope

The first challenge arose from the fact that I wanted to support the entire PCRE syntax, and not only Javascript or any other subset. There are, and were even then, many websites that allow users to test their regular expressions; some of the nicer ones even highlighted your regex. But they were all limited to Javascript, and I didn't feel that any one of them was good enough.

This is where I wanted to make a change. I chose PCRE because it is one of the most widely used libraries, is a superset of most other flavors, and is much more powerful than the Javascript engine.

I wanted to create a platform where users could learn and take full advantage of the power of regular expressions. With time, I extended my feature set to include even more languages, which resulted in even more complexity. But more on that later.

In this post I'm going to try to limit my discussion to only the regex parser and how it evolved over time as I worked on the website. I might write more posts in the future about other parts of the site, if there is any interest.

First iteration

I headed into this project with limited knowledge as to how you were "supposed" to parse complex structures:

  • How should the code be structured?
  • How do I keep the code easy to maintain?
  • How should I keep complexity low as syntax becomes ever more complex?
  • How should I include support for multiple languages with different rules?

My very first attempt was a basic parser that would step through each character in the string, using a style I was familiar with from previous endeavours.

var currentChar;

// Step through the input one character at a time, peeking ahead
// when a token spans more than one character (e.g. \w).
while ((currentChar = stream.next())) {
  if (currentChar === '\\' && stream.peek() === 'w') {
    // ...
  }
}

This quickly turned into spaghetti-like code and I realised it would never work. I abandoned the idea and went back to the drawing board.

Second iteration

I read up a bit online, and thought, maybe a recursive descent parser could work here. It would make matching brackets and parentheses easy and seemed like a solid concept.

I wrote my new parser according to all the rules in the book: mutually recursive procedures, maintainable code, well-commented code, etc. But as I added more and more functionality I noticed something: it was slow. And it kept getting slower. The parsing itself worked well, so on that count it was good. But because performance was so bad, and seeing as this was going to run on every keystroke, it was not viable. I had to scrap it. Good ol' (and painful) ⌘ + A + Del.
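
To give an idea of what a recursive descent parser for regexes looks like, here is a minimal sketch (the names are made up, and it only handles alternation and grouping; the real thing covered far more):

function parseAlternation(stream) {
  var branches = [parseSequence(stream)];

  // Alternation has the lowest precedence: collect the `|`-separated branches
  while (stream.peek() === '|') {
    stream.next(); // consume the '|'
    branches.push(parseSequence(stream));
  }

  return { type: 'ALTERNATION', branches: branches };
}

function parseSequence(stream) {
  var terms = [];

  while (stream.peek() && stream.peek() !== '|' && stream.peek() !== ')') {
    terms.push(parseTerm(stream));
  }

  return { type: 'SEQUENCE', terms: terms };
}

function parseTerm(stream) {
  if (stream.peek() === '(') {
    stream.next();                        // consume the '('
    var body = parseAlternation(stream);  // recurse into the group
    stream.next();                        // consume the ')'
    return { type: 'GROUP', body: body };
  }

  return { type: 'LITERAL', value: stream.next() };
}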

I had now been introduced to a new variable: performance. I had limited experience in web applications previously, having only written typical websites and not interactive applications of this nature. I was quite naive and still not familiar with the performance aspects of Javascript, and certainly not how to write web-performant code.

Third iteration

While I sat pondering what I should try next, I remembered that I was quite proficient in regular expressions. Could I use one, or even several, regular expressions to parse a regular expression?

Yes... yes I could.

This is where things really kicked off. I used my regex knowledge along with the excellent RegExp.prototype.exec in Javascript to parse the input. It was at this point that I stumbled upon a basic regex parser for Javascript, written by Steven Levithan, which worked in a similar fashion. This laid the foundation for my parser and I took inspiration from Steven's work. Performance was good and the code seemed easy enough to maintain (even though it grew much larger and much more complex than Steven's quite rapidly).

// The actual tokenizer regex was enormous (see its "final moments" below)
var regexTokenizer = /.../g;
var match;

// With the global flag set, each exec() call resumes where the previous
// match ended, stepping through the input one token at a time.
while ((match = regexTokenizer.exec(input))) {
  if (match[0] === '\\w') {
    // ...
  }

  // And so on
}

This was the first working parser I had come up with. It was by no means perfect: buggy, hard to maintain, and clumsy. But it was good enough for now and I decided to release the first version of the site.

Evolution

As time progressed I added more features and fixed bugs. Eventually the time came when I wanted to support more flavors. Since each flavor has its own quirks, I wanted to give the user the best experience and allow them to test in the flavor where they were actually going to use the regex. The first flavor to be added to the list was Javascript, and soon after, Python.

The regexTokenizer regex would return a string representing a token and put it in the match variable, which I then compared against known strings in order to tokenize the regex accordingly. The code was riddled with if-statements trying to figure out what each token meant, and everything was tailored to PCRE. For example, PCRE supported negative lookbehinds and several escape sequences which Javascript did not. The easiest fix for me at the time, being quite exhausted and bored from all the re-writing, was to just guard each if-statement. Since Javascript was a subset of PCRE, I simply had to modify the relevant if-statements:

if (match[0] === '\\K' && flavor !== Flavors.JS) {  
  // ...
}

Quick and easy. It worked... right?

Seeing as I never seem to be able to leave things alone, I wanted to expand it further. Now I added the Python flavor. This was also a subset of PCRE, so the majority of the work was done, but it still had a few quirks, and re-writing all the if-statements to accommodate Python was not a sustainable solution in the long run:

if (match[0] === '\\K' && (flavor !== Flavors.JS && flavor !== Flavors.PYTHON)) {  
  // ...
}

So I decided to write a little helper which would read off of a pre-defined table. This table would contain all the possible tokens and which flavor supported what:

var supportTable = {  
  '\\K': [Flavors.PCRE],
  '\\w': [Flavors.PYTHON, Flavors.JS, Flavors.PCRE],
  // ...
};

function checkFlavorSupport(token, flavor) {  
  return supportTable[token].indexOf(flavor) !== -1;
}
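
With the table in place, the earlier guard becomes a single lookup, something like:

// No more hard-coded flavor lists in every guard
if (match[0] === '\\K' && checkFlavorSupport('\\K', flavor)) {
  // ...
}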

This was the method I used for a couple of years, up until the new version of the site was released, where I re-coded everything from scratch. But we'll get to that.

If anyone is curious, one of the regexTokenizer regexes for PCRE (I had several for each language) ended up looking like this in its final moments:

var regexTokenizer = /\[\^?(?:\\Q\\E)?]?(?:\\Q(?:(?!\\E)[\s\S])*(?:\\E)?|\[:\^?[^:]*:\]|[^\\\]]|\\[\S\s]?)*]?|\\(?:N(?:{\w*}?)?|k(?:'(?:\w+'?)?|<(?:\w+>?)?|{(?:\w+}?)?)|g(?:[1-9]\d*|{(?:-?[0-9]+}?|(?:\w+}?)?)|<(?:[+-]?\d+>?|(?:\w+>?)?)|'(?:[+-]?\d+'?|(?:\w+'?)?))|Q(?:(?!\\E)[\s\S])*(?:\\E)?|[0-3][0-7]{1,2}|[1-7][0-7]|0|[1-9][0-9]{0,2}|x(?:[0-9A-Fa-f]{2}|{[0-9A-Fa-f]+})?|c[\u0000-\u007F]|[pP](?:C[cfnos]?|L[lmotu&]?|M[cen]?|N[dlo]?|P[cdefios]?|S[ckmo]?|Z[lps]?|\{\^?(?:\w+\}?)?)|[\S\s]?)|\((?:\?(?:\(DEFINE\)|<?[=!]|'(?:\w+'?)?|P?<(?:\w+>?)?|(?:[imsxXUJ-]*:|[imsxXUJ-]*\))|(?:P[=>]\w*|[0R]|[+-]?[1-9]\d*)\)?|\#[^)]*\)?|[!|>:]|&\w*\)?)?|\*[:\w]+\)?)?|(?:[?*+]|\{[0-9]+(?:,[0-9]*)?\})[+?]?|[^#.?*+^{$[()\\/|\r\n](?!(?:[?*+]|\{[0-9]+(?:,[0-9]*)?\})[+?]?)|[\S\s]/g;  

Fun, right? No, not at all.

I knew what I had would not survive the test of time. It was around 2500 lines of spaghetti code with surgically inserted if-statements to handle quirks for each flavor and whatever else there was. Even then, it was riddled with bugs. It was completely unmaintainable and I dreaded each time I had to debug it or fix something.

But it worked well enough for most people, which gave me some space to work on other features of the site, such as the regex explanation, unit tests, regex debugger, library, etc. The new features and multi-flavor support attracted more and more users and the site kept growing in popularity. But at the same time, my interest in it was diminishing because I knew I had accumulated a lot of technical debt.

The next step

I knew this could not go on, and the only solution was to restart from scratch. The GitHub issues were piling up and more and more bugs were being discovered.

After having ignored it for way too long, I made some time, and started tapping away at the new version. Slowly but steadily it was getting there. This time I was not going to repeat my errors of the past.

I re-wrote the parser, still based off of the same concept as before (using exec), but a bit cleaner. I separated concerns a bit better, split things up into different files, tried to keep things tidy. But once again, I fell into the same trap. It became spaghetti.

I started over, again, before wasting more time.

The final version

I had to scrap everything, and I mean everything. I had to find a new method altogether, and rethink the whole process. One late night it hit me, and I started by identifying the root issues and solving them one by one.

String management

Instead of using exec and one regex to step through and tokenize the input string, I created my own StringStream class which would keep track of where we were in the string and expose a set of string functions to manipulate it. These functions allowed me to match a String or RegExp object against the input, which would be consumed as it was matched.

To put it in code:

const string = 'foo bar baz';

const stringStream = new StringStream(string);

// Match and consume `foo` at the start of the string
stringStream.match(/^foo/);

// The string now looks like this: ` bar baz`

// Proceed and match ` bar`
stringStream.match(' bar');

// The string is now: ` baz`
// etc...
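
For illustration, a minimal version of such a class could look something like this (the real StringStream exposes quite a few more helpers):

class StringStream {
  constructor(input) {
    this.input = input;
    this.pos = 0; // how far into the original string we have consumed
  }

  // Match a String or RegExp at the current position and consume it.
  // Returns the matched text, or null if there was no match.
  match(pattern) {
    const rest = this.input.slice(this.pos);

    if (typeof pattern === 'string') {
      if (rest.startsWith(pattern)) {
        this.pos += pattern.length;
        return pattern;
      }
      return null;
    }

    const result = pattern.exec(rest);
    if (result && result.index === 0) {
      this.pos += result[0].length;
      return result[0];
    }
    return null;
  }

  // Have we consumed the entire input?
  eof() {
    return this.pos >= this.input.length;
  }
}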

This was the foundation the tokenizers would work off of.

Tokens

In previous iterations, each token was simply represented by an object returned by the parser. The object would have a type, which would be one of the values predefined in a global RegexTypes object. It would look something along the lines of:

return {  
  type: RegexTypes.GROUP,
  pos: {
    start: 0,
    end: 3
  },
  quantifiable: true
  // ...
};

This was neither very maintainable nor DRY. I had to manually specify fields, and if I wanted to return the same type of token somewhere else, I would have to copy it entirely. Instead I decided to create a base Token class which would contain all the default information for any given token, and which all other tokens would extend. It would keep the common data in one place, so each token only had to worry about what was important to it.

Some of the things it handles:

  • What position it was matched at
  • What its string representation is
  • What character code it has (if any)
  • If it's quantifiable
  • And much more

Each flavor could now easily implement its own specific token without having to re-implement a bunch of basic fields and features.
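
To sketch the idea (the class layout and field names here are illustrative, not the actual ones):

// Stub for the example; the real RegexTypes object holds many more types
const RegexTypes = { GROUP: 'GROUP', META_ESCAPE: 'META_ESCAPE' };

// Base class: keeps the common data every token needs in one place
class Token {
  constructor({ type, text, pos } = {}) {
    this.type = type;          // e.g. RegexTypes.GROUP
    this.text = text;          // the raw string representation
    this.pos = pos;            // { start, end } position in the input
    this.quantifiable = true;  // sensible default; subclasses override it
  }

  get length() {
    return this.pos.end - this.pos.start;
  }
}

// A flavor-specific token only has to describe what makes it special
class PCREMetaEscape extends Token {
  constructor(options = {}) {
    super({ ...options, type: RegexTypes.META_ESCAPE });
  }
}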

Tokenizers

Each flavor implements its own tokenizer, which defines an array of functions. These functions are what actually run matches against the string and figure out what is what. The order these functions are added to the array is the order they will be executed in. It is crucial that they are executed in a specific order, since certain tokens have precedence over others.

The functions are given two crucial arguments:

  1. The string stream
  2. The state

The state is cloned upon each iteration, and can be modified freely by any token created. The state keeps track of anything you can think of:

  • The parent of this token
  • How many capturing groups there are
  • Which named subpatterns exist
  • If we are inside a lookbehind
  • The previous token
  • etc
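
As a rough illustration, the state might carry something along these lines (the field names are made up for this example):

// Illustrative shape of the parser state
const state = {
  parent: null,          // the token this new token will be attached to
  captureGroupCount: 0,  // how many capturing groups have been opened
  namedGroups: [],       // named subpatterns seen so far
  inLookbehind: false,   // whether we are currently inside a lookbehind
  previousToken: null    // the token matched just before this one
};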

The tokenizer itself would end up looking something like this:

class PCRETokenizer extends Tokenizer {

  getTokenizers() {
    // The order of this array matters: earlier functions take precedence
    return [
      this._parseMetaEscape,
      // this._parseLookaround,
      // ...
    ];
  }

  _parseMetaEscape(stream, state) {
    if (stream.match(/^\\[...]/)) {
      return new PCREMetaEscape();
    }
  }

  // ...
}

I still use plenty of regular expressions in my parser to parse any regular expression, but now they are split up into tiny manageable pieces instead of being lumped together into one huge mess like before.

The Regex Parser

The final piece of the puzzle was the RegexParser. This is what ties everything together; it executes the functions returned by the tokenizer against the StringStream and returns a parse tree.
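
Stripped of all details, the driving loop boils down to something like this (a simplified sketch, assuming the stream exposes an eof() check and a pos offset; the real parser builds a proper tree with parent/child relationships rather than a flat token list):

function runTokenizers(stream, state, tokenizers) {
  const tokens = [];

  while (!stream.eof()) {
    // Work on a fresh copy of the state each iteration, so a tokenizer
    // can modify it freely for the tokens that follow.
    state = { ...state };

    // Try the tokenizer functions in order; the first one that matches
    // (and thereby consumes part of the stream) produces the token.
    let token = null;
    for (const tokenize of tokenizers) {
      token = tokenize(stream, state);
      if (token) {
        break;
      }
    }

    if (!token) {
      throw new Error('Unable to tokenize input at position ' + stream.pos);
    }

    tokens.push(token);
  }

  return tokens;
}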

It only exposes one public method, parse, which is used as follows:

const regexParser = new RegexParser();

const tree = regexParser.parse({
  flavor: Flavors.PCRE,
  delimiter: '...',
  // ...
});

// Now we can use `tree` to colorise the regex or explain it

This is what regex101 currently runs. The code is performant, easy to extend and has proven very easy to maintain with a minimal amount of headaches. Today the website supports the following flavors:

  • PCRE
  • Javascript
  • Python
  • Golang

And, as far as I can tell, it is completely bug-free, with 99% of the PCRE spec covered.

If there is anything you'd like me to write about in the future, feel free to send me an email or a tweet and I'd be happy to look into it.

Thanks for reading!

Firas Dib
