PhpRiot
News Archive
PhpRiot Newsletter
Your Email Address:

More information

The true power of regular expressions

Note: This article was originally published at Planet PHP on 15 June 2012.
Planet PHP

As someone who frequents the PHP tag on StackOverflow I pretty often see questions about how to parse some particular aspect of html using regular expressions. A common reply to such a question is:

You cannot parse html with regular expressions, because html isn't regular. Use an XML parser instead.

This statement - in the context of the question - is somewhere between very misleading and outright wrong. What I'll try to demonstrate in this article is how powerful modern regular expressions really are.

What does aoregulara actually mean?

In the context of formal language theory, something is called aoregulara when it has a grammar where all production rules have one of the following forms:

B - a B - aC B - Iu

You can read those - rules as aoThe left hand side can be replaced with the right hand sidea. So the first rule would be aoB can be replaced with aa, the second one aoB can be replaced with aCa and the third one aoB can be replaced with the empty stringa (Iu is the symbol for the empty string).

So what are B, C and a? By convention, uppercase characters denote so called aonon-terminalsa - symbols which can be broken down further - and lowercase characters denote aoterminalsa - symbols which cannot be broken down any further.

All that probably sounds a bit abstract, so let's look at an example: Defining the natural numbers as a grammar.

N - 0 N - 1 N - 2 N - 3 N - 4 N - 5 N - 6 N - 7 N - 8 N - 9 N - 0N N - 1N N - 2N N - 3N N - 4N N - 5N N - 6N N - 7N N - 8N N - 9N

What this grammar says is:

A natural number (N) is ... one of the digits 0 to 9 or ... one of the digits 0 to 9 followed by another natural number (N)

In this example the digits 0 to 9 would be terminals (as they can't be broken down any further) and N would be the only non-terminal (as it can be and is broken down further).

If you have another look at the rules and compare them to the definition of a regular grammar from above, you'll see that they meet the criteria: The first ten rules are of the form B - a and the second ten rules follow the form B - aC. Thus the grammar defining the natural numbers is regular.

Another thing you might notice is that even though the above grammar defines such a simple thing, it is already quite bloated. Wouldn't it be better if we could express the same concept in a more concise manner?

And that's where regular expressions come in: The above grammar is equivalent to the regex [0-9]* (which is a hell lot simpler). And this kind of transformation can be done with any regular grammar: Every regular grammar has a corresponding regular expression which defines all its valid strings.

What can regular expressions match?

Thus the question arises: Can regular expressions match only regular grammars, or can they also match more? The answer to this is both yes and no:

Regular expressions in the formal grammar sense can (pretty much by definition) only parse regular grammars and nothing more.

But when programmers talk about aoregular expressionsa they aren't talking about formal grammars. They are talking about the regular expression derivative which their language implements. And those regex implementations are only very slightly related to the original notion of regularity.

Any modern regex flavor can match a lot more than just regular languages. How much exactly, that's what the rest of the article is about.

To keep things simple, I'll focus on the PCRE regex implementation in the following, simply because I know it best (as it's used by PHP). Most other regex implementations are quite similar though, so most stuff should apply to them too.

The language hierarchy

In order to analyze what regular expressions can and cannot match, we first have to look at what other types of languages there are. A good starting point for this is the Chomsky hierarchy:

Chomsky hierarchy: /-------------------------------------------\ | | | Recursively enumerable languages | Ty

Truncated by Planet PHP, read more at the original (another 28133 bytes)