PhpRiot
News Archive
PhpRiot Newsletter
Your Email Address:

More information

The Prime That Wasn't

Note: This article was originally published at Planet PHP on 4 August 2010.
Planet PHP

I've seen a lot of problems solved with regular expressions, but last Friday, thanks to Chris and Sean, I found out that one can use them to determine whether a given integer is a prime number. The original articles showed the following regular expression:

/^1?$|^(11+?)\1+$/

You don't apply this to the integer itself. Instead, you create a string of repeating 1s, where the numbers of 1s is defined by the number you're testing, and apply the regex to that. If it fails to match, the number is a prime. The regex relies on backtracking and thus would not work with a DFA-based engine, such as PHP's (now deprecated) ereg* functions. But, it works just fine with the preg_* ones, or so I thought (more on that in a bit).

So, how does it work? It may seem like a brain buster, but it's actually pretty simple. Ignore the portion before the alternation operator (|), because it's just a short-circuit case for checking whether the input string is empty or has a single 1. (If the string is empty, it means the number being tested is 0. If the string is a single 1, it means the number being tested is 1. Neither of these are prime.) The portion after the pipe is where the real magic happens.

The (11+?) sub-pattern matches strings 11, 111, etc. The \1+ matches whatever the sub-pattern matched, one or more times. So on the first try, the engine will match 11 and then attempt to match the same thing one or more times in the rest of the string. If it succeeds, the number is not a prime. Why? Because it just proved that the length of the string is divisible by 2 (the length of 11), therefore the original number is divisible by 2. If the overall match fails, the engine backtracks to the beginning and tries to match 111 two or more times and so on successively. If the first sub-pattern gets long enough (n/2 basically), and the overall match still fails, the number is a prime. Beautiful, isn't it?

Coincidentally, Sean recently created a code evaluation plugin for the Phergie-based IRC bot that runs in the channel we hang out in. The plugin is a simple proxy to ideone.com, but is helpful for doing quick code tests. We had fun experimenting with this regex pattern implemented in a PHP function that returns the next largest prime number after a given one. The trouble started when Sean fed it 99999999, and the code spit out 100000001. This didn't seem right, and Wolfram Alpha confirmed; the answer we got was not a prime. (It factors into 17 * 5882353.)

A few similar tests also gave us numbers that were not prime. But, where was the problem? The PHP code was too simple to have a bug, many answers were correct, and the regex algorithm itself was sound. It was time for brute force. I quickly wrote some code to feed increasing odd integers into into regex pattern and check its answers against the normal sieve algorithm to see where it starts failing. The first number to fail was 22201; the regex said it was prime, but it's actually a perfect square (1492). Past that, the failure rate increased.

It then dawned on me that the culprit might be the backtracking itself, specifically how it was implemented in PCRE, the library at the heart of PHP's regular expressions. As I mention in my Regex Clinic talk, unbounded backtracking can dramatically slow down the matching and should be well-understood before attempting to write complex patterns. To manage the danger of runaway patterns, PCRE implemented the pcre.backtrack_limit setting a few years ago. In our case, backtracking is used to break up the string of 1s into successively larger chunks, and, with very long strings, the engine may run into this backtracking limit, which is set to 100000 by default. My guess was that with the 22201-character-long string, the default was not enough. Once the limit is reached, the match fails, and the number is declared prime.

I bumped the limit up to 200000, and voilA, 22201 was no longer declared prime. To fix the 100000001 match, the limit had to be raised dramatically, to around 250000000! And, the program took 14 seconds to deliver the verdict on my new i5 MacBook Pro. Needless to say, don't use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance. I certainly did, and was glad that my bit of investigative work showed that abstract, pure, good ideas can still fail in the messy world of software and hardware.