Fast Multiple String Replacement in PHP
Note: This article was originally published at Planet PHP on 30 September 2010.
We added a language filter to Ning Pro last month. It lets Network Creators have naughty words (for the Network Creator's definition of "naughty") replaced with * characters.
A straightforward way to do this in PHP is to pass an array of words to look for and their replacements to a function like str_replace() or str_ireplace(). Similarly, you could use a regular expression that gloms the search terms together (and potentially checks word boundaries). There are assorted WordPress plugins that work like this.
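As a rough sketch of those two straightforward approaches (the function names and the word list here are made up for illustration, and this is not code taken from any particular plugin), each might look something like this:

```php
<?php
// Illustrative helpers only; the names are hypothetical.

// Approach 1: hand str_ireplace() parallel arrays of words and masks.
function mask_with_str_ireplace(string $text, array $words): string
{
    // Build one replacement per word: asterisks of the same byte length.
    $masks = array_map(function ($w) {
        return str_repeat('*', strlen($w));
    }, $words);

    // Each word in the list is searched for in turn, case-insensitively.
    return str_ireplace($words, $masks, $text);
}

// Approach 2: glom the words into one regex with word-boundary checks.
function mask_with_regex(string $text, array $words): string
{
    // Escape each word so regex metacharacters are matched literally.
    $alternatives = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);

    $pattern = '/\b(?:' . implode('|', $alternatives) . ')\b/i';

    // Replace each match with asterisks of the same byte length.
    return preg_replace_callback($pattern, function ($m) {
        return str_repeat('*', strlen($m[0]));
    }, $text);
}

$words = ['darn', 'heck']; // hypothetical word list

echo mask_with_str_ireplace('Darned heck', $words), "\n"; // ****ed ****
echo mask_with_regex('Darned heck', $words), "\n";        // Darned ****
```

The second call leaves "Darned" alone because the \b anchors only let whole words match, which is the word-boundary checking mentioned above.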
The problem with this approach is that it's really slow, especially if you're looking for a lot of words. The time it takes to do the search and replace grows in proportion to the number of words on your list. This is particularly unfortunate because, usually, none of the words are ever found!
For our language filter, we took a different approach. We've packaged it up into a PHP extension called Boxwood and are releasing it today as open source. (Find it on GitHub: http://github.com/ning/boxwood.)
With Boxwood, your list of search terms can be as long as you like: the search and replace algorithm doesn't get slower as the list of words to look for grows. It works by building a trie of all the search terms and then scanning your subject text just once, walking down the trie and comparing its elements to the characters in your text. It supports US-ASCII and UTF-8, case-sensitive or case-insensitive matching, and has some English-centric word boundary checking logic.
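To make the trie idea concrete, here is a minimal pure-PHP sketch. It is not Boxwood's code or its API: the function names trie_build() and trie_replace() are invented for this illustration, and the sketch assumes single-byte (ASCII) text, always matches case-insensitively, and does no word boundary checking.

```php
<?php
// Illustration only: a toy trie in plain PHP, not the Boxwood extension.

// Build a trie as nested arrays keyed by single characters.
function trie_build(array $words): array
{
    $root = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        if ($word === '') {
            continue; // ignore empty search terms
        }
        $node = &$root;
        for ($i = 0, $n = strlen($word); $i < $n; $i++) {
            $ch = $word[$i];
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch]; // descend to the child node
        }
        // Multi-character marker key, so it cannot collide with a character.
        $node['$end'] = true;
        unset($node); // break the reference before the next word
    }
    return $root;
}

// Walk the text, masking the longest term that starts at each position.
function trie_replace(string $text, array $trie, string $maskChar = '*'): string
{
    $lower = strtolower($text);
    $len = strlen($text);
    $out = '';
    $i = 0;
    while ($i < $len) {
        $node = $trie;
        $matchLen = 0;
        // Follow trie edges for as long as the text keeps matching.
        for ($j = $i; $j < $len && isset($node[$lower[$j]]); $j++) {
            $node = $node[$lower[$j]];
            if (isset($node['$end'])) {
                $matchLen = $j - $i + 1; // remember the longest term so far
            }
        }
        if ($matchLen > 0) {
            $out .= str_repeat($maskChar, $matchLen);
            $i += $matchLen;
        } else {
            $out .= $text[$i];
            $i++;
        }
    }
    return $out;
}

$trie = trie_build(['darn', 'heck', 'he']); // hypothetical word list
echo trie_replace('Oh Heck, darn it', $trie), "\n"; // Oh ****, **** it
```

The work done at each position in the text depends on how far the walk gets down the trie, which is bounded by the length of the longest search term, not by how many terms are on the list. This sketch restarts the walk at every position, so it is a simplification of the single scan described above.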
Take it for a drive and let us know what you think!