HTML Sanitisation Benchmarking With Wibble (ZF Proposal)
Note: This article was originally published at Planet PHP on 8 July 2010.
That's the state of HTML Sanitisation in PHP - pick a big, slow library that crushes Cross-Site Scripting and Phishing attacks, or use yet another regular expression based sanitiser that a) barely manages a fraction of HTMLPurifier's features and b) can probably be exploited by any script kiddie working with a stack of data cards. It says an awful lot about security standards among PHP developers that such delusions are uncomprehendingly rampant.
In case you haven't noticed, I'm biased. Sue me.
I have opined since forever that regular expression sanitisers are nothing short of insane. Since the problem with HTMLPurifier is speed and size, I started thinking about ways to build something like HTMLPurifier that was fast, small and almost as feature packed. At first, this sounds like an impossible task. The typical suggestion is to use regular expressions, but I'm not completely insane...yet. Instead I borrowed a concept called a DOM Filter and chucked in a helpful dose of HTML Tidy. The result was Wibble.
Wibble is basically a DOM Filter. It loads HTML into PHP DOM, applies a set of filters against all nodes in the DOM, passes the output through HTML Tidy, and then hands it back to the user - sanitised and well-formed. It's almost stupid in its obviousness. Better, this allows Wibble to skip regular expression dependence. It operates far more like HTMLPurifier by relying on a DOM representation (no string parsing to funk around with) partnered with Tidy for cleanup.
Of course, there have to be regular expressions somewhere. And whitelists. And other stuff. Wibble is really an amalgamation of borrowed concepts. It's hard to be too original in HTML Sanitisation because originality is a good way to shoot yourself in the foot (hence regex is EVIL!), so I wasn't going to spend too long digging my own grave when there is a wealth of sanitisation resources in the programming world. Wibble's approach borrows elements from Ruby's Loofah, Python's html5lib, and Java's AntiSamy. Wibble mixes and matches the useful design elements each of these offers, serving them up on top of PHP's DOM and Tidy extensions with its own distinctive twists.
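To make the DOM Filter idea concrete, here is a minimal sketch of the "apply filters to every node" step. It is written in Python purely for illustration and is not Wibble's code: the whitelists, the function names and the unwrap-versus-drop rules are all assumptions. It also assumes well-formed markup and uses `xml.dom.minidom` as a stand-in DOM, which is not hardened against malicious XML, and it leaves out the Tidy pass entirely.

```python
from xml.dom import Node, minidom

# Hypothetical whitelists, chosen only for this illustration.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
DROP_WITH_CONTENT = {"script", "style"}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


def filter_element(el):
    """Remove, unwrap, or clean one element according to the whitelists."""
    parent = el.parentNode
    tag = el.tagName.lower()
    if tag in DROP_WITH_CONTENT:
        parent.removeChild(el)  # drop the element together with its content
        return
    if tag not in ALLOWED_TAGS:
        while el.firstChild:  # unwrap: keep the children, lose the tag
            parent.insertBefore(el.firstChild, el)
        parent.removeChild(el)
        return
    allowed = ALLOWED_ATTRS.get(tag, set())
    for name in list(el.attributes.keys()):
        value = el.getAttribute(name).strip().lower()
        if name.lower() not in allowed:
            el.removeAttribute(name)
        elif name.lower() == "href" and not value.startswith(SAFE_URL_PREFIXES):
            el.removeAttribute(name)  # e.g. javascript: URLs


def sanitise_node(node):
    """Visit children first, so unwrapped content has already been filtered."""
    for child in list(node.childNodes):
        if child.nodeType == Node.ELEMENT_NODE:
            sanitise_node(child)
            filter_element(child)
        elif child.nodeType == Node.COMMENT_NODE:
            node.removeChild(child)


def sanitise(fragment):
    """Parse a well-formed fragment, filter every node, and serialise it."""
    doc = minidom.parseString("<root>" + fragment + "</root>")
    root = doc.documentElement
    sanitise_node(root)
    return "".join(child.toxml() for child in root.childNodes)


if __name__ == "__main__":
    dirty = ('<p onclick="x()">Hi <script>alert(1)</script>'
             '<font color="red">there</font> '
             '<a href="javascript:alert(1)">link</a></p>')
    print(sanitise(dirty))
```

The point of the sketch is that every decision is made against nodes and attributes in a parsed tree, never against the raw string. In the pipeline described above, a cleanup pass through Tidy would follow the filtering before the result is handed back.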
I completed the first Wibble prototype recently. With the prototype at that 90% point where the remaining 10% is in-depth sanity testing, cleanup and documentation, I figured it was time to see how it compared to some other PHP solutions (HTMLPurifier and htmLawed). I had some fairly conservative performance objectives, so the results came as a pleasant surprise.
If you are a benchmark fiend, you can download and independently fiddle with my benchmark process from http://github.com/padraic/wibble-benchmarks. Note that the current benchmark uses a Wibble prototype - there are additional elements that need to be added over time. The benchmark currently uses three sample snippets of HTML: Small (blog comment size), Medium (markup heavy with limited textual content), and Big (markup light with lots of textual content). It operates by filtering each HTML sample 200 times with each benchmarked HTML sanitisation solution. Each iteration includes the instantiation and setup phases of each solution (where relevant) to reflect the most likely real world experience of using sanitisation as a one-off process (not repeated within the same request). I use PEAR's Benchmark package to record the aggregate run time per loop of sanitisation tasks. All operations occur within one single PHP process with HTMLPurifier caching enabled (Wibble and htmLawed do not use caching). Each solution is configured as closely as possible to strip all HTML from the content.
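The shape of that benchmark loop can be sketched as follows. This is not the code in the wibble-benchmarks repository, which is PHP and uses PEAR's Benchmark package. It is a Python illustration under stated assumptions: `run_benchmark`, `TextOnly` and the two stand-in "solutions" are hypothetical names, and the samples are placeholders. What it does preserve is the methodology: setup happens inside every iteration, and one aggregate time is recorded per solution and sample.

```python
import html
import time
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Stand-in sanitiser (hypothetical): keeps text, discards all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def strip(self, markup):
        self.feed(markup)
        self.close()
        return "".join(self.parts)


def run_benchmark(solutions, samples, iterations=200):
    """Return {(solution, sample): total seconds over all iterations}."""
    results = {}
    for solution_name, make_sanitiser in solutions.items():
        for sample_name, markup in samples.items():
            start = time.perf_counter()
            for _ in range(iterations):
                sanitise = make_sanitiser()  # setup is timed on every pass
                sanitise(markup)
            results[(solution_name, sample_name)] = time.perf_counter() - start
    return results


if __name__ == "__main__":
    # Each factory builds a fresh sanitiser, mimicking per-request setup.
    solutions = {
        "text-only": lambda: TextOnly().strip,
        "escape-all": lambda: html.escape,
    }
    # Placeholder samples, not the Small/Medium/Big files from the benchmark.
    samples = {
        "small": "<p>Nice post!</p>",
        "big": "<div>" + "lorem ipsum " * 500 + "</div>",
    }
    for key, seconds in run_benchmark(solutions, samples).items():
        print(key, round(seconds, 4))
```

Timing the factory call inside the loop is the design choice that matters here: it charges each solution for its instantiation cost every time, which is what a single sanitisation call per request would actually pay.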
You can view a sample result at http://gist.github.com/468426.
The results show that both Wibble and Htm