HTML Sanitisation: The Devil's In The Details (And The Vulnerabilities)
Note: This article was originally published at Planet PHP on 9 August 2010.
In this article, I take a look at some of the solutions PHP developers rely upon to perform HTML Sanitisation, mostly because few others have examined or written about such solutions in any great detail (at least publicly). HTML Sanitisation has a very low profile in PHP. It's rarely mentioned and usually not understood all that well, so examining some of the solutions in this area with more deliberate attention is worth doing. It's also valuable research, since I am writing my own HTML Sanitisation library (bias alert!) for a future Zend Framework 2.0 proposal. Knowing what the competition is up to does no harm! Finally, I was simply curious. Nobody seems too pushed to look closely at all these HTML Sanitisation solutions, despite the fact that there are other developers (I think) who wouldn't touch most of them with a barge pole.
One somewhat remarkable example, just to illustrate why I figured this article was worth the time, is HTMLPurifier's Comparison analysis, where HTMLPurifier is compared against a number of other HTML Sanitisers. The comparison is remarkable because it seems inclined to err on the side of giving HTMLPurifier's competitors the benefit of the doubt. Unfortunately, this means the analysis is often flawed and its conclusions suspect. By assuming safety where none has been demonstrated, it also helps legitimise other solutions in the minds of readers. Not that this reflects on HTMLPurifier's functionality, incidentally, which I have always maintained is the only HTML Sanitiser worth looking at.
Back on track…
What is HTML Sanitisation?
There are two ways of dealing with threats to the HTML output of an application: escaping output so that the only HTML rendered by the browser is the application's (anything else being neutered by HTML entities), or sanitising output so that any additional HTML it contains that is renderable by a browser is stripped of any potentially damaging markup. This article concerns the second option.
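To make the first option concrete, here is a minimal sketch of output escaping using PHP's built-in htmlspecialchars(). The $comment value and the variable names are hypothetical, chosen only for illustration. The second option, sanitisation, needs a real parser-based library and is not shown here.

```php
<?php
// Hypothetical untrusted input, e.g. a comment submitted by a user.
$comment = '<b>Nice post!</b><script>alert("XSS")</script>';

// Escaping: every HTML special character is converted to an entity, so the
// browser shows the markup as literal text rather than rendering it.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL;
// &lt;b&gt;Nice post!&lt;/b&gt;&lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt;
```

Note that the harmless <b> tag is neutered along with the script. Escaping cannot tell the two apart, which is the gap sanitisation is meant to fill.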
HTML Sanitisation may therefore be defined as any means of filtering HTML to ensure that a) Cross-Site Scripting (XSS) vulnerabilities are removed, b) Phishing vulnerabilities are removed, c) the HTML is well-formed and adheres to an acceptable HTML standard, and d) the HTML contains no obvious means of breaking expected web page rendering.
I won't claim this is a perfect definition but it covers most of the salient points you'll likely encounter.
Since this is intended as a brief examination (just a few million words long!), I decided to select four candidate HTML Sanitisers meeting certain conditions. These conditions included:
1. Having a release at some point in the past two years;
2. Describing itself as an HTML sanitiser/filter to prevent Bad Things;
3. Having a design clearly in line with an intent to filter XSS/Phishing; and
4. Having no publicly acknowledged long-standing security vulnerabilities.
The great part about applying these conditions is that I pretty much eliminated stacks of HTML Sanitisers (as some might claim them to be). Beyond those, it also eliminates anything users might misconstrue as an HTML Sanitiser (for example PHP's strip_tags() function or Zend Framework's Zend_Filter_StripTags class). What we are left with is pretty thin on the ground, but fits what I'd expect a reasonably educated PHP developer to swing with. From what remained, I selected four candidates (or maybe these were the only four left
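As a rough illustration of why strip_tags() should not be mistaken for an HTML Sanitiser, the sketch below lets a single tag through and inspects what survives. The input string is a hypothetical example, not taken from any of the libraries discussed.

```php
<?php
// Hypothetical malicious input (an illustration only).
$input = '<a href="javascript:alert(1)" onclick="alert(2)">click me</a>'
       . '<script>alert(3)</script>';

// Allow only <a> tags. strip_tags() filters by tag name alone and never
// inspects the attributes of the tags it lets through.
$output = strip_tags($input, '<a>');

echo $output, PHP_EOL;

// The <script> tags are gone, but the dangerous attributes remain.
var_dump(strpos($output, 'onclick=') !== false);    // bool(true)
var_dump(strpos($output, 'javascript:') !== false); // bool(true)
var_dump(strpos($output, '<script') === false);     // bool(true)
```

The allowed <a> tag keeps both its onclick handler and its javascript: URL, so the output is still an XSS vector even though the script element itself was stripped.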