HTML Sanitisation: The Devil's In The Details (And The Vulnerabilities)

Note: This article was originally published at Planet PHP on 10 August 2010.
HTML Sanitisation (defined below) has been with us for a long time, ever since the first genius came up with the idea of allowing potentially untrustworthy third-party HTML to be dynamically patched into their own markup. The years have not been kind to that idea, and third-party HTML inclusion remains one of the most complex and underappreciated vectors for security vulnerabilities.

In this article, I take a look at some of the solutions PHP developers rely upon to perform HTML Sanitisation, mostly because few others have examined or written about such solutions in any great detail (at least publicly). HTML Sanitisation has a very low profile in PHP. It's rarely mentioned and usually not understood all that well, so examining some of the solutions in this area with more deliberate attention is worth doing. It's also valuable research, since I am writing my own HTML Sanitisation library (bias alert!) for a future Zend Framework 2.0 proposal. Knowing what the competition is up to does no harm! Finally, I was simply curious. Nobody seems too pushed to look closely at all these HTML Sanitisation solutions, despite the fact that there are other developers (I think) who wouldn't touch most of them with a barge pole.

One somewhat remarkable example, just to illustrate why I figured this article was worth the time, is HTMLPurifier's Comparison analysis, where HTMLPurifier is compared against a number of other HTML Sanitisers. The comparison is remarkable because it seems inclined to err on the side of giving HTMLPurifier's competitors the benefit of the doubt. Unfortunately, this means the analysis is often flawed and its conclusions suspect. Also, by assuming safety, it helps legitimise other solutions in the minds of readers. Not that this reflects on HTMLPurifier's functionality, incidentally; I have always maintained that it is the only HTML Sanitiser worth looking at.

Back on track...

What is HTML Sanitisation?



HTML is an amazingly dangerous thing. It can contain JavaScript, CSS, malformed markup, or even gigantic images that laugh at your dual 32" monitor setup. Each of these, in its own way, can damage the experience of an end user, whether by Cross-Site Scripting (XSS), Phishing, or simply mangling the page until it's unusable and/or defaced with script-kiddie jibes.

There are two ways of dealing with these threats to the HTML output of an application: escape output so that the only HTML rendered by the browser is the application's (anything else being neutered by HTML entities), or sanitise output so that any additional browser-renderable HTML it contains is stripped of any potentially damaging markup. This article concerns the second option.
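To make the first option concrete, here is a minimal sketch of output escaping. It is written in Python using only the standard library's html.escape (the counterpart of PHP's htmlspecialchars()), purely as an illustration; the function name escape_untrusted and the sample input are my own assumptions, not taken from any library discussed in this article.

```python
from html import escape


def escape_untrusted(text: str) -> str:
    """Return text with HTML special characters replaced by entities."""
    # quote=True converts double and single quotes as well as &, < and >,
    # so the browser displays the input as literal characters, not markup.
    return escape(text, quote=True)


# Hypothetical untrusted input carrying an inline event handler.
untrusted = '<img src="x" onerror="alert(1)">Hello'

print(escape_untrusted(untrusted))
# prints: &lt;img src=&quot;x&quot; onerror=&quot;alert(1)&quot;&gt;Hello
```

Escaping is the easy case precisely because nothing from the third party survives as markup. Sanitisation has to let some of that markup through, which is where the difficulty starts.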

HTML Sanitisation may therefore be defined as any means of filtering HTML to ensure that a) Cross-Site Scripting (XSS) vulnerabilities are removed, b) Phishing vulnerabilities are removed, c) the HTML is well formed and adheres to an acceptable HTML standard, and d) the HTML contains no obvious means of breaking expected web page rendering.

I won't claim this is a perfect definition but it covers most of the salient points you'll likely encounter.

So there are, broadly speaking, four primary objectives of HTML Sanitisation, any one of which is capable of preventing damage to end users or web application functionality (including JavaScript-powered client-side functionality). Each is, in its own way, quite a difficult proposition requiring suitable tools and specialised knowledge. However, with some objectives we can measure our success somewhat reliably. The question for this article is: how well do HTML Sanitisers in PHP measure up to these objectives?

The Candidates



Since this is intended as a brief examination (just a few million words long!), I decided to select four candidate HTML Sanitisers meeting certain conditions. These conditions included:

1. Having a release at some point in the past two years;
2. Describing itself as an HTML sanitiser/filter to prevent Bad Things;
3. Having a design clearly in line with an intent to filter XSS/Phishing; and
4. Having no publicly acknowledged long standing security vulnerabilities.

The great part about applying these conditions is that I pretty much eliminated stacks of HTML Sanitisers (as some might claim them to be). Beyond those, it also eliminates anything users might misconstrue as an HTML Sanitiser (for example PHP's strip_tags() function or Zend Framework's Zend_Filter_StripTags class). What we are left with is pretty thin on the ground, but fits what I'd expect a reasonably educated PHP developer to swing with. From what remained, I selected four candidates (or maybe these were the only four left

Truncated by Planet PHP, read more at the original (another 23169 bytes)