Regex html Sanitisation: Off With Its Head!

Note: This article was originally published at Planet PHP on 18 April 2010.

A long time ago someone coined the phrase Cross-Site Scripting and it became popularly abbreviated as XSS (the X was suggested to avoid confusion with CSS). XSS is a family of vulnerabilities that allows an attacker to inject arbitrary content, often JavaScript, into the output (not necessarily html) viewed by users of a web application. These injections tend to do bad things. XSS is a plague upon web applications, and not just those written in PHP.

Defeating Cross-Site Scripting

The solutions which prevent and defend against XSS in html are commonly known:

If you inject data into html (e.g. a template), and cannot be 110% sure it never crossed paths with a malicious user, you escape it. In PHP this means passing such data through a function like htmlspecialchars(), always remembering to pass the character encoding of your output as the third parameter.

An alternative exists for cases where you do not control the html markup of the output, for example, when aggregating content from RSS or Atom feeds, from web service API responses, from html emails, from user comments where html is allowed, or even from the output of html transformers such as libraries which translate BBCode, Markdown or some other intermediate format into html. The tools for this second case are usually called html Sanitisers or XSS Cleaners.
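The escaping case can be sketched in a few lines of PHP. This is a minimal illustration, not code from the original article: the helper name escape_html() and the sample comment are hypothetical, and UTF-8 is only an assumed default for the output encoding.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the article).
// It wraps htmlspecialchars() so the output encoding is always passed
// explicitly as the third parameter.
function escape_html(string $value, string $encoding = 'UTF-8'): string
{
    // ENT_QUOTES escapes both double and single quotes, so the result
    // can also be placed inside a quoted attribute value.
    return htmlspecialchars($value, ENT_QUOTES, $encoding);
}

// Untrusted data, e.g. a user comment.
$comment = '<script>alert("xss")</script>';

echo '<p>' . escape_html($comment) . '</p>', PHP_EOL;
// prints: <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

The encoding argument should match whatever encoding the page is actually served in; passing it explicitly avoids depending on the default of the PHP version in use.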

The first case is simple, easy to execute, and very difficult to bypass. Its main problem is that it requires foreknowledge of the character encoding of the output, since html special characters may be represented differently between encodings. A simple example of this encoding difference is found by comparing UTF-8 and UTF-7. Whereas UTF-8 represents every US-ASCII character with the same single byte that ASCII uses, UTF-7 can also represent those characters as shifted base64 sequences (for example, +ADw- for < and +AD4- for >). Escaping UTF-7 markup as if it were UTF-8 would cause the escaping mechanism to miss the angle brackets that enclose html tags, since they never appear as literal < and > characters. Obviously, such a failure is a potential disaster, especially if your output supports a UTF-7 encoding, or if it never specifies a character encoding at all, either via a header or an html meta tag, since this may allow some browsers (*cough* Internet Explorer) to guess the wrong encoding to use.
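The UTF-7 problem can be illustrated with a short sketch. The payload below is a hand-written UTF-7 form of a script tag and the variable names are hypothetical; whether a browser would actually execute it depends on that browser decoding the page as UTF-7, which is the assumption here.

```php
<?php
// Declare the output encoding explicitly so the browser has nothing to
// guess. The guard keeps the sketch quiet if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// "<script>alert(1)</script>" written in UTF-7: the angle brackets are
// the shifted sequences +ADw- and +AD4-, so no literal < or > appears.
$utf7Payload = '+ADw-script+AD4-alert(1)+ADw-/script+AD4-';

// Escaping as UTF-8 finds none of the characters it knows how to escape
// (&, <, >, " and '), so the payload passes through unchanged.
$escaped = htmlspecialchars($utf7Payload, ENT_QUOTES, 'UTF-8');

var_dump($escaped === $utf7Payload); // bool(true)
```

Escaping alone does nothing to this input, which is why the explicit charset declaration matters: served as UTF-8 the payload is inert text, and it only becomes markup if the browser is left to pick UTF-7 for itself.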

The second case is complex. There are no easy solutions or single-paragraph pearls of wisdom you can rely on. Instead of a simple function, you need a library of code capable of parsing html and handling character encoding differences. Then you need a friendly API so programmers aren't buried in the complexity of the task, a whitelist and whitelist limiter to defend against misconfiguration, knowledge of every html standard since the dawn of time, and up-to-the-minute advice on emerging html quirks across all browsers (even the ones you think are no longer used). After that, you are not done. You're only beginning. You're going to need a parser and lexer, a character encoding handler, an html tidier so you don't break stuff, a posse of XSS wizards to tell you when you're failing, and enough unit tests that if Sebastian Bergmann knew what you were doing, even he would jump

Truncated by Planet PHP, read more at the original (another 7584 bytes)