Eradicate XSS Once and for All

Since I'm such an advocate of doing output filtering rather than only input validation, I thought I'd start putting together some posts to help deal with some of the more severe syntactic flaws (things that can be discovered by a lexical analysis). Remember, you should still always do input validation. But this solution will take care of all cases of classic HTML injection, and with a couple of site-wide changes, you can fix character encoding injection issues as well.

There are a million different ways an attacker can encode rotten characters going in. ha.ckers.org has a whole list of them that's being updated all the time, and there's research into even more new and clever ways of performing the injection. Fortunately, with HTML injection, there are only five characters to worry about going out, and only one way to properly encode those going out.

For every dynamic output value:
- Encode " to &quot;
- Encode ' to &#39;
- Encode < to &lt;
- Encode > to &gt;
- Encode & to &amp;
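
As a rough illustration of the list above, here is a minimal Python sketch. The name encode_html is made up for this post, and the numeric reference for the single quote is my choice; treat it as a sketch of the idea rather than a replacement for your framework's own encoder.

```python
# Hypothetical helper: encode_html is an illustrative name, not a library function.
# Maps each of the five HTML-significant characters to its encoded form.
_HTML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
}


def encode_html(value):
    """Return value as text with the five significant characters encoded."""
    # Working character by character means an ampersand produced by an
    # earlier replacement is never encoded a second time within one call.
    return "".join(_HTML_ESCAPES.get(ch, ch) for ch in str(value))
```

Call it on every dynamic value at the moment it is written into the page, for example encode_html(user_comment), and not when the value first arrives.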

And before you go off the deep end trying to write the perfect function for dealing with this, know that your tool of choice probably already contains such a weapon. In Java, use the JSTL <c:out /> tag - it has an escapeXml attribute, which by default is set to true (so you have to deliberately turn it off). It works with Struts Action framework, and there are a couple of tools for making it work with Struts Shale. ASP (and .NET) have Server.HTMLEncode. Rails has h(), PHP has htmlentities() (pass ENT_QUOTES so that single quotes get encoded too; older PHP versions leave them alone by default).
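
Python isn't on that list, but it makes the same point: the standard library ships html.escape, which covers all five characters when quote=True. The function render_greeting and its markup below are assumptions made up for this sketch.

```python
import html


def render_greeting(name):
    """Build a small HTML fragment around an untrusted value (illustrative only)."""
    # Encode at the point of output; the stored or submitted value stays untouched.
    return "<p>Hello, " + html.escape(name, quote=True) + "!</p>"
```

Note that html.escape writes the single quote as &#x27;, which is just another spelling of the same character reference.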

To be perfectly safe, you should also deal with characters outside the range of your character encoding (another reason I'm a fanboy of UTF-8). You should also specify the output encoding (UTF-8, ISO-8859-1 (blech!), UTF-16, etc.) so that characters are properly identified in the range. When using 8-bit encodings on output, beware of how you dealt with the characters coming in. In Java, for example, strings are assumed to be Unicode when read, so it's possible you received double-byte inputs and are then presenting those double bytes as pairs of single bytes. An attacker could send the characters 0x3c73 0x6372, which you represent on output as 0x3c 0x73 0x63 0x72, that is, <scr. I think you get the picture.
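
The double-byte problem can be reproduced in a few lines. This sketch is in Python rather than Java, and it assumes the 16-bit values are written out big-endian and then read back one byte at a time as ISO-8859-1. The variable names are mine.

```python
# Two 16-bit characters an attacker might submit; neither one is '<'.
payload = "\u3c73\u6372"

# Writing them out as big-endian 16-bit units gives the bytes 3c 73 63 72.
raw = payload.encode("utf-16-be")

# A consumer that treats every byte as one ISO-8859-1 character now sees markup.
misread = raw.decode("iso-8859-1")

# Encoding and decoding with the same declared charset keeps the characters intact.
round_trip = payload.encode("utf-8").decode("utf-8")
```

Here misread comes out as the start of a script tag even though the original string contained no angle bracket, which is why the declared output encoding has to match the one actually used to write the bytes.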

Before you do this, you should already be applying a level of indirection. And for escaping script inside script, I recommend using hidden form variables, then HTML encoding those, and using the script to pull in the value from the form - rather than injecting dynamic values in the script itself.
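
Here is a minimal sketch of the hidden-field approach, assuming a server-side function that builds the page fragment. The function name render_page and the element id user-data are made up for illustration.

```python
import html


def render_page(user_value):
    """Return a fragment that hands user_value to script via a hidden field."""
    # The dynamic value appears only inside an HTML attribute, fully encoded.
    field = '<input type="hidden" id="user-data" value="%s">' % html.escape(
        user_value, quote=True
    )
    # The script body is static text; it reads the value through the DOM
    # instead of having the value pasted into the script source.
    script = (
        "<script>\n"
        "var userValue = document.getElementById('user-data').value;\n"
        "</script>"
    )
    return field + "\n" + script
```

Because the script text never changes, an attacker-supplied value has no chance to close a string literal or the script element itself.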

Not only does this stop HTML Injection, but it's something you need to be doing anyway. If you plan to make your site XHTML compliant, you must do this for dynamic values so that the XML doesn't get busted.