Allowing Script and HTML Content from untrusted sources


I think I've said a billion times that the MySpace model of allowing HTML and/or script is an exception, not a rule. However, it seems the exceptions are getting more and more prominent as businesses are driven to allow dynamic content from their customers in order to help the company's bottom line.

A colleague today asked me (without reading here first - shame, shame) how a company he is helping could allow markup and script, but not just any old markup or script. Of course, my first response is "why?", but I keep that to myself. My second response is always very aggressive whitelisting of what you believe is acceptable.

I think the colleague has pretty much told their customer that XHTML with a restrictive DTD (or even better, XML Schema or RELAX NG) is the best starting place. This way, you can rely on the schema and your really excellent processor to define what is and isn't allowed. Granted, you'll have to disallow entity definitions and other constructs that could be used for denial-of-service attacks against the XML processor. But then you're left with a whitelist of the tags available.
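To make the idea concrete, here is a minimal sketch in Python. It is not a real DTD, XML Schema or RELAX NG validator: a hand-written dictionary stands in for the schema, and the tag and attribute names in it are assumptions picked for illustration. What it does show is the shape of the check: refuse any DOCTYPE (and with it any entity definition), then refuse any tag or attribute that isn't on the list.

```python
import xml.parsers.expat

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
# A real deployment would express this in a schema instead.
ALLOWED = {
    "div": set(),
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href"},
    "img": {"src", "alt"},
}


class RejectedMarkup(ValueError):
    """Raised when submitted markup falls outside the whitelist."""


def check_markup(markup: str) -> None:
    """Raise RejectedMarkup unless markup is well-formed and whitelisted."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # No DOCTYPE means no user-supplied entity definitions.
        raise RejectedMarkup("DOCTYPE declarations are not allowed")

    def on_start(tag, attrs):
        if tag not in ALLOWED:
            raise RejectedMarkup(f"tag not allowed: {tag}")
        extra = set(attrs) - ALLOWED[tag]
        if extra:
            raise RejectedMarkup(f"attributes not allowed on {tag}: {sorted(extra)}")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    try:
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as exc:
        raise RejectedMarkup(f"not well-formed: {exc}") from exc


def is_acceptable(markup: str) -> bool:
    """Convenience wrapper that returns a boolean instead of raising."""
    try:
        check_markup(markup)
    except RejectedMarkup:
        return False
    return True
```

Note that this sketch also rejects namespace declarations such as xmlns, since they arrive as ordinary attributes; a real schema would handle them properly.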

Once you have a really scaled-down, well-defined list of what is allowed, you can then go through the attributes and perform additional whitelisting. Suppose we don't allow javascript: URLs in href attributes - we can just check the href attributes in the DOM (we know we have a valid DOM and a finite set of attributes to test since it passed the schema), and verify that they all begin with http:// or https:// . The src attributes on img tags would work similarly.
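A rough sketch of that attribute pass, again in Python. It assumes the markup has already passed schema validation (so parsing it into a tree is safe), and the mapping of tags to URL-bearing attributes is my own assumption. The comparison is lower-cased because URL schemes are case-insensitive.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of tag -> the attribute on it that carries a URL.
URL_ATTRIBUTES = {"a": "href", "img": "src"}
ALLOWED_PREFIXES = ("http://", "https://")


def bad_urls(root):
    """Return (tag, value) pairs whose URL does not start with an allowed prefix."""
    bad = []
    for element in root.iter():
        attribute = URL_ATTRIBUTES.get(element.tag)
        if attribute is None:
            continue
        value = element.get(attribute)
        # Whitelist check: anything that isn't plainly http(s) is rejected,
        # which covers javascript:, data:, and leading-whitespace tricks.
        if value is not None and not value.lower().startswith(ALLOWED_PREFIXES):
            bad.append((element.tag, value))
    return bad
```

An empty list means every link and image passed; anything else gets the submission rejected, rather than "cleaned up" and let through.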

For those things that do need scripting, you define new sets of tags. This is not dissimilar to Blogger's plugin model. They allow scripting, but they control the script - you have to put script in by using their pre-defined plugin tags, which (probably) go through an XSLT stylesheet that translates them into script they can deal with.
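Here is a small sketch of that translation step. To be clear about the assumptions: the widget tag, its name attribute and the script paths are all made up for illustration and are not Blogger's actual tags, and I'm using plain Python tree manipulation in place of XSLT since the standard library has no XSLT processor. The point is only that the customer picks from a menu while the site owns the script.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu of site-authored scripts, keyed by widget name.
WIDGET_SCRIPTS = {
    "clock": "/static/widgets/clock.js",
    "visitor-counter": "/static/widgets/counter.js",
}


def expand_widgets(root):
    """Replace each <widget name="..."/> child with a site-controlled <script>."""
    # Snapshot the elements first so the tree can be edited while we walk it.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "widget":
                continue
            src = WIDGET_SCRIPTS.get(child.get("name"))
            if src is None:
                raise ValueError(f"unknown widget: {child.get('name')!r}")
            script = ET.Element("script", {"src": src})
            script.tail = child.tail  # keep any text that followed the widget
            parent[index] = script
    return root
```

The customer never writes a script element (the whitelist would reject it); they write the widget tag, and only this step, which the site controls, ever emits script. The sketch does not handle a widget as the root element.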

It's not going to be without work, but I think my colleague is going to be able to propose a solution to their customer that will be quite secure, and still give the clients the control over their content they crave. Thanks be to Blogger for the model. Blogger is certainly not the only site that operates this way, but it beats the MySpace model - allow anything until a worm starts, then disallow the exact vector that created the worm.

Seems this is precisely what XML and XSLT were invented for...

1 comment:

  1. SHAMELESS SELF-PLUG: For an already existing, well-tested implementation of the concept, see HTML Purifier. There are some major deficiencies in DTDs, XML Schema and RELAX NG that have convinced me not to use them for implementing anti-XSS facilities; these mainly concern implementing custom validation routines for things inside attributes and content models.