20070823

I've Said it Before, but...

Now, ordinarily, I hate screen-scraping. If there's any other way to get the raw data, I go there first. I go through whatever (ethical) channels necessary to get direct access to the source data, whether it's in a relational database, LDAP, XML, straight text, spreadsheets, or made up out of nowhere. I can't stand screen-scraping because it's normally so sensitive to change. Screen-scraping HTML is generally not as bad as telnet or green-screen, but it's still bad enough that I try to avoid it - particularly when the HTML is malformed.

But today was another occasion where it was necessary, simply because I couldn't get access to the systems I needed in the timeframe I needed. What made this one more difficult was that the source HTML had too few line breaks for plain text parsing, wasn't well-formed XML, and, to boot, was badly enough formatted that an event-based HTML parser simply wasn't going to work out.

Fortunately, I've done a little screen-scraping with Groovy before, and this task wasn't significantly more difficult than those. And again, NekoHTML came to the rescue. NekoHTML takes poorly-formatted HTML (in this case, really poorly formatted), balances unbalanced tags (had plenty), closes unclosed tags (had many), quotes unquoted attribute values (lots of those today), and gives sensible default values to value-less attributes (a bunch of those, too). What results is actually well-formed (not necessarily validating, but well-formed) XML, which you can parse with any ordinary XML parser.
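
The setup is tiny. Roughly - assuming NekoHTML is on the classpath, and with a made-up URL - it looks like this:

import org.cyberneko.html.parsers.SAXParser

// Hand NekoHTML's SAX parser to Groovy's XmlParser; the tag soup comes
// back as an ordinary tree of groovy.util.Node objects.
def page = new XmlParser(new SAXParser()).parse('http://example.com/messy-report.html')

println page.name()   // HTML - NekoHTML upper-cases element names by default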

In this case, I used XmlParser, which lets me write very nice GPath queries. GPath works similarly to XPath, but makes really complicated paths easy to express. For example, in English: "find me the text in all the <strong> tags that are under <a> tags whose 'href' attribute matches this regular expression." In an event-based parser, that would take a lot of work; in DOM it would be easier, but still a lot of code; and the XPath would just be nasty. In GPath, it looks like this:

texts = page.depthFirst().A.grep { it.'@href' =~ /^.*\.action\?foo=(.*)$/ }.collect { it.value.STRONG.value }

That's one line of code where the alternatives would take dozens.
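
For the curious, here's a small self-contained sketch of the whole round trip. The HTML and the foo= URLs are invented, and I've spelled the query out with findAll - an equivalent, slightly more explicit form of the one-liner above:

import org.cyberneko.html.parsers.SAXParser

// Invented tag soup: unquoted attribute values, a missing close tag
def html = '''<html><body>
<a href=/report.action?foo=1><strong>first</strong></a>
<a href=/nothing.html><strong>skip me</strong></a>
<a href=/report.action?foo=2><strong>second</strong>
</body></html>'''

def page = new XmlParser(new SAXParser()).parse(new StringReader(html))

// Walk every node, keep the A elements whose href matches, and pull
// the text out of the STRONG child. NekoHTML upper-cases tag names.
def texts = page.depthFirst()
                .findAll { it.name() == 'A' && it.'@href' =~ /^.*\.action\?foo=(.*)$/ }
                .collect { it.STRONG.text() }

assert texts == ['first', 'second']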

Now, what does this have to do with application security? For those who do black-box testing, there are times when your toolkit doesn't quite have enough in it. Your proxy is powerful, but just won't get you all the values you need. If brute-forcing authentication requires special handling, or if you've found a good SQL injection attack but the data comes back in a finicky format, scripting is often appropriate. So if you're looking for another Swiss Army knife: some are (understandably) still Perl enthusiasts, (understandably) happy with Python, or (understandably) infatuated with Ruby, but so far, Groovy has really been doing good work for me.

That being said, the GPath statements aren't specific to HTML - GPath works on XML, which is why you need NekoHTML in the first place. And NekoHTML isn't specific to Groovy - it's a Java library, so you can use it from your other Java code with whatever XML handling you prefer.
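
As a sketch of that - hypothetical URL again - the pure-Java route through NekoHTML's DOMParser looks about like this (written here in Groovy, but it's the same calls from Java):

import org.cyberneko.html.parsers.DOMParser

// NekoHTML's DOMParser is plain Java (a Xerces subclass), so you can
// hand the resulting DOM to whatever XML machinery you already use.
def parser = new DOMParser()
parser.parse('http://example.com/messy-report.html')
org.w3c.dom.Document doc = parser.document   // now use DOM, XPath, etc.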
