Tuesday, December 29, 2009

HTML Parsing With Groovy and TagSoup

I'm working on an app where I need to parse some HTML. This is the first time I've had to do screen-scraping with Groovy, and after a bit of trial and error I think I'm getting the hang of it. The HTML I'm working with isn't well-formed, so Groovy's default XmlSlurper and XmlParser puke on it. After some digging I found TagSoup. It "parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short".

It made my parsing much easier. Thanks John Cowan!
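For anyone who wants to try it, here is a minimal sketch of how TagSoup can be plugged into XmlSlurper. It is not the exact code from my app: the sample markup and variable names are made up for illustration, and the @Grab coordinates and version (1.2.1) are an assumption about what is available from Maven Central.

```groovy
// Assumption: TagSoup is fetched via Grape; adjust the version as needed.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
// Assumption: Groovy 3 or later. On older versions drop this import,
// since XmlSlurper is available there by default (groovy.util).
import groovy.xml.XmlSlurper

// Hypothetical sample input: unclosed <p> and <b> tags, no closing </html>.
def malformed = '<html><body><p>First<p>Second <b>bold</body>'

// Hand XmlSlurper the TagSoup parser in place of the default XML parser.
def slurper = new XmlSlurper(new Parser())
def html = slurper.parseText(malformed)

// TagSoup repairs the markup, so normal GPath navigation works.
html.body.p.each { println it.text() }
```

The only change from ordinary XmlSlurper usage is the constructor argument; everything after parseText is the usual GPath navigation.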

Groovy XmlSlurper and HTTP 503 Response Code

I struggled a bit when trying to parse some XHTML with Groovy's XmlSlurper (and XmlParser). I was receiving the following:

Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

It turns out that the W3C got sick of dealing with the excessive traffic for their DTDs, so now they return a Service Unavailable (HTTP 503) when they detect requests coming from parsers.

To solve the problem I had to disable the loading of external DTDs. Here's the code.

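def slurper = new XmlSlurper()
// Don't fetch the external DTD named in the DOCTYPE (avoids the 503 from w3.org)
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
// htmlResponse is the XHTML text retrieved earlier
def results = slurper.parseText(htmlResponse)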

Googling for the answer wasn't extremely helpful. This blog post helped (I think it's in Japanese). This post also helped. Thanks guys!

I decided to re-post the solution since it took me a while to find the answer by googling.

Sunday, December 27, 2009

The Science of Avatar

An interesting read on the science of Avatar. I still haven't seen the movie; just too much going on with the holidays.

Tuesday, December 8, 2009

(Near) Real-Time Analytics

At my new gig, I've been asking whether the team has considered the possibility of using map/reduce or a similar grid-based solution to conduct our analytics in (near) real-time. Interestingly enough, I ran across Nati Shalom's post on real-time analytics yesterday. This should help give me some ammunition to convince everyone that we need to move in this direction for the solution we're building. Thanks Nati!

A Feast for Crows

I just finished re-reading George R. R. Martin's A Feast for Crows. I enjoyed it more than the first time I read it. My favorite is still A Storm of Swords. Now if he'd just publish A Dance with Dragons!