I'm working on an app where I need to parse some HTML. This is the first time I've had to do screen-scraping with Groovy. After a bit of trial and error I think I'm getting the hang of it. The HTML I'm working with isn't well-formed, so the default Groovy XmlSlurper and XmlParser puke. After some digging I found TagSoup. It "parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short".
It made my parsing much easier. Thanks John Cowan!
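For anyone trying the same thing, here is a minimal sketch of plugging TagSoup into XmlSlurper. It is not the code from my app: the sample markup and variable names are made up for illustration, and it assumes TagSoup 1.2.1 can be pulled from Maven Central through Grape and that you are on Groovy 3 or later.

```groovy
// Assumption: TagSoup 1.2.1 is resolvable via Grape (needs network access on first run).
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper          // Groovy 3+; on 2.x drop this line, groovy.util.XmlSlurper is a default import
import org.ccil.cowan.tagsoup.Parser

// Hypothetical malformed page: unclosed <p> tags, an unquoted attribute, no closing </html>.
def messyHtml = '<html><body><p>First<p>Second <a href=foo.html>a link</a></body>'

// Hand TagSoup's SAX parser to XmlSlurper so the tag soup is repaired while parsing.
def slurper = new XmlSlurper(new Parser())
def page = slurper.parseText(messyHtml)

// Walk the whole tree depth-first and collect the href of every anchor.
def links = page.'**'.findAll { it.name() == 'a' }
def hrefs = links.collect { it.@href.text() }
println hrefs
```

Once the document is parsed, the usual GPath navigation works the same as it would on well-formed XML.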
A blog (mostly) about nothing by a software engineer that loves to spend time with his family, golf, run, play hockey and discuss all things related to science, politics and our civilization.
Tuesday, December 29, 2009
Groovy XmlSlurper and HTTP 503 Response Code
I struggled a bit when trying to parse some XHTML with Groovy's XmlSlurper (and XmlParser). I was receiving the following:
Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
It turns out that the guys from W3C got sick of dealing with the excessive traffic for their DTDs, so now they return Service Unavailable (HTTP 503) when they detect that a request comes from an XML parser.
To solve the problem I had to turn off the loading of external DTDs. Here's the code.
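// Tell the parser not to fetch the DTD named in the DOCTYPE (the request that gets the 503)
def slurper = new XmlSlurper()
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
// htmlResponse holds the XHTML text that was fetched earlier
def results = slurper.parseText(htmlResponse)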
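If you want to try the fix without a live page, something like the following should do. It is only a sketch: the XHTML string and the variable names are invented for the example, the import assumes Groovy 3 or later, and the feature is specific to Xerces-based parsers such as the one bundled with the JDK.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on 2.x drop this line, groovy.util.XmlSlurper is a default import

// Hypothetical XHTML response carrying the DOCTYPE that triggers the DTD download.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample page</title></head>
  <body><p id="intro">Hello</p></body>
</html>'''

def slurper = new XmlSlurper()
// Xerces-specific feature: do not load the external DTD referenced by the DOCTYPE.
slurper.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

def html = slurper.parseText(xhtml)

// GPath navigation works as usual once the document parses.
def title = html.head.title.text()
def intro = html.body.p.find { it.@id.text() == 'intro' }.text()
println "$title / $intro"
```

The sample avoids named entities such as &nbsp; on purpose, since those are declared in the DTD that is no longer being loaded.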
Googling for the answer wasn't very helpful. This blog post helped (I think it's in Japanese). This post also helped. Thanks guys!
I decided to re-post the solution since it took me a while to find it.
Labels: Groovy, HTTP response code: 503, XML, XmlParser, XmlSlurper
Sunday, December 27, 2009
The Science of Avatar
An interesting read on the science of Avatar. I still haven't seen the movie; just too much going on with the holidays.
Tuesday, December 8, 2009
(Near) Real-Time Analytics
At my new gig, I've been asking whether the team has considered the possibility of using map/reduce or a similar grid-based solution to conduct our analytics in (near) real-time. Interestingly enough, I ran across Nati Shalom's post on real-time analytics yesterday. This should help give me some ammunition to convince everyone that we need to move in this direction for the solution we're building. Thanks Nati!
A Feast for Crows
I just finished re-reading George R. R. Martin's A Feast for Crows. I enjoyed it more than the first time I read it. My favorite is still A Storm of Swords. Now if he'd just publish A Dance with Dragons!