Tuesday, December 29, 2009

HTML Parsing With Groovy and TagSoup

I'm working on an app where I need to parse some HTML, and this is the first time I've had to do screen-scraping with Groovy. After a bit of trial and error, I think I'm getting the hang of it. The HTML I'm working with isn't well-formed, so Groovy's XmlSlurper and XmlParser, which expect well-formed XML out of the box, puke on it. After some digging I found TagSoup, a SAX-compliant parser written in Java that "parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short".

TagSoup made my parsing much easier. Thanks, John Cowan!
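
For anyone who wants to try the same thing, here is a minimal sketch of how TagSoup can be plugged into XmlSlurper. It is not the exact code from my app: the sample markup, the example.com URLs, and the TagSoup version in the @Grab line are all assumptions made up for illustration. The import of groovy.xml.XmlSlurper assumes Groovy 3 or later; on older versions the class lives in groovy.util and needs no import.

```groovy
// Fetches TagSoup from Maven Central (assumes network access; the version is an assumption)
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
// Groovy 3+ location; on Groovy 2.x drop this import and use the default groovy.util.XmlSlurper
import groovy.xml.XmlSlurper

// Deliberately malformed sample markup (hypothetical): unclosed <p>, <a>, <br>, and <html>
def html = '''
<html><body>
<p>First paragraph
<p>Second paragraph with a <a href="http://example.com/one">link</a>
<br>
<a href="http://example.com/two">another link
</body>
'''

// Hand TagSoup's SAX parser to XmlSlurper in place of the default XML parser
def slurper = new XmlSlurper(new Parser())
def page = slurper.parseText(html)

// TagSoup balances the dangling tags, so the usual GPath navigation works.
// '**' walks the tree depth-first; collect the href of every <a> element.
def links = page.'**'.findAll { it.name() == 'a' }.collect { it.@href.text() }
links.each { println it }
```

The only TagSoup-specific part is passing new Parser() to the XmlSlurper constructor; everything after that is plain GPath. XmlParser accepts an XMLReader in its constructor in the same way, if you prefer working with Node objects.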
