It's always nice when a client obligation indirectly feeds a FOSS project. I recently had some Web scraping to do while doing the day job thingie. For most people who do what I do, it's a common task these days, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best-known Python wrapper, uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In this most recent task, however, I had a lot of complex processing to do on the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) that uses the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara Bindery objects.
I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is a Python script I wrote to list all the mp3 links from a given Web page (for easy download with wget):
    from tidy import tidy_bind_url  # needs the tidy.py Amara demo

    url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
    doc = tidy_bind_url(url)

    # Display all links to mp3s (by file extension check)
    for link in doc.xml_xpath(u'//html:a[@href]'):
        if link.href.endswith(u'.mp3'):
            print link.href
The handy thing about Amara, even in this simple example, is how I was able to take advantage of the full power of XPath for the basic query, and then drop into Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See tidy.py for more sample code.
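The same division of labor (a path query to select candidates, then plain Python for the string test) can be tried without Amara or tidy. What follows is only a rough sketch using the standard library's xml.etree.ElementTree under Python 3, not the tidy.py demo itself. It assumes the input is already well-formed XHTML, and the sample document, the example.com URLs and the mp3_links helper are made up for illustration. ElementTree supports only a small subset of XPath, so it stands in for Amara's xml_xpath here and is not an equivalent.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping for the query; tidy's XHTML output uses this namespace.
XHTML_NS = {"html": "http://www.w3.org/1999/xhtml"}

# Hypothetical, already well-formed XHTML standing in for a tidied page.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.com/one.mp3">One</a>
    <a href="http://example.com/page.html">Page</a>
    <a name="anchor-without-href">No link</a>
    <a href="http://example.com/two.mp3">Two</a>
  </body>
</html>"""


def mp3_links(xhtml_text):
    """Return the href of every anchor whose target ends in .mp3."""
    root = ET.fromstring(xhtml_text)
    # The path expression narrows things down to anchors that carry an href...
    anchors = root.findall(".//html:a[@href]", XHTML_NS)
    # ...and Python does the suffix check that XPath 1.0 has no function for.
    return [a.get("href") for a in anchors if a.get("href").endswith(".mp3")]


if __name__ == "__main__":
    for href in mp3_links(SAMPLE_XHTML):
        print(href)
```

Real-world pages would still need to go through tidy first, since ElementTree rejects anything that is not well-formed XML.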
Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.
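To tell the pages tidy handles from the ones it chokes on, one option is to look at the exit status of the command-line program: 0 for success, 1 for warnings, 2 for errors. The sketch below is an assumption about how one might wrap this in Python 3, not a copy of what tidy.py does. The helper names (tidy_command, describe_exit, html_to_xhtml) are hypothetical, and it assumes a tidy executable is on the PATH.

```python
import subprocess


def tidy_command(force_output=False):
    """Build the argv list for a quiet HTML-to-XHTML conversion."""
    # -q: quiet, -asxhtml: emit XHTML, -n: numeric entities, -utf8: UTF-8 in and out
    cmd = ["tidy", "-q", "-asxhtml", "-n", "-utf8"]
    if force_output:
        # Ask tidy to write output even when it reports errors.
        cmd += ["--force-output", "yes"]
    return cmd


def describe_exit(returncode):
    """Map tidy's exit status to a short label."""
    return {0: "ok", 1: "warnings", 2: "errors"}.get(returncode, "unknown")


def html_to_xhtml(html_bytes, force_output=False):
    """Pipe HTML through tidy; return (xhtml_bytes, status_label)."""
    # Raises FileNotFoundError if no tidy executable is installed.
    proc = subprocess.run(
        tidy_command(force_output),
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.stdout, describe_exit(proc.returncode)


if __name__ == "__main__":
    xhtml, status = html_to_xhtml(b"<title>t</title><p>Hello <b>world")
    print(status)
    print(xhtml.decode("utf-8"))
```

A status of "warnings" is the normal case for messy pages and the output is still usable. With "errors", tidy writes no output unless force_output is set, and forced output may still not be well-formed.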
Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.