Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I recently had some Web scraping to do as part of the day job thingie. As with most people who do what I do these days, it's a common task, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best-known Python wrapper, uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In this most recent task, however, I had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) that uses the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara Bindery objects.
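For readers who want a feel for the first half of that pipeline, here is a minimal sketch of piping HTML through the command-line tidy program from Python. It is not the code in tidy.py; the function names tidy_command and tidy_to_xhtml are made up for illustration, and the choice of flags (-quiet, -asxhtml, -numeric, -utf8, --force-output yes) is just one reasonable combination. It also assumes a tidy executable is on the PATH.

```python
import subprocess


def tidy_command(tidy_path="tidy"):
    """Build the argument list for tidy (hypothetical helper, not part of Amara).

    -asxhtml asks for XHTML output, -numeric emits numeric rather than
    named entities, -quiet suppresses the summary chatter, and
    --force-output yes requests output even when tidy reports errors.
    """
    return [tidy_path, "-quiet", "-asxhtml", "-numeric", "-utf8",
            "--force-output", "yes"]


def tidy_to_xhtml(html_bytes, tidy_path="tidy"):
    """Feed raw HTML bytes to tidy on stdin and return the XHTML bytes."""
    proc = subprocess.Popen(
        tidy_command(tidy_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    xhtml, diagnostics = proc.communicate(html_bytes)
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors); only give up
    # if it reported errors and produced nothing usable.
    if proc.returncode >= 2 and not xhtml.strip():
        raise RuntimeError("tidy failed: %r" % diagnostics[:200])
    return xhtml
```

The bytes that come back could then be handed to whatever XML tool you prefer; in my case that is Amara Bindery.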

I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is a Python script I wrote to list all the mp3 links on a given Web page (for easy download with wget):

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara, even in this simple example, is how I was able to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See tidy.py for more sample code.
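If you don't have Amara and tidy handy, the same split of labor (a path query to select candidates, then plain Python for the suffix test) can be sketched with the standard library alone. This swaps in xml.etree.ElementTree for Amara and runs on an inline XHTML sample rather than a tidied Web page; the SAMPLE document, its example.org URLs and the mp3_links function are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for the XHTML that tidy would produce from a real page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.org/one.mp3">One</a>
    <a href="http://example.org/notes.html">Notes</a>
    <a name="top">No href here</a>
    <a href="http://example.org/two.mp3">Two</a>
  </body>
</html>"""


def mp3_links(xhtml_text):
    """Return the href of every XHTML link whose target ends in .mp3."""
    root = ET.fromstring(xhtml_text)
    # The path expression handles the structural part: a elements that
    # carry an href attribute, anywhere in the document.
    anchors = root.findall(".//{%s}a[@href]" % XHTML_NS)
    # Python handles the part XPath 1.0 lacks a function for: the suffix check.
    return [a.get("href") for a in anchors if a.get("href").endswith(".mp3")]


for href in mp3_links(SAMPLE):
    print(href)
```

ElementTree only understands a small subset of XPath, so this is a weaker tool than the full XPath available through Amara, but the pattern is the same.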

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia

5 responses

Neat :)

I've been wondering how hard it would be to add also support for RelaxNG to Amara (unless I haven't looked properly there is no such thing).

Due to lots of work on my side I haven't been able to try out myself but maybe it could be interesting to add it.

What do you think?

- Sylvain

— Sylvain Hellegouarch

Nudge, nudge, wink, wink, http://tagsoup.info .

— John Cowan

Sylvain,

RELAX NG support is in the plans for the Amara 2.0 branch. 4Suite includes RELAX NG support, so I expect it will be fairly straightforward.

John,

Tagsoup didn't come into my mind because it's a Java app, and I started out looking at options I could import directly in Python. But since I ended up using the command line tidy, anyway, I could have broadened my options. I'll give Tagsoup a try.

— Uche

Have you compared the results of Beautiful Soup with the results of tidy.py? I'm just curious.

http://www.crummy.com/software/BeautifulSoup/

— Dethe Elza

Dethe,

I looked at BeautifulSoup when I wrote "Wrestling HTML" and again more recently. As I explained in "Wrestling HTML", I don't think it's the same thing that tidy offers. It's more of a selection API for grabbing content from HTML documents. Using Tidy, Tagsoup and such you end up with the entire document cleaned up and ready for random access processing. So I think it's useful to have both BS and something tidy-like in the toolbox.

I'll post another entry today on the matter that will probably interest you.

— Uche