Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, ya'll readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's demo to use Tagsoup, if available. Making it available on my Linux box was a simple matter of:

ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.

Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup
url = ""
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
for incident in soup('a', {'href' : re.compile('\\..*mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" (What's your man gotta do with me?..."

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia