In my last entry I presented a bit of code to turn Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, y'all readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.
And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use TagSoup, if available. Making it available on my Linux box was a simple matter of:
    wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
    ln -s tagsoup-1.0rc3.jar tagsoup.jar
That's all. Thanks, John.
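For the curious, the "use TagSoup if available" idea can be sketched roughly as below. This is not the actual Amara tidy.py demo code, just a minimal illustration written for Python 3. The function names (`tagsoup_available`, `choose_command`, `clean_html`) are made up for this example. It assumes `tagsoup.jar` sits in the current directory (the symlink from above), that `java` and `tidy` are on the PATH, and that both tools read HTML on standard input when no file is named.

```python
import os
import shutil
import subprocess

# Assumed location: the symlink created with `ln -s` above.
TAGSOUP_JAR = "tagsoup.jar"


def tagsoup_available(jar_path=TAGSOUP_JAR):
    """Return True if the jar file exists and a `java` launcher is on the PATH."""
    return os.path.isfile(jar_path) and shutil.which("java") is not None


def tagsoup_command(jar_path=TAGSOUP_JAR):
    """Command line that runs TagSoup as a filter (HTML in on stdin, XHTML out on stdout)."""
    return ["java", "-jar", jar_path]


def tidy_command():
    """Fallback command line: HTML Tidy as a quiet filter emitting XHTML."""
    return ["tidy", "-asxhtml", "-quiet"]


def choose_command(jar_path=TAGSOUP_JAR):
    """Prefer TagSoup when it is usable, otherwise fall back to Tidy."""
    if tagsoup_available(jar_path):
        return tagsoup_command(jar_path)
    return tidy_command()


def clean_html(html_bytes, jar_path=TAGSOUP_JAR):
    """Pipe raw HTML bytes through the chosen cleaner and return its stdout bytes.

    The exit status is deliberately not checked, since Tidy uses a nonzero
    status to report mere warnings.
    """
    result = subprocess.run(
        choose_command(jar_path),
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout
```

Something like `clean_html(open("page.html", "rb").read())` would then hand back cleaned-up markup ready to feed to an XML parser, whichever tool happened to be installed.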
Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.
    import re
    import urllib
    from BeautifulSoup import BeautifulSoup

    url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
    stream = urllib.urlopen(url)
    soup = BeautifulSoup(stream)
    for incident in soup('a', {'href' : re.compile('\\..*mp3$')}):
        print incident['href']
Very nice. I wonder how far that little XPath-like convention goes.
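To make the "tag name plus attribute pattern" filter a bit more concrete, here is a dependency-free sketch of the same pull/scrape idea using the standard library's `html.parser` in Python 3. It is not BeautifulSoup and says nothing about how BeautifulSoup works internally. The `LinkCollector` class, the `mp3_links` function and the inline `SAMPLE` markup are all invented for illustration, and the regular expression is the one from the snippet above.

```python
import re
from html.parser import HTMLParser

# Same pattern as the BeautifulSoup snippet: a dot, anything, then "mp3" at the end.
MP3_HREF = re.compile(r'\..*mp3$')


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches a given pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.pattern.search(href):
            self.links.append(href)


def mp3_links(html):
    """Return the matching hrefs found in an HTML string, in document order."""
    collector = LinkCollector(MP3_HREF)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up, deliberately sloppy markup (uppercase tag, unclosed anchor).
SAMPLE = '''<html><body>
<A HREF="http://example.org/tunes/one.mp3">One
<a href="/about.html">About</a>
<a href='two.mp3'>Two</a>
<a name="top">no href here</a>
'''

print(mp3_links(SAMPLE))
```

The event-driven parser shrugs off the unclosed tag because it never tries to build a tree. That is fine for plucking out links, but it is no substitute for a properly cleaned-up XHTML document.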
In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.
It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.
And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" ("What's your man gotta do with me?..."):
I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.
Hmm. Does that count as a Quotīdiē?