del.icio.us daily links, using Amara

I added a new feature on Copia: every day there is now an automated posting with my and Chime's del.icio.us links from the previous day. You can see an example of the results in the Copia entry just before this one.

What I think is coolest is how easy it was to write, and how easy the resulting code is to understand. It's just 35 lines (including 7 lines of imports), and in that it packs some useful features I haven't found in other such scripts, including:

  • Full Unicode safety (naturally, I wouldn't have it any other way)
  • support for multiple del.icio.us feeds, with each link attributed to its author
  • tagging the PyBlosxom entry with the aggregated/unique tags from the del.icio.us entries

Here's the code. The only external requirement is Amara:

import os
import sets
import time
import codecs
import itertools
from datetime import date, timedelta

from amara import binderytools

TAGBASE = 'http://del.icio.us/tag/'

#Change BASEDIR and FEEDS to customize
BASEDIR = '/srv/www/ogbuji.net/copia/pyblosxom/datadir'
FEEDS = ['http://del.icio.us/rss/uche', 'http://del.icio.us/rss/chimezie']

now = time.gmtime()
timestamp = unicode(time.strftime('%Y-%m-%dT%H:%M:%SZ', now))
targetdate = (date(*now[:3]) - timedelta(1)).isoformat()

#Using Amara.  Easy to just grab the RSS feed
docs = map(binderytools.bind_uri, FEEDS)
items = itertools.chain(*[ doc.RDF.item for doc in docs ])
current_items = [ item for item in items
                       if unicode(item.date).startswith(targetdate) ]
if current_items:
    # Create a Markdown page with the daily bookmarks.
    dir = '%s/%s' % (BASEDIR, targetdate)
    if not os.path.isdir(dir):
        os.makedirs(dir)
    f = codecs.open('%s/%s/del.icio.us.links.txt' % (BASEDIR, targetdate), 'w', 'utf-8')

    # Pyblosxom Title
    f.write(u'del.icio.us bookmarks for %s\n' % targetdate)

    tags = sets.Set()
    for item in current_items:
        tags.update([ li.resource[len(TAGBASE):] for li in item.topics.Bag.li ])
    f.write(u'#post_time %s\n'%(timestamp))
    f.write(u'<!--keywords: del.icio.us,%s -->\n'%(u','.join(tags)))

    for item in current_items:
        # List of links in Markdown.
        title = getattr(item, 'title', u'')
        href = getattr(item, 'link', u'')
        desc = getattr(item, 'description', u'')
        creator = getattr(item, 'creator', u'')
        f.write(u'* "[%s](%s)": %s *(from %s)*\n' % (title, href, desc, creator))

    f.close()

Or download amara_delicious.py.

You can see how easily you can process RSS 1.0 in Amara. I don't think actual RDF parsing/processing is at all necessary. That extra layer is the first thing that turned me away from Matt Biddulph's module, in addition to his use of libxml for XML processing, which Roberto De Almeida's module also uses.
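For readers without Amara at hand, the same point can be made with nothing but the standard library in today's Python: RSS 1.0 is just XML, so namespace-qualified ElementTree lookups suffice, with no RDF machinery in sight. This is a rough modern sketch, not the post's code, and the feed text is invented for illustration.

```python
# A rough stdlib analogue of the point above: RSS 1.0 is plain XML, so
# namespace-qualified ElementTree lookups are enough -- no RDF machinery.
# The feed text below is made up for illustration.
import xml.etree.ElementTree as ET

RSS10 = '{http://purl.org/rss/1.0/}'
DC = '{http://purl.org/dc/elements/1.1/}'

feed = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://example.org/1">
    <title>First link</title>
    <dc:date>2005-09-14T00:11:22Z</dc:date>
  </item>
</rdf:RDF>'''

root = ET.fromstring(feed)
# Pull (title, date) pairs off each RSS item, by qualified tag name
items = [(i.findtext(RSS10 + 'title'), i.findtext(DC + 'date'))
         for i in root.iter(RSS10 + 'item')]
print(items)
```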

[Uche Ogbuji]

via Copia

Dare's XLINQ examples in Amara

Dare's examples for XLINQ are interesting. They are certainly more streamlined than the usual C# and Java fare I see, but still a bit clunky compared to what I'm used to in Python. To be fair, a lot of that is down to the C# language, so I'd be interested in seeing what XLINQ looks like from Python.NET or Boo.

The following is my translation from Dare's fragments into corresponding Amara fragments (compatible with the Amara 1.2 branch).

'1. Creating an XML document'

import amara
#Done in 2 chunks just to show the range of options
#Another way would be to start with amara.create_document
skel = '<!--XLinq Contacts XML Example--><?MyApp 123-44-4444?><contacts/>'
doc = amara.parse(skel)
doc.contacts.xml_append_fragment("""<contact>
  <name>Patrick Hines</name>
  <phone>206-555-0144</phone>
  <address>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>
    <state>WA</state>
    <postal>68042</postal>
  </address>
</contact>
""")

'2. Creating an XML element in the "http://example.com" namespace'

doc.xml_create_element(u'contacts', u'http://example.com')

'3. Loading an XML element from a file'

amara.parse_path(r'c:\myContactList.xml')  #raw string keeps the backslash intact

'4. Writing out an array of Person objects as an XML file'

persons = {}
persons[u'Patrick Hines'] = [u'206-555-0144', u'425-555-0145']
persons[u'Gretchen Rivas'] = [u'206-555-0163']
doc.xml_append(doc.xml_create_element(u'contacts'))  #create and attach the element
for name in persons:
    doc.contacts.xml_append_fragment('<person><name>%s</name></person>'%name)
    for phone in persons[name]:
        doc.contacts.person[-1].xml_append_fragment('<phone>%s</phone>'%phone)
print doc.xml()

'5. Print out all the element nodes that are children of the <contact> element'

for c in contact.xml_child_elements():
    print c.xml()

'6. Print all the <phone> elements that are children of the <contact> element'

for c in contact.xml_xpath(u'phone'):
    print c.xml()

'7. Adding a <phone> element as a child of the <contact> element'

contacts.xml_append_fragment('<phone>%s</phone>'%'206-555-0168')

'8. Adding a <phone> element as a sibling of another <phone> element'

mobile = contacts.xml_create_element(u'phone', content=u'206-555-0168')
first = contacts.phone
contacts.xml_insert_after(first, mobile)

'9. Adding an <address> element as a child of the <contact> element'

contacts.xml_append_fragment("""  <address>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>
    <state>WA</state>
    <postal>68042</postal>
  </address>
""")

'10. Deleting all <phone> elements under a <contact> element'

for p in list(contact.phone): contact.xml_remove_child(p)  #copy first so removal doesn't disturb iteration

'11. Delete all children of the <address> element which is a child of the <contact> element'

contacts.contact.address.xml_clear()

'12. Replacing the content of the <phone> element under a <contact> element'

#Not really necessary: just showing how to clear the content
contact.phone.xml_clear()
contact.phone = u'425-555-0155'

'13. Alternate technique for replacing the content of the <phone> element under a <contact> element'

contact.phone = u'425-555-0155'

'14. Creating a contact element with attributes multiple phone number types'

#I'm sure it's clear by now how easy this would be with xml_append_fragment
#So here is the more analogous API approach
contact = contacts.xml_create_element(u'contact')
contact.xml_append(contact.xml_create_element(u'name', content=u'Patrick Hines'))
contact.xml_append(
    contact.xml_create_element(u'phone',
                               attributes={u'type': u'home'},
                               content=u'206-555-0144'))
contact.xml_append(
    contact.xml_create_element(u'phone',
                               attributes={u'type': u'work'},
                               content=u'425-555-0145'))

'15. Printing the value of the <phone> element whose type attribute has the value "home"'

print u'Home phone is:', contact.xml_xpath(u'phone[@type="home"]')

'16. Deleting the type attribute of the first <phone> element under the <contact> element'

del contact.phone.type

'17. Transforming our original <contacts> element to a new <contacts> element containing a list of <contact> elements whose children are <name> and <phoneNumbers>'

new_contacts = doc.xml_create_element(u'contacts')
for c in doc.contacts.contact:
    new_contacts.xml_append_fragment('''<contact>
    <name>%s</name>
    <phoneNumbers/>
    </contact>'''%c.name)
    for p in c.phone:
        new_contacts.contact[-1].phoneNumbers.xml_append(p)

'18. Retrieving the names of all the contacts from Washington, sorted alphabetically '

wash_contacts = contacts.xml_xpath(u'contact[address/state="WA"]')
names = [ unicode(c.name) for c in wash_contacts ]
names.sort()

[Uche Ogbuji]

via Copia

Solution: simple XML output "templates" for Amara

A few months ago in "Sane template-like output for Amara" I discussed ideas for making the Amara output API a little bit more competitive with full-blown templating systems such as XSLT, without adopting all the madness of template frameworks.

I just checked in the simplest patch that does the trick. Here is an example from the previous article:

Amara 1.0 code:

person_elem = newdoc.xml_element(
        u'person',
        attributes={u'name': unicode(person.name)}
    )
newdoc.xml_append(person_elem)

Proposed Amara 1.2 code:

newdoc.xml_append_template("<person name='{person.name}'/>")

What I actually checked into CVS today for Amara 1.2:

newdoc.xml_append_fragment("<person name='%s'/>"%person.name)

That has the advantage of leaning as much as possible on an existing Python concept (formatted strings). As the method name indicates, this is conceptually no longer a template, but rather a fragment of XML in text form. The magic for Amara is in allowing one to dynamically create XML objects from such fragments. I think this is a nearly unique capability among Python XML output APIs (shared only with 4Suite's MarkupWriter), though I have no doubt you'll let me know if I'm wrong.
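For readers who don't have Amara installed, the text-fragment-to-live-objects idea can be sketched in modern Python with the standard library's ElementTree. This is an analogy, not Amara's API:

```python
# Sketch of the fragment idea with stdlib ElementTree rather than Amara:
# build a fragment as a formatted string, parse it, and graft the
# resulting element object into an existing tree.
import xml.etree.ElementTree as ET

doc = ET.fromstring('<people/>')
name = 'Eleanor'  #illustrative value
fragment = ET.fromstring("<person name='%s'/>" % name)
doc.append(fragment)  #now a live node in the tree
print(ET.tostring(doc, encoding='unicode'))
```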

Also, I think the approach I settled on is best in light of the three "things to ponder" from the older article.

  • Security. Again I'm leaning on a well-known facility of Python, and not introducing any new holes. The original proposal would have opened up possible issues with tainted strings in the template expressions.
  • String or Unicode? I went with strings for the fragments. It's up to the developer to make sure that however he constructs the XML fragment, the result is a plain string and not a Unicode object.
  • separation of model and presentation. There is a very clear separation between Python operations to build a string XML fragment (these are usually the data model objects), and any transforms applied to the resulting XML binding objects (this is usually the separate presentation side). Sure a determined developer can write spaghetti, but I think that with xml_append_fragment it's possible and natural to have a clean separation. With most template systems, this is very hard to achieve.
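One caveat worth adding to the security point (my note, not part of the original argument): when interpolating untrusted values into a fragment string, escape them first. The standard library's xml.sax.saxutils covers this:

```python
# Values interpolated into a string-built XML fragment should be escaped;
# quoteattr escapes the value and adds the surrounding quotes itself.
from xml.sax.saxutils import quoteattr

name = 'Jack & Jill <admins>'  #untrusted input, for illustration
fragment = '<person name=%s/>' % quoteattr(name)
print(fragment)
```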

One other thing to mention is that the dynamic incorporation of the new fragment into the XML binding makes this a potential building block for pipelined processing architecture.

def process_link(body, href, content):
    body.xml_append_fragment('%s'%(href, content))
    #Send the "a" element object that was just appended to
    #the next pipeline stage
    check_unique(body.a[-1])
    return

def check_unique(a_node):
    if not a_node.href in g_link_dict:
        #index the href to the link text (a element text content)
        g_link_dict[a_node.href] = unicode(a_node)
    return

[Uche Ogbuji]

via Copia

Python/XML column #37 (and out): Processing Atom 1.0

"Processing Atom 1.0"

In his final Python-XML column, Uche Ogbuji shows us three ways to process Atom 1.0 feeds in Python. [Sep. 14, 2005]

I show how to parse Atom 1.0 using minidom (for those who want no additional dependencies), Amara Bindery (for those who want an easier API) and Universal Feed Parser (with a quick hack to bring the support in UFP 3.3 up to Atom 1.0). I also show how to use DateUtil and Python 2.3's datetime module to process Atom dates.
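For the date-handling piece, the common Atom 1.0 timestamp form (RFC 3339 with a trailing Z) can be handled with the standard library alone in today's Python; this sketch covers only that UTC variant, while DateUtil handles the general case:

```python
# Minimal stdlib handling of the common Atom 1.0 date form: RFC 3339
# with a trailing Z (UTC). Other offsets would need more work (or DateUtil).
from datetime import datetime, timezone

def parse_atom_date(s):
    # Handles only the 'Z' variant, e.g. 2005-09-14T12:00:00Z
    return datetime.strptime(s, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)

d = parse_atom_date('2005-09-14T12:00:00Z')
print(d.isoformat())
```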

As the teaser says, we've come to the end of the column in its present form, but it's more of a transition than a termination. From the article:

And with this month's exploration, the Python-XML column has come to an end. After discussions with my editor, I'll replace this column with one with a broader focus. It will cover the intersection of Agile Languages and Web 2.0 technologies. The primary language focus will still be Python, but there will sometimes be coverage of other languages such as Ruby and ECMAScript. I think many of the topics will continue to be of interest to readers of the present column. I look forward to continuing my relationship with the XML.com audience.

It is too bad that I didn't get to some of the articles I had in the queue, including coverage of lxml, pygenx, XSLT processing from Python, the role of PEP 342 in XML processing, and more. I can still squeeze some of these topics into the new column, I think, as long as I keep an emphasis on the Web. I'll also try to keep up my coverage of news in the Python/XML community here on Copia.

Speaking of such news, I forgot to mention in the column that I'd found an interesting resource from John Shipman.

[F]or my relatively modest needs, I've written a more Pythonic module that uses minidom. Complete documentation, including the code of the module in 'literate programming' style, is at:

http://www.nmt.edu/tcc/help/pubs/pyxml/

The relevant sections start with section 7, "xmlcreate.py".

[Uche Ogbuji]

via Copia

Domlette and Saxlette: huge performance boosts for 4Suite (and derived code)

For a while 4Suite has had an 80/20 DOM implementation completely in C: Domlette (formerly named cDomlette). Jeremy has been making a lot of performance tweaks to the C code, and current CVS is already 3-4 times faster than Domlette in 4Suite 1.0a4.

In addition, Jeremy stealthily introduced a new feature to 4Suite: Saxlette. Saxlette uses the same Expat C code Domlette uses, but exposes it as SAX, so we get SAX implemented completely in C. It follows the standard Python/SAX API; for example, the following code uses Saxlette to count the elements:

from xml import sax

furi = "file:ot.xml"

class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

parser = sax.make_parser(['Ft.Xml.Sax'])
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(furi)
print "Elements counted:", handler.ecount

If you don't care about PySax compatibility, you can use the more specialized API, which involves the following lines in place of the equivalents above:

from Ft.Xml import Sax
...
class element_counter:
...
parser = Sax.CreateParser()

The code changes needed from the first listing above to regular PySax are minimal. As Jeremy puts it:

Unlike the distributed PySax drivers, Saxlette follows the SAX2 spec and defaults feature_namespaces to True and feature_namespace_prefixes to False, neither of which is allowed to be changed (which is exactly what SAX2 says is required). Python/SAX defaults to SAX1 behavior and Saxlette defaults to SAX2 behavior.

The following is a PySax example:

from xml import sax

furi = "file:ot.xml"

#Handler has to derive from sax.ContentHandler,
#or, in practice, implement all interfaces
class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    #SAX1 startElement by default, rather than SAX2 startElementNS
    def startElement(self, name, attribs):
        self.ecount += 1

parser = sax.make_parser()
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(furi)
print "Elements counted:", handler.ecount
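For comparison with Jeremy's quote above, here is how you opt in to SAX2-style namespace callbacks with the stock Python parser in today's Python 3; the setFeature call is the switch he refers to (the sample document is mine):

```python
# Switching the stock Python/SAX parser into SAX2 (namespace-aware) mode:
# feature_namespaces defaults to False, so it must be enabled explicitly
# for startElementNS to be called.
import xml.sax
from xml.sax.handler import feature_namespaces
from io import BytesIO

class element_counter(xml.sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  #default is False (SAX1 style)
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(BytesIO(b'<a xmlns="urn:x"><b/><b/></a>'))  #sample doc, mine
print("Elements counted:", handler.ecount)
```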

The speed difference is huge. Jeremy did some testing with timeit.py (using more involved test code than the above), and in those limited tests Saxlette showed up as fast as, and in some cases a bit faster than, cElementTree and libxml/Python (much, much faster than xml.sax in all cases). Interestingly, Domlette is now within 30%-40% of Saxlette in raw speed, which is impressive considering that it is building a fully functional DOM. As I've said in the past, I'm done with the silly benchmarks game, so someone else will have to pursue matters in further detail if they really can't do without their hot dog eating contests.

In another exciting development Saxlette has gained a generator mode using Expat's suspend/resume capability. This means you can have a Saxlette handler yield results from the SAX callbacks. It will allow me, for example, to have Amara's pushdom and pushbind work without threads, eliminating a huge drag on their performance (context switching is basically punishment). I'm working this capability into the code in the Amara 1.2 branch. So far the effects are dramatic.
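The no-threads pull style being described can be sketched today with the standard library's XMLPullParser; it is not Amara's mechanism, but it has the same shape: feed bytes in and pull completed nodes out, all in one thread of control.

```python
# The no-threads idea in pull form, sketched with the stdlib XMLPullParser
# (not Amara's suspend/resume mechanism): feed bytes in, pull completed
# elements out of the same thread, with no context switching.
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=('end',))
chunks = [b'<items><item>one</item>', b'<item>two</item></items>']
seen = []
for chunk in chunks:
    parser.feed(chunk)
    for event, elem in parser.read_events():
        if elem.tag == 'item':
            seen.append(elem.text)
print(seen)
```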

[Uche Ogbuji]

via Copia

Xampl, re: "XML data bindings, static languages, dynamic languages"

In response to "XML data bindings, static languages, dynamic languages," Bob Hutchison posted some thoughts. Just as I used Amara as the kernel of my demonstrations, Bob used his project xampl as the kernel of his. He introduces xampl in another entry, which was inspired by my own article on EaseXML.

Xampl is an XML data binding. As Bob writes:

Secondly, there are versions of xampl for Java and Common Lisp. I’ve got an old (summer 2002) version for Ruby that needs updating (I wrote the xampl-pp pull parser to support this experiment).

Bob says that Xampl also deals with things that Elliotte Harold mentions as usual scourges for Java data bindings: mixed content, repeated elements, omitted elements, and element order. Of course these things should be food and drink to any XML tool, and I'm glad folks are finally plugging such gaping holes. Eric van der Vlist is also in the game with TreeBind, and it seems some Java tools try to wriggle out of the pinch by using XQuery.

Based on Bob's snippets, Xampl looks handy. Rather verbose, but no more so than Java pretty much requires. One thing that strikes me in Bob's examples is that Xampl appears to require and create a bogon namespace (http://www.xampl.com/labelsExample). It seems to have something to do with Java packaging, but regardless of the role of this fake namespace, the XML represented by Xampl in Bob's snippets is not the same as the XML in the original source examples. An unprefixed element in a namespace is of course not the same thing as an element in the null namespace. I would not accept any tool that involves such a mix-up. It's quite possible that Xampl does not do so, and I'm just misunderstanding Bob's examples.

Bob provides Xampl code to match the EaseXML snippets in my article. Similarly to how EaseXML requires Python framework code, Xampl requires XML framework code. Since "XML situps" have been on the wires lately, they come to mind for a moment, but hey, if you're already processing XML with Xampl, I suppose you might not flinch at one more XML. I will point out that Amara does not require any framework code whatsoever besides the XML itself, not even an XML schema. It effectively provides dynamic code generation.

Xampl turns XML constructs into Java getters, e.g. html.getHead(). Amara uses the Python convention of properties rather than getters and setters, so you have html.head, and you can even assign to this property in order to mutate the XML. Xampl looks neat. The things that turn me off are largely things that are pretty much inevitable in Java, not least the very large amount of code generated by the binding. It supports XPath, as Amara does, and provides a "rhino" option to expose XML objects through Javascript, which offers you a bit more of the flexibility of Python (I don't know how much overhead to expect from Javascript through Java through XML, but it's a question I'd be quick to ask as a user).
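A toy illustration of the getter-versus-property contrast (my sketch, not Amara's or Xampl's actual code):

```python
# Toy contrast between Java-style getters and a Python property you can
# also assign to (the way Amara lets you mutate via html.head).
class Head:
    def __init__(self, title):
        self.title = title

class Html:
    def __init__(self):
        self._head = Head('Example')
    def getHead(self):  #Java-ish accessor
        return self._head
    @property
    def head(self):  #Pythonic property access
        return self._head
    @head.setter
    def head(self, value):  #assignment mutates the underlying data
        self._head = value

html = Html()
print(html.getHead().title, html.head.title)
html.head = Head('New title')  #mutation through the property
print(html.head.title)
```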

It's good to have projects such as Xampl and Treebind and Nux. I'd rather use Python tools such as Amara, Gnosis and GenerateDS, but Java has the visibility and it's good for people to be aware that XML does not necessarily require greater imprisonment of expression than what comes with the application language. You don't need to accept crazy idioms and stifling limitations in matters as fundamental as mixed content and element ordering. XML and sanity can coexist.

[Uche Ogbuji]

via Copia

Windows prebuilt binary package for Amara

Several Amara users have mentioned trying to build for Windows but running into problems, or a requirement for .NET. Sylvain Hellegouarch comes to the rescue with a binary package he built using Amara 1.0, latest 4Suite CVS, Python 2.4.1 and .NET 1.1 (latest patches). Windows users can install this package without having to worry about compilers or any other such hassles. Sylvain posted it under the default name generated by distutils, but I renamed it as appropriate and I'm hosting it on the Amara home page, and in the Amara contrib FTP area. Please let Sylvain and me know (by posting to the 4Suite mailing list with "[Amara]" in the subject) if you have any trouble with this package.

Meanwhile work on the 1.2 branch continues. I've got up to 30% speed and memory usage improvement over Amara Bindery 1.0, in large part by moving from PySax to 4Suite's undocumented Saxlette library (all in highly optimized C). Jeremy is also working on suspend/resume for 4Suite's parser, which would allow for a huge performance boost for pushbind. I'll try to start working on RELAX NG features if I get a chance this week.

[Uche Ogbuji]

via Copia

Amara goes 1.0, gets simpler to install

I released Amara 1.0 today. It's been properly cooked for a while, but life always has its small interruptions. The big change in 1.0 is that I now offer it in a couple of package options.

For those who just need some XML processing, and don't really care about RDF, all you need is Python and Amara-1.0-allinone (grab it from the FTP site). It has a trimmed down subset of 4Suite bundled for one-step install.

For those who want the full complement of what 4Suite has to offer, or for those who've already installed 4Suite anyway, there is the stand alone Amara-1.0 package.

Here's the thing, though. Right now it feels to me that I should be pushing Amara-allinone and not Amara+4Suite. Amara-allinone contains all the 1.0-quality components of 4Suite right now. 4Suite is a combination of a rock-solid core XML library, a somewhat out-of-date RDF library, and a quite rickety server framework. This has been bad for 4Suite. People miss out on the power of its core XML facilities because of the size and uneven quality of the rest of the package. In fact, 4Suite has been stuck in the 1.0 alpha and beta stages forever, not because the core libraries aren't 1.0 quality (heck, they're 4.x quality), but because we keep hoping to bring the rest of the package up to scratch. It's been on my mind for a while that we should just split the package up to solve this problem. This is what I've done, in effect, with this Amara release.

As for the rest of 4Suite, the RDF engine just needs a parser update for the most recent RDF specifications. The XML/RDF repository probably doesn't need all that much more work before it's ready for 1.0 release. As for the protocol server, as I've said several times before, I think we should just chuck it. Better to use another server package, such as CherryPy.

As for Amara, I'll continue bug fixes on the 1.x branch, but the real fun will be on the 2.x branch where I'll be refactoring a few things.

[Uche Ogbuji]

via Copia

XML data bindings, static languages, dynamic languages

A discussion about the brokenness of W3C XML Schema (WXS) on XML-DEV turned interestingly to the topic of the limitations of XML data bindings. This thread crystallized into a truly bizarre subthread where we had Mike Champion and Paul Downey actually trying to argue that the silly WXS wart xsi:nil might be more important in XML than mixed content (honestly the arrogance of some of the XML gentry just takes my breath away). As usual it was Eric van der Vlist and Elliotte Harold patiently arguing common sense, and at one point Pete Cordell asked them:

How do you think a data binding app should handle mixed content? We lump a complex type's mixed content into a string and stop there, which I don't think is ideal (although it is a common approach). Another approach could be to have strings in your language binding classes (in our case C++) interleaved with the data elements that would store the CDATA parts. Would this be better? Is there a need for both?

Of course as author of Amara Bindery, a Python data binding, my response to this is "it's easy to handle mixed content." Moving on in the thread he elaborates:

Being guilty of being a code-head (and a binding one at that - can it get worse!), I'm keen to know how you'd like us to make a better fist of it. One way of binding the example of "<p>This is <strong>very</strong> important</p>" might be to have a class structure that (with any unused elements ignored) looks like:-

class p
{
    string cdata1;        // = "This is "
    class strong strong;
    string cdata2;        // = " important"
};

class strong
{
    string cdata1;        // = "very"
};

as opposed to (ignoring the CDATA):

class p
{
    class strong strong;
};

class strong
{
};

or (lumping all the mixed text together):

class p
{
    string mixedContent;    // = "<p>This is <strong>very</strong> important</p>"
};

Or do you just decide that binding isn't the right solution in this case, or a hybrid is required?

It looks to me like a problem of poor expressiveness in a statically, strongly typed language. Of course, static versus dynamic is a hot topic these days, and has been since the "scripting language" diss started to wear thin. But the simple fact is that Amara doesn't even blink at this, and needs a lot less superstructure:

>>> from amara.binderytools import bind_string
>>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
>>> doc.p
<amara.bindery.p object at 0xb7bab0ec>
>>> doc.p.xml()
'<p>This is <strong>very</strong>  important</p>'
>>> doc.p.strong
<amara.bindery.strong object at 0xb7bab14c>
>>> doc.p.strong.xml()
'<strong>very</strong>'
>>> doc.p.xml_children
[u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']

There's the magic. All the XML data is there; it uses the vocabulary of the XML itself in the object model (as expected for a data binding); it maintains the full structure of the mixed content in a way that is very easy for the user to process. And if we ever decide we just want the content, unmixed, we can use the usual XPath technique:

>>> doc.p.xml_xpath(u"string(.)")
u'This is very  important'

So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and even Elliotte Harold:

Personally I'd stay away from data binding for use cases like this. Dealing with mixed content is hardly the only problem. You also have to deal with repeated elements, omitted elements, and order. Child elements just don't work well as fields. You can of course fix all this, but then you end up with something about as complicated as DOM.

Data binding is a plausible solution for going from objects and classes to XML documents and schemas; but it's a one-way ride. Going the other direction: from documents and schemas to objects and classes is much more complicated and generally not worth the hassle.

As I hope my Amara example shows, you do not need to end up with anything nearly as complex as DOM, and it's hardly a one-way ride. I think it should be made clear that a lot of the difficulties that seem to stem from Java's own limitations are not general XML processing problems, and thus I do not think they should properly inform a problem such as the emphasis of an XML schema language. In fact, I've always argued that it's the very marrying of XML technology to the limitations of other technologies such as statically-typed OO languages and relational DBMSes that results in horrors such as WXS and XQuery. When designers focus on XML qua XML, as the RELAX NG folks did and the XPath folks did, for example, the results tend to be quite superior.

Eric did point out Amara in the thread.

An interesting side note—a question about non-XHTML use cases of mixed content (one even needs to ask?!) led once again to mention of the most widely underestimated XML modeling problem of all time: the structure of personal names. Peter Gerstbach provided the reminder this time. I've done my bit in the past.

[Uche Ogbuji]

via Copia

Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, ya'll readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use Tagsoup, if available. Making it available on my Linux box was a simple matter of:

wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.

Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
for incident in soup('a', {'href' : re.compile(r'\..*mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.
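For contrast, here's the same mp3-link scrape using only the standard library's html.parser, run against an inline sample document rather than the webjay page (which may not stay around):

```python
# The mp3-link scrape with nothing but the stdlib: an HTMLParser subclass
# that collects hrefs ending in .mp3. The sample HTML is invented here.
import re
from html.parser import HTMLParser

class MP3Links(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if re.search(r'\.mp3$', href):
                self.links.append(href)

sample = ('<html><body><a href="song.mp3">one</a>'
          '<a href="page.html">no</a></body></html>')
p = MP3Links()
p.feed(sample)
print(p.links)
```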

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" ("What's your man gotta do with me?"):

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia