"Tip: Rescue terrible HTML with TagSoup"

Well, since I've so emphatically broken my Weblogging pause for The Cup, I'd better post some professional items.

"Tip: Rescue terrible HTML with TagSoup"

Subtitle: Turn poorly formed HTML into valid XHTML
Synopsis: XHTML is a friendly enough format for parsing and screen-scraping, but the Web still has a lot of messy HTML out there. In this tip Uche Ogbuji demonstrates the use of TagSoup to turn just about any HTML into neat XHTML.

TagSoup is very handy. EVen though it's a Java project I put it to use from Python code fairly often. It also recently went full 1.0.

[Uche Ogbuji]

via Copia

Confusion over Python storage form for Unicode

I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build will either store characters as UCS-2 or UCS-4. Option one is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which would be required if it were UTF-16. Option two is UCS-4, not UTF-32, although the difference in this case truly is academic and probably would only affect people using certain chunks of Private Use Areas.

You can't neatly categorize Python Unicode storage format by platform, either. True Jython is presently limited to UTF-16 storage, but you can compile CPython to use either UCS-2 or UCS-4 on any platform. To do so configure Python with the command --enable-unicode=ucs4. To check whether your Python is a UCS-4 build check that `sys.maxunicode > 65536`. I would love to say that you don't have to worry whether your Python uses UCS-2 or UCS-4. If you're communicating between Python tools you should be using abstract Unicode objects which would be seamlessly portable. The problem is that, as I warn at every opportunity, there are serious problems with how Python core libraries handle certain characters in UCS-2 builds, because of the lack of respect for surrogate pairs. It is for this reason that I advise CPython users to always use UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32) is almost always a waste of space, but wasting space is better than munging characters.

For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.

Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode, and of its Python implementation, so I bet someone will jump in with some correction, but I think I have the core, practical details right, whereas I think Henri's characterization was confusing.

[Uche Ogbuji]

via Copia

Xampl, re: "XML data bindings, static languages, dynamic languages"

In response to XML data bindings, static languages, dynamic languages Bob Hutchison posted some thoughts. As I used Amara as the kernel of my demonstrations, Bob used his project xampl as the kernel of his. He introduces xampl in another entry which was inspired by my own article on EaseXML.

Xampl is a an XML data binding. As Bob writes:

Secondly, there are versions of xampl for Java and Common Lisp. I’ve got an old (summer 2002) version for Ruby that needs updating (I wrote the xampl-pp pull parser to support this experiment).

Bob says that Xampl also deals with things that Elliotte Harold mentions as usual scourges for Java data bindings: mixed content, repeated elements, omitted elements, and element order. Of course these things should be food and drink to any XML tool, and I'm glad folks are finally plugging such gaping holes. Eric van der Vlist is also in the game with TreeBind, and it seems some Java tools try to wriggle out of the pinch by using XQuery.

Based on Bob's snippets, Xampl looks handy. Rather verbose, but no more so than Java pretty much requires. One thing that strikes me in Bob's examples is that Xampl appears to require and create a bogon namespace (http://www.xampl.com/labelsExample). It seems maybe it has something to do with Java packaging or something, but regardless of the role of this fake namespace, the XML represented by Xampl in Bob's snippets is not the same as the XML in the original source examples. An unprefixed element in a namespace is of course not the same thing as an element in the null namespace. I would not accept any tool that involves such a mix-up. It's quite possible that Xampl does not do so, and I'm just misunderstanding Bob's examples.

Bob provides Xampl code to match the EaseXML snippets in my article. Similarly to how EaseXML requires Python framework code, Xampl requires XML framework code. Since "XML situps" have been on the wires lately, they come to mind for a moment, but hey, if you're already processing XML with Xampl, I suppose you might not flinch at one more XML. I will point out that Amara does not require any framework code whatsoever besides the XML itself, not even an XML schema. It effectively provides dynamic code generation.

Xampl turns XML constructs into Java getters, e.g. html.getHead(). Amara uses the Python convention of properties rather than getters and setters, so you have html.head, and you can even assign to this property in order to mutate the XML. Xampl looks neat. The things that turn me off are largely things that are pretty much inevitable in Java, not least the very large amount of code generated by the binding. It supports XPath, as Amara does, and provides a "rhino" option to expose XML objects through Javascript, which offers you a bit more of the flexibility of Python (I don't know how much overhead to expect from Javascript through Java through XML, but it's a question I'd be quick to ask as a user).

It's good to have projects such as Xampl and Treebind and Nux. I'd rather use Python tools such as Amara, Gnosis and GenerateDS, but Java has the visibility and it's good for people to be aware that XML does not necessarily require greater imprisonment of expression than what comes with the application language. You don't need to accept crazy idioms and stifling limitations in matters as fundamental as mixed content and element ordering. XML and sanity can coexist.

[Uche Ogbuji]

via Copia

XML data bindings, static languages, dynamic languages

A discussion about the brokenness of W3C XML Schema (WXS) on XML-DEV turned interestingly to the topic of the limitations of XML data bindings. This thread crystallized into a truly bizarre subthread where we had Mike Champion and Paul Downey actually trying to argue that the silly WXS wart xsi:nil might be more important in XML than mixed content (honestly the arrogance of some of the XML gentry just takes my breath away). As usual it was Eric van der Vlist and Elliotte Harold patiently arguing common sense, and at one point Pete Cordell asked them:

How do you think a data binding app should handle mixed content? We lump a complex types mixed content into a string and stop there, which I don't think is ideal (although it is a common approach). Another approach could be to have strings in your language binding classes (in our case C++) interleaved with the data elements that would store the CDATA parts. Would this be better? Is there a need for both?

Of course as author of Amara Bindery, a Python data binding, my response to this is "it's easy to handle mixed content." Moving on in the thread he elaborates:

Being guilty of being a code-head (and a binding one at that - can it get worse!), I'm keen to know how you'd like us to make a better fist of it. One way of binding the example of "<p>This is <strong>very</strong> important</p>" might be to have a class structure that (with any unused elements ignored) looks like:-

class p
    string cdata1;        // = "This is "
    class strong strong;
    string cdata2;        // = " important"

class strong
    string cdata1;        // = "very"

as opposed to (ignoring the CDATA):

class p
    class strong strong;

class strong

or (lumping all the mixed text together):

class p
    string mixedContent;    // = "<p>This is <strong>very</strong> important</p>"

Or do you just decide that binding isn't the right solution in this case, or a hybrid is required?

It looks to me like a problem with poor expressiveness in a statically, strongly typed language. Of course, static versus dynamic is a hot topic these days, and has been since the "scripting language" diss has started to wear thin. But the simple fact is that Amara doesn't even blink at this, and needs a lot less superstructure:

>>> from amara.binderytools import bind_string
>>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
>>> doc.p
<amara.bindery.p object at 0xb7bab0ec>
>>> doc.p.xml()
'<p>This is <strong>very</strong>  important</p>'
>>> doc.p.strong
<amara.bindery.strong object at 0xb7bab14c>
>>> doc.p.strong.xml()
>>> doc.p.xml_children
[u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']

There's the magic. All the XML data is there; it uses the vocabulary of the XML itself in the object model (as expected for a data binding); it maintains the full structure of the mixed content in a very easy way for the user to process. And if we ever decide we just want to content, unmixed, we can just use the usual XPath technique:

>>> doc.p.xml_xpath(u"string(.)")
u'This is very  important'

So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and even Elliotte Harold:

Personally I'd stay away from data binding for use cases like this. Dealing with mixed content is hardly the only problem. You also have to deal with repeated elements, omitted elements, and order. Child elements just don't work well as fields. You can of course fix all this, but then you end up with something about as complicated as DOM.

Data binding is a plausible solution for going from objects and classes to XML documents and schemas; but it's a one-way ride. Going the other direction: from documents and schemas to objects and classes is much more complicated and generally not worth the hassle.

As I hope my Amara example shows, you do not need to end up with anything nearly as complex as DOM, and it's hardly a one-way ride. I think it should be made clear that a lot of the difficulties that seem to stem from Java's own limitations are not general XML processing problems, and thus I do not think they should properly inform a problem such as the emphasis of an XML schema language. In fact, I've [always argued]() that it's the very marrying of XML technology to the limitations of other technologies such as statically-typed OO languages and relational DBMSes that results in horrors such as WXS and XQuery. When designers focus on XML qua XML, as the RELAX NG folks did and the XPath folks did, for example, the results tend to be quite superior.

Eric did point out Amara in the thread.

An interesting side note—a question about non-XHTML use cases of mixed content (one even needs to ask?!) led once again to mention of the most widely underestimated XML modeling problem of all time: the structure of personal names. Peter Gerstbach provided the reminder this time. I've done my bit in the past.

[Uche Ogbuji]

via Copia

Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, ya'll readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use Tagsoup, if available. Making it available on my Linux box was a simple matter of:

wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.

Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
for incident in soup('a', {'href' : re.compile('\\..*mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" (What's your man gotta do with me?..."

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia

TreeBind, and incidentally Nux

In my previous entry I said "When you compare the weary nature of, say Java XML data bindings, Amara is a nice advertisement for Python's dynamicism." Interestingly, Eric van der Vlist recently mentioned to me a project in which he attempts to address some of these deficiencies within Java. TreeBind, is "yet another XML <-> Java binding API." The TreeBind page says:

The difference between TreeBind and most of the other Java binding APIs is that we've tried to minimize the need for any type of schema or configuration file and to maximize the usage of introspection of Java classes in order to facilitate the integration with existing classes."

It's about time, but is Java's introspection really enough? It doesn't save you from welding to Java's type rigidity.

It makes me recall Wolfgang Hoschek's response to one of my Amara announcements on XML-DEV. The Amara example was:

The following is complete code for iterating through address labels in an XML document, while never loading more memory than needed to hold one label element:

from amara import binderytools
        for subtree in binderytools.pushbind('/labels/label',
            print subtree.label.name, 'of', subtree.label.address.city

And Wolfgang responded:

Very handy!

FYI, analog example Java code for the current Nux version reads as follows:

StreamingTransform myTransform = new StreamingTransform() {
             public Nodes transform(Element subtree) {
                 System.out.println(XQueryUtil.xquery(subtree, "name").get(0).getValue());
                 System.out.println(XQueryUtil.xquery(subtree, "address/city").get(0).getValue());
                 return new Nodes(); // mark current element as subject to garbage collection
        Builder builder = new Builder(new StreamingPathFilter("/labels/label", null).
            createNodeFactory(null, myTransform)); //[Line split by Uche for formatting reasons]
        builder.build(new File("labels.xml"));

It's not as compact as with Amara, but still quite handy...

This is certainly a great leap forward from Java/XML APIs I've seen, even from plain XOM. I'd expect a similar leap, albeit in a different direction, in TreeBind. But in my biased opinion, even such impressive leaps lose a lot of luster when compared to the gains in expressiveness provided by the Python example. Line count is just a bit of the picture. For overall idiom, and the amount of conceptual load buried in each construct, it's hard to even place the Python and Java examples on the same scale.

In here, I think, is the crux of where dynamic language advocates are unimpressed by the productivity gains claimed by XQuery advocates. Productivity gained through declarativity should complement rather than interfere with productivity gained through natural, expressive idiom. XQuery does not meet this test. By imposing a ponderous type framework over XML it provides productivity on one hand while stifling the power of dynamic languages. this is why in Amara I seek to harness the declarative power of unencumbered little languages such as XPath and XSLT patterns to the expressive power of Python. I think this gives us a huge head start over systems using XQuery and Java introspection to tame the chore of XML processing.

See also:

[Uche Ogbuji]

via Copia