Thinking XML #33: Serving up WordNet as XML

"Thinking XML: Serving up WordNet as XML"

Subtitle: Build the basic WordNet/XML facilities into a Web server framework
Synopsis: A few articles back, Uche Ogbuji discussed WordNet 2.0, a Princeton University project that aims to build a database of English words and lexical relationships between them. He showed how to extract XML serializations from the word database. In this article he continues the exploration, demonstrating code to serve up these WordNet/XML documents over Web protocols and showing you how to access these from XSLT.

This is the second part of a mini-series within the column. The previous article is "Querying WordNet as XML,", in which I present Python code for processing WordNet 2.0 into XML. This time I use CherryPy to expose the XML on the Web, either in human-readable or in raw form. This seems to be part of a nice trend of CherryPy on developerWorks. I hope people see this as yet another example of how easy and clean CherryPy is.

See other articles in the column. Comments here on Copia or on the column's official discussion forum. Next up in Thinking XML, RDF equivalents for the WordNet/XML.

[Uche Ogbuji]

via Copia

Domlette and Saxlette: huge performance boosts for 4Suite (and derived code)

For a while 4Suite has had an 80/20 DOM implementation completely in C: Domlette (formerly named cDomlette). Jeremy has been making a lot of performance tweaks to the C code, and current CVS is already 3-4 times faster than Domlette in 4Suite 1.0a4.

In addition, Jeremy stealthily introduced a new feature to 4Suite, Saxlette. Saxlette uses the same Expat C code Domlette uses, but exposes it as SAX. So we get SAX implemented completely in C. It follows the Python/SAX API normally, so for example the following code uses Saxlette to count the elements:

from xml import sax

furi = "file:ot.xml"

class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

parser = sax.make_parser(['Ft.Xml.Sax'])
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(furi)
print "Elements counted:", handler.ecount

If you don't care about PySax compatibility, you can use the more specialized API, which involves the following lines in place of the equivalents above:

from Ft.Xml import Sax
...
class element_counter():
....
parser = Sax.CreateParser()

The code changes needed from the first listing above to regular PySax are minimal. As Jeremy puts it:

Unlike the distributed PySax drivers, Saxlette follows the SAX2 spec and defaults feature_namespaces to True and feature_namespace_prefixes to False both of which are not allowed to be changed (which is exactly what SAX2 says is required). Python/SAX defaults to SAX1 behavior and Saxlette defaults to SAX2 behavior.

The following is a PySax example:

from xml import sax

furi = "file:ot.xml"

#Handler has to derive from sax.ContentHandler,'
#or, in practice, implement all interfaces
class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    #SAX1 startElement by default, rather than SAX2 startElementNS
    def startElement(self, name, attribs):
        self.ecount += 1

parser = sax.make_parser()
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(furi)
print "Elements counted:", handler.ecount

The speed difference is huge. Jeremy did some testing with timeit.py (using more involved test code than the above), and in those limited tests Saxlette showed up as fast as, and in some cases a bit faster than cElementTree and libxml/Python (much, much faster than xml.sax in all cases). Interestingly, Domlette is now within 30%-40% of Saxlette in raw speed, which is impressive considering that it is building a fully functional DOM. As I've said in the past, I'm done with the silly benchmarks game, so someone else will have to pursue matters to further detail if they really can't do without their hot dog eating contests.

In another exciting development Saxlette has gained a generator mode using Expat's suspend/resume capability. This means you can have a Saxlette handler yield results from the SAX callbacks. It will allow me, for example, to have Amara's pushdom and pushbind work without threads, eliminating a huge drag on their performance (context switching is basically punishment). I'm working this capability into the code in the Amara 1.2 branch. So far the effects are dramatic.

[Uche Ogbuji]

via Copia

Xampl, re: "XML data bindings, static languages, dynamic languages"

In response to XML data bindings, static languages, dynamic languages Bob Hutchison posted some thoughts. As I used Amara as the kernel of my demonstrations, Bob used his project xampl as the kernel of his. He introduces xampl in another entry which was inspired by my own article on EaseXML.

Xampl is a an XML data binding. As Bob writes:

Secondly, there are versions of xampl for Java and Common Lisp. I’ve got an old (summer 2002) version for Ruby that needs updating (I wrote the xampl-pp pull parser to support this experiment).

Bob says that Xampl also deals with things that Elliotte Harold mentions as usual scourges for Java data bindings: mixed content, repeated elements, omitted elements, and element order. Of course these things should be food and drink to any XML tool, and I'm glad folks are finally plugging such gaping holes. Eric van der Vlist is also in the game with TreeBind, and it seems some Java tools try to wriggle out of the pinch by using XQuery.

Based on Bob's snippets, Xampl looks handy. Rather verbose, but no more so than Java pretty much requires. One thing that strikes me in Bob's examples is that Xampl appears to require and create a bogon namespace (http://www.xampl.com/labelsExample). It seems maybe it has something to do with Java packaging or something, but regardless of the role of this fake namespace, the XML represented by Xampl in Bob's snippets is not the same as the XML in the original source examples. An unprefixed element in a namespace is of course not the same thing as an element in the null namespace. I would not accept any tool that involves such a mix-up. It's quite possible that Xampl does not do so, and I'm just misunderstanding Bob's examples.

Bob provides Xampl code to match the EaseXML snippets in my article. Similarly to how EaseXML requires Python framework code, Xampl requires XML framework code. Since "XML situps" have been on the wires lately, they come to mind for a moment, but hey, if you're already processing XML with Xampl, I suppose you might not flinch at one more XML. I will point out that Amara does not require any framework code whatsoever besides the XML itself, not even an XML schema. It effectively provides dynamic code generation.

Xampl turns XML constructs into Java getters, e.g. html.getHead(). Amara uses the Python convention of properties rather than getters and setters, so you have html.head, and you can even assign to this property in order to mutate the XML. Xampl looks neat. The things that turn me off are largely things that are pretty much inevitable in Java, not least the very large amount of code generated by the binding. It supports XPath, as Amara does, and provides a "rhino" option to expose XML objects through Javascript, which offers you a bit more of the flexibility of Python (I don't know how much overhead to expect from Javascript through Java through XML, but it's a question I'd be quick to ask as a user).

It's good to have projects such as Xampl and Treebind and Nux. I'd rather use Python tools such as Amara, Gnosis and GenerateDS, but Java has the visibility and it's good for people to be aware that XML does not necessarily require greater imprisonment of expression than what comes with the application language. You don't need to accept crazy idioms and stifling limitations in matters as fundamental as mixed content and element ordering. XML and sanity can coexist.

[Uche Ogbuji]

via Copia

Windows prebuilt binary package for Amara

Several Amara users have mentioned trying to build for Windows but running into problems, or a requirement for .NET. Sylvain Hellegouarch comes to the rescue with a binary package he built using Amara 1.0, latest 4Suite CVS, Python 2.4.1 and .NET 1.1 (latest patches). Windows users can install this package without having to worry about compilers or any other such hassles. Sylvain posted it under the default name generated by distutils, but I renamed it as appropriate and I'm hosting it on the Amara home page, and in the Amara contrib FTP area. Please let Sylvain and me know (by posting to the 4Suite mailing list with "[Amara]" in the subject) if you have any trouble with this package.

Meanwhile work on the 1.2 branch continues. I've got up to 30% speed and memory usage improvement over Amara Bindery 1.0, in large part by moving from PySax to 4Suite's undocumented Saxlette library (all in highly optimized C). Jeremy is also working on suspend/resume for 4Suite's parser, which would allow for a huge performance boost for pushbind. I'll try to start working on RELAX NG features if I get a chance this week.

[Uche Ogbuji]

via Copia

What Are You Doing, Dave?

I just updated the 4Suite Repository Ontology (as an OWL instance). Specifically, I added appropriate documentation for most of the major components and added rdfs:subPropertyOf/rdfs:subClass/rdfs:seeAlso relationships with appropriate / related vocabularies (WordNet/Foaf/Dublin Core/Wikipedia). In addition, where appropriate, I've added links to 4suite literature (currently scattered between IBM Developer Works articles/tutorials and Uche's Akara sections).

There are some benefits:

  • This can serve as a framework for documenting the 4Suite repository (to augment the very sparse documentation that does exist)
  • Provide a formal model for the underlying RDF Graph that 'drives' the repository

This latter benefit might not be so obvious, but imagine being able to provide rules that cause implications identifying certain repository containers as RSS channels (and their child Xml documents / Rdf document as the corresponding RSS items) and associating Foaf metadata with repository users.

Some of the more powerful hooks to the System RDF graph (which the above ontology is a model of) - such as the starting/stopping of servers (currently triggered by the fchema:server.running property on fchema:server instances), purging of resources marked as temporary (by the fchema:time_to_live property), and triggering of an XSLT transform (by the fchema:run_on_strobe property) - can be further augmented by other associations in the graph, resulting in an almost 'sentient' content/application server. A little far-fetched?

[Uche Ogbuji]

via Copia

Extracting RDF from XML in 'Closed' vs 'Open Systems'

For some time, I had wanted to write a bit about 4Suite's Document Definitions - especially after first reading about the concept of Gleaning Resource Descriptions from Dialects of Languages (GRDDL). You see, the idea isn't so novel to me since I've been involved in 4Suite development for some time and familiar with the concept of a Document Definition. Unfortunately, 4Suite's Achilles heel is documentation (no pun intended), but I've managed to find a representative thread on the subject within the mailing list archives. In addition, I also included a decent definition (by Mike Brown) from his overview of the repository:

A DocumentDefinition is a resource that describes how to derive RDF statements from the XML -- deserialization guidelines, basically. Its content can either be XML or XSLT that follows certain guidelines. When the XmlDocument that is associated with this docdef is created, updated, or deleted, RDF statements will be updated automatically in the user model. This is really powerful, and is described in more detail here (free registration required). As an example, if the XML doc is XHTML, then you could write a docdef to generate a Dublin Core 'title' RDF statement from the /html/head/title element. Anytime the XML doc is updated, the RDF statements derived from it via the docdef will also be updated. These statements, being automatically managed, are stored in the "system" model, but there has been some discussion as to whether that is appropriate and how it might change in the future. Only one docdef can be associated with a document, but docdefs can import definitions from one another, if needed

The primary difference between GRDDL (as I understand the principle) and Document Definitions is that GRDDL is an attempt to provide a mechanism for extracting RDF from microformats (subsets of XHTML) 'in the wild.' The XML content transformed (via XSLT) is often embedded within presentation markup and perhaps constructed w/ little regard to validity (with respect to a governing schema). The value is in being able to harvest RDF content from sources designed with more human readability than machine readability in mind. The sheer number of such documents is a multiplicative factor to how much useful information can be extracted.

Document Definitions on the other hand are meant to work in a closed system where the XML vocabulary is self-contained and most often valid (with respect to a well known format) as well as well-formed (the requirement common to both scenarios). The different contexts are very significant and describe two completely divergent approaches to applying RDF to solve Knowledge Management problems.

There are some well known advantages to writing XML->RDF transforms for closed vocabularies / systems (portability, easing the RDF/XML serialization learning curve,etc..) and there are some that not as well known (IMHO). In particular, writing transforms for closed vocabularies essentially allows the XML vocabulary to behave as a communication medium between systems that 'speak XML' and an RDF datastore.

Consider Bill de hOra's issues with binding forms (HTML in his case) to RDF via the RDF/XML syntax. This is an irresolvable disaster and the culprit is the violent impedance mismatch between the XML and RDF data structures that manifests itself in the well documented horrors of RDF/XML as a persistent representation of an RDF graph.

Consider a more elegant architecture: Building an XForms UI on top of XML instances (associated with - but not necessarily validated by - a schema) and automatically transposed (by a transform written once) to a corresponding RDF graph. The strengths of both data formats are emphasized in this scenario and the impedance mismatch is completely resolved by pushing the onus from forms authoring to a well designed transform (written once only).

[Uche Ogbuji]

via Copia

Amara goes 1.0, gets simpler to install

I released Amara 1.0 today. It's been properly cooked for a while, but life always has its small interruptions. The big change in 1.0 is that I now offer it in a couple of package options.

For those who just need some XML processing, and don't really care about RDF, all you need is Python and Amara-1.0-allinone (grab it from the FTP site). It has a trimmed down subset of 4Suite bundled for one-step install.

For those who want the full complement of what 4Suite has to offer, or for those who've already installed 4Suite anyway, there is the stand alone Amara-1.0 package.

Here's the thing, though. Right now it feels to me that I should be pushing Amara-allinone and not Amara+4Suite. Amara-allinone contains all the 1.0 quality components of 4Suite right now. 4Suite is a combination of a rock solid core XML library, a somewhat out of date RDF library, and a quite rickety server framework. This has been bad for 4Suite. People miss out on the power of its core XML facilities because of the size and uneven quality of the rest of the package. In fact, 4Suite has been stuck in the 1.0 alpha and beta stages for ever, not because the core libraries aren't 1.0 quality (heck, they're 4.x quality), but because we keep hoping to bring the rest of the package up to scratch. It's been on my mind for a while that we should just split the package up to solve this problem. This is what I've done, in effect, with this Amara release.

As for the rest of 4Suite, the RDF engine just needs a parser update for the most recent RDF specifications. The XML/RDF repository probably doesn't need all that much more work before it's ready for 1.0 release. As for the protocol server, as I've said several times before, I think we should just chuck it. Better to use another server package, such as CherryPy.

As for Amara, I'll continue bug fixes on the 1.x branch, but the real fun will be on the 2.x branch where I'll be refactoring a few things.

[Uche Ogbuji]

via Copia

Another small 4Suite MarkupWriter example: XHTML 1.1

I was writing code to emit XHTML 1.1 using 4Suite and just to double-check the doc types I looked at the spec. I thought it might be useful to write up a small MarkupWriter example for emitting the example in the spec.

from Ft.Xml.MarkupWriter import MarkupWriter
from xml.dom import XHTML_NAMESPACE, XML_NAMESPACE

XHTML_NS = unicode(XHTML_NAMESPACE)
XML_NS = unicode(XML_NAMESPACE)
XHTML11_SYSID = u"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
XHTML11_PUBID = u"-//W3C//DTD XHTML 1.1//EN"

writer = MarkupWriter(indent=u"yes", doctypeSystem=XHTML11_SYSID,
                      doctypePublic=XHTML11_PUBID)
writer.startDocument()
writer.startElement(u'html', XHTML_NS, attributes={(u'xml:lang', XML_NS): u'en'})
writer.startElement(u'head', XHTML_NS)
writer.simpleElement(u'title', XHTML_NS, content=u'Virtual Library')
writer.endElement(u'head', XHTML_NS)
writer.startElement(u'body', XHTML_NS)
writer.startElement(u'p', XHTML_NS)
writer.text(u'Moved to ')
writer.simpleElement(u'a', XHTML_NS,
                     attributes={u'href': u'http://vlib.org/'},
                     content=u'vlib.org')
writer.text(u'.')
writer.endElement(u'p', XHTML_NS)
writer.endElement(u'body', XHTML_NS)
writer.endElement(u'html', XHTML_NS)
writer.endDocument()

It's worth mentioning that this example would be even simpler with template output facilities I've proposed for Amara.

[Uche Ogbuji]

via Copia

XML data bindings, static languages, dynamic languages

A discussion about the brokenness of W3C XML Schema (WXS) on XML-DEV turned interestingly to the topic of the limitations of XML data bindings. This thread crystallized into a truly bizarre subthread where we had Mike Champion and Paul Downey actually trying to argue that the silly WXS wart xsi:nil might be more important in XML than mixed content (honestly the arrogance of some of the XML gentry just takes my breath away). As usual it was Eric van der Vlist and Elliotte Harold patiently arguing common sense, and at one point Pete Cordell asked them:

How do you think a data binding app should handle mixed content? We lump a complex types mixed content into a string and stop there, which I don't think is ideal (although it is a common approach). Another approach could be to have strings in your language binding classes (in our case C++) interleaved with the data elements that would store the CDATA parts. Would this be better? Is there a need for both?

Of course as author of Amara Bindery, a Python data binding, my response to this is "it's easy to handle mixed content." Moving on in the thread he elaborates:

Being guilty of being a code-head (and a binding one at that - can it get worse!), I'm keen to know how you'd like us to make a better fist of it. One way of binding the example of "<p>This is <strong>very</strong> important</p>" might be to have a class structure that (with any unused elements ignored) looks like:-

class p
{
    string cdata1;        // = "This is "
    class strong strong;
    string cdata2;        // = " important"
};

class strong
{
    string cdata1;        // = "very"
};

as opposed to (ignoring the CDATA):

class p
{
    class strong strong;
};

class strong
{
};

or (lumping all the mixed text together):

class p
{
    string mixedContent;    // = "<p>This is <strong>very</strong> important</p>"
};

Or do you just decide that binding isn't the right solution in this case, or a hybrid is required?

It looks to me like a problem with poor expressiveness in a statically, strongly typed language. Of course, static versus dynamic is a hot topic these days, and has been since the "scripting language" diss has started to wear thin. But the simple fact is that Amara doesn't even blink at this, and needs a lot less superstructure:

>>> from amara.binderytools import bind_string
>>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
>>> doc.p
<amara.bindery.p object at 0xb7bab0ec>
>>> doc.p.xml()
'<p>This is <strong>very</strong>  important</p>'
>>> doc.p.strong
<amara.bindery.strong object at 0xb7bab14c>
>>> doc.p.strong.xml()
'<strong>very</strong>'
>>> doc.p.xml_children
[u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']

There's the magic. All the XML data is there; it uses the vocabulary of the XML itself in the object model (as expected for a data binding); it maintains the full structure of the mixed content in a very easy way for the user to process. And if we ever decide we just want to content, unmixed, we can just use the usual XPath technique:

>>> doc.p.xml_xpath(u"string(.)")
u'This is very  important'

So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and even Elliotte Harold:

Personally I'd stay away from data binding for use cases like this. Dealing with mixed content is hardly the only problem. You also have to deal with repeated elements, omitted elements, and order. Child elements just don't work well as fields. You can of course fix all this, but then you end up with something about as complicated as DOM.

Data binding is a plausible solution for going from objects and classes to XML documents and schemas; but it's a one-way ride. Going the other direction: from documents and schemas to objects and classes is much more complicated and generally not worth the hassle.

As I hope my Amara example shows, you do not need to end up with anything nearly as complex as DOM, and it's hardly a one-way ride. I think it should be made clear that a lot of the difficulties that seem to stem from Java's own limitations are not general XML processing problems, and thus I do not think they should properly inform a problem such as the emphasis of an XML schema language. In fact, I've [always argued]() that it's the very marrying of XML technology to the limitations of other technologies such as statically-typed OO languages and relational DBMSes that results in horrors such as WXS and XQuery. When designers focus on XML qua XML, as the RELAX NG folks did and the XPath folks did, for example, the results tend to be quite superior.

Eric did point out Amara in the thread.

An interesting side note—a question about non-XHTML use cases of mixed content (one even needs to ask?!) led once again to mention of the most widely underestimated XML modeling problem of all time: the structure of personal names. Peter Gerstbach provided the reminder this time. I've done my bit in the past.

[Uche Ogbuji]

via Copia

Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, ya'll readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use Tagsoup, if available. Making it available on my Linux box was a simple matter of:

wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.

Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
for incident in soup('a', {'href' : re.compile('\\..*mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" (What's your man gotta do with me?..."

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia