Python/XML column #35: EaseXML and more on Unicode

"EaseXML: A Python Data-Binding Tool"

In this month's Python and XML column, Uche Ogbuji examines a new XML data-binding tool for Python: EaseXML. [Jul. 27, 2005]

The main focus of this article is EaseXML, another option for XML data binding. I found the package rather rough around the edges. I also included a section with a bit more on Unicode, which was the topic of the last two articles "Unicode Secrets" and "More Unicode Secrets". This time I introduced the unicodedata module, which provides useful information about characters from the Unicode standard database.
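For a quick taste of the sort of lookups unicodedata supports (a small illustration of my own, not code from the article), here's a peek at a single character:

import unicodedata

char = u'\u0107'                       # LATIN SMALL LETTER C WITH ACUTE
print unicodedata.name(char)           # LATIN SMALL LETTER C WITH ACUTE
print unicodedata.category(char)       # Ll (letter, lowercase)
print unicodedata.decomposition(char)  # 0063 0301
print repr(unicodedata.lookup('LATIN SMALL LETTER C WITH ACUTE'))  # u'\u0107'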

[Uche Ogbuji]

via Copia

XML data bindings, static languages, dynamic languages

A discussion about the brokenness of W3C XML Schema (WXS) on XML-DEV turned interestingly to the topic of the limitations of XML data bindings. This thread crystallized into a truly bizarre subthread where we had Mike Champion and Paul Downey actually trying to argue that the silly WXS wart xsi:nil might be more important in XML than mixed content (honestly the arrogance of some of the XML gentry just takes my breath away). As usual it was Eric van der Vlist and Elliotte Harold patiently arguing common sense, and at one point Pete Cordell asked them:

How do you think a data binding app should handle mixed content? We lump a complex type's mixed content into a string and stop there, which I don't think is ideal (although it is a common approach). Another approach could be to have strings in your language binding classes (in our case C++) interleaved with the data elements that would store the CDATA parts. Would this be better? Is there a need for both?

Of course as author of Amara Bindery, a Python data binding, my response to this is "it's easy to handle mixed content." Moving on in the thread he elaborates:

Being guilty of being a code-head (and a binding one at that - can it get worse!), I'm keen to know how you'd like us to make a better fist of it. One way of binding the example of "<p>This is <strong>very</strong> important</p>" might be to have a class structure that (with any unused elements ignored) looks like:-

class p
{
    string cdata1;        // = "This is "
    class strong strong;
    string cdata2;        // = " important"
};

class strong
{
    string cdata1;        // = "very"
};

as opposed to (ignoring the CDATA):

class p
{
    class strong strong;
};

class strong
{
};

or (lumping all the mixed text together):

class p
{
    string mixedContent;    // = "<p>This is <strong>very</strong> important</p>"
};

Or do you just decide that binding isn't the right solution in this case, or a hybrid is required?

It looks to me like a problem with poor expressiveness in a statically, strongly typed language. Of course, static versus dynamic is a hot topic these days, and has been since the "scripting language" diss started to wear thin. But the simple fact is that Amara doesn't even blink at this, and needs a lot less superstructure:

>>> from amara.binderytools import bind_string
>>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
>>> doc.p
<amara.bindery.p object at 0xb7bab0ec>
>>> doc.p.xml()
'<p>This is <strong>very</strong> important</p>'
>>> doc.p.strong
<amara.bindery.strong object at 0xb7bab14c>
>>> doc.p.strong.xml()
'<strong>very</strong>'
>>> doc.p.xml_children
[u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']

There's the magic. All the XML data is there; it uses the vocabulary of the XML itself in the object model (as expected for a data binding); it maintains the full structure of the mixed content in a very easy way for the user to process. And if we ever decide we just want the content, unmixed, we can just use the usual XPath technique:

>>> doc.p.xml_xpath(u"string(.)")
u'This is very important'

So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and even Elliotte Harold:

Personally I'd stay away from data binding for use cases like this. Dealing with mixed content is hardly the only problem. You also have to deal with repeated elements, omitted elements, and order. Child elements just don't work well as fields. You can of course fix all this, but then you end up with something about as complicated as DOM.

Data binding is a plausible solution for going from objects and classes to XML documents and schemas; but it's a one-way ride. Going the other direction: from documents and schemas to objects and classes is much more complicated and generally not worth the hassle.

As I hope my Amara example shows, you do not need to end up with anything nearly as complex as DOM, and it's hardly a one-way ride. I think it should be made clear that a lot of the difficulties that seem to stem from Java's own limitations are not general XML processing problems, and thus I do not think they should properly inform a question such as the emphasis of an XML schema language. In fact, I've always argued that it's the very marrying of XML technology to the limitations of other technologies such as statically-typed OO languages and relational DBMSes that results in horrors such as WXS and XQuery. When designers focus on XML qua XML, as the RELAX NG folks did and the XPath folks did, for example, the results tend to be quite superior.

Eric did point out Amara in the thread.

An interesting side note—a question about non-XHTML use cases of mixed content (one even needs to ask?!) led once again to mention of the most widely underestimated XML modeling problem of all time: the structure of personal names. Peter Gerstbach provided the reminder this time. I've done my bit in the past.

[Uche Ogbuji]

via Copia

Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn the Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, y'all readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use TagSoup, if available. Making it available on my Linux box was a simple matter of:

wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.
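For the curious, here's a rough sketch of the sort of glue involved (this is not the actual tidy.py demo code; it assumes java is on the path, tagsoup.jar is in the working directory, and relies on TagSoup's habit of writing its XHTML to standard output):

import os
from amara.binderytools import bind_string

def tagsoup_bind_file(fname, tagsoup_jar='tagsoup.jar'):
    #Run TagSoup over the raw HTML, then bind the resulting XHTML with Amara
    pipe = os.popen('java -jar %s %s' % (tagsoup_jar, fname))
    xhtml = pipe.read()
    pipe.close()
    return bind_string(xhtml)

With something like that in place, usage is just doc = tagsoup_bind_file('page.html'), and the result is navigable like any other Amara binding.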

Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup

url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
#Find all anchors whose href ends in ".mp3"; BeautifulSoup accepts a compiled
#regex as an attribute filter
for incident in soup('a', {'href': re.compile(r'\..*mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" ("What's your man gotta do with me?..."):

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia

Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I had some Web scraping to do recently while doing the day job thingie. As with most people who do what I do these days, it's a common task, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best known Python wrapper, uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In the most recent task, however, I had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) to use the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara bindery objects.

I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is some Python script I wrote to list all the mp3 links from a given Web page (for easy download with wget):

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara even in this simple example is how I was able to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See tidy.py for more sample code.
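If you'd rather keep the whole test in XPath 1.0, the usual workaround for the missing ends-with is the substring() idiom. A quick sketch, again using the tidy.py demo's tidy_bind_url:

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#substring(@href, string-length(@href) - 3) picks off the last four characters
EXPR = u"//html:a[substring(@href, string-length(@href) - 3) = '.mp3']"
for link in doc.xml_xpath(EXPR):
    print link.href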

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia

Thinking XML #32: Schema annotation for bottom-up semantic transparency

"Thinking XML: Schema annotation for bottom-up semantic transparency"

Subtitle: Pushing schemata beyond syntax into semantics
Synopsis: Learn more about the different approaches to semantic transparency as Uche Ogbuji discusses what they mean to developers using XML. Whether or not you reuse schemata, you might find it valuable to use formal annotations (as opposed to the informal annotations covered earlier). You gain benefits on several levels by doing so. On the most immediately practical level, you can generate better documentation. A more far-sighted benefit is that it gives you an important measure of semantic transparency. This installment discusses semantic anchors, and gives examples. The author also takes a moment to discuss The XTech Conference 2005.

This is the third part of a mini-series within the column. Previous articles are "State of the art in XML modeling" and "Schema standardization for top-down semantic transparency". In this article I discuss formal schema annotations, the most important tool available for semantic transparency. I started off my exploration of the technique in "Use data dictionary links for XML and Web services schemata". I mentioned why I think schema annotations are so important even in rough and ready use of XML in my discussion of XOXO.

See other articles in the column. Comments here on Copia or on the column's official discussion forum. Next up in Thinking XML, back to Python + WordNet.

[Uche Ogbuji]

via Copia

Python community: Transolution, py lib, encutils, pyxsldoc, PDIS and Picket

In a comment on "We need more solid guidelines for i18n in OSS projects", Fredrik Corneliusson mentioned Transolution, "[a translation] suite project I'm working on [with] an XLIFF editor and a Translation Memory. It's written in Python and we need all the help and testers we can get." I browsed the project site, and it seems to me quite comprehensive and well thought-out. It's heavy on XLIFF, which is pretty heavy stuff in itself, but it does link to projects that allow exchange between .po and XLIFF files. It's certainly great to see Python at the vanguard of XML-based i18n.

I found a couple of new tools from the py lib via Grig Gheorghiu's Weblog entry "py lib gems: greenlets and py.xml", which is good reading for those interested in Python and XML. The py lib is just a bundle of useful Python library add-ons. Grig mentioned sample code for one of the modules, Armin Rigo's greenlets. Greenlets are a "spin-off" from Stackless Python, and thus provide some very interesting code that manipulates Python flow of control in order to support lightweight concurrency (microthreads), and also proper coroutines. I've already been pecking about the edges of what's possible with semi-coroutines, and it has always been clear to me that what Python needs in order to really bring streaming XML processing to life is full coroutine support (which seems to be on the way for Python 2.5). While we wait for full coroutines in Python, Armin gets us started with a greenlets demo that turns PyExpat's callback model into a generator that returns elements as they're parsed. Grig posts this snippet as "iterxml.py" in his entry.
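To give a flavor of the trick, here is my own bare-bones sketch of the idea (not Grig's iterxml.py; I'm assuming the py lib's import path for greenlets here):

import xml.parsers.expat
from py.magic import greenlet #the py lib's home for greenlets at the time (assumption)

def iterelements(xml_source):
    #Run the expat parse inside a greenlet; each start-element event switches
    #control back to the consumer, and the generator yields it
    def run():
        parser = xml.parsers.expat.ParserCreate()
        def start_element(name, attrs):
            producer.parent.switch((name, attrs))
        parser.StartElementHandler = start_element
        parser.Parse(xml_source, True)
        #Falling off the end returns None to the final switch() call
    producer = greenlet(run)
    while True:
        event = producer.switch()
        if event is None:
            break
        yield event

for name, attrs in iterelements("<spam><eggs count='12'/></spam>"):
    print name, attrs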

Grig also touches on Holger Krekel's py.xml, a module for generating XML (and HTML). py.xml is not unlike JAXML, which I covered in "Three More For XML Output". These all seem to project back to Greg Stein's early proposal for an XML generation tool that is as ingrown as possible into Python's syntax, and they're certainly worth a look for others.

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update accommodates changes in what has recently been announced as CherryPy 2.1 beta. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, which is a library for dealing with the encodings of files obtained over HTTP, including XML files. He does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for this older Python cookbook recipe by Paul Prescod and this newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.
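Back to the encoding sniffing for a moment: a bare-bones version of that sort of check looks something like the following (my own quick sketch, not encutils, and it skips much of the careful BOM/charset dance of the Cowan algorithm):

import re

XML_DECL_PAT = re.compile(r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def sniff_xml_encoding(raw):
    #Check for a byte order mark first, then the encoding pseudo-attribute in
    #the XML declaration; fall back on the XML default of UTF-8
    if raw.startswith('\xef\xbb\xbf'):
        return 'utf-8'
    if raw.startswith('\xff\xfe'):
        return 'utf-16-le'
    if raw.startswith('\xfe\xff'):
        return 'utf-16-be'
    match = XML_DECL_PAT.match(raw)
    if match:
        return match.group(1)
    return 'utf-8'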

I recently discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones, which offer Python support. This includes an XML parser based on PyExpat and an XPath implementation based on elementtree.

[Uche Ogbuji]

via Copia

OPML, XOXO, RDF and more on outlining for note-taking

There has been a lot of good comment on my earlier entry on outline formats in XML. I've been pretty busy the past week or so, but I'd better get my thoughts down before they deliquesce.

Bob DuCharme pointed me at Micah's article which includes mention of XOXO. Henri Sivonen asked what I think of it.

Taking the name "outlining format" literally, it's actually just fine. As Micah says:

Some people might feel warmer and fuzzier with elements named outline, topic, item, and so on, or with elements in a freshly minted namespace, but microformats can still claim the semantic high ground, even when reusing XHTML. In the above, the parts of an outline are ordered lists and list items, exactly as the XHTML element names say.

The problem is that what made me start looking into outlining formats was the fact that I'd heard from others that these make such a great format for personal information space organization, and XOXO is just about useless in that regard.

Along that vector, I wonder what a pure outline format is useful for, anyway? I can't remember having ever needed a stand-alone outline document separate from what I'm outlining. If I'm writing a presentation or a long article, I'd prefer to have the table of contents or presentation outline section generated from the titles and structure of the full work. Sure, XOXO might be suitable for such a generated outline, but my exploration is really about hand editing.

In short I think XOXO is just fine for outlining, and yet I can't imagine when I'd ever use it. As others have mentioned, and as I suspected, the entire idea of outlining formats for general note-taking is a big stretch. Danny Ayers mentioned in a comment on the earlier entry that for some, the attraction to OPML is a matter of neat outlining UIs. I've always been conservative in adopting UIs. I use emacs plus the command line for most of my coding, and after trying out a half dozen blog posting tools for posting to Copia, I ended up writing an e-mail-to-post gateway so that I can enter text (markdown) into a UI I'm already familiar with, Evolution's e-mail composition window.

As I said in the earlier entry, full-blown XHTML 2.0 makes more sense than an outlining format for managing a personal information space, and yet it seems too weak to me for this purpose. The weakness, as Danny points out, is semantic. If everything in my personal information space is just a para or an anchor or a list, I'll quickly get lost. As followers of Copia know, my brain is a rat trap of wandering thoughts, and I'm a poster child for the need for clearly expressed semantics.

As an RDF pioneer, I'm happy to use ideas from RDF, but I do not want to type RDF/XML by hand. I've always argued, as Danny Ayers hinted, that RDF should strive hard to be syntax agnostic, especially because RDF/XML is awful syntax. I agree with him that GRDDL is a good way to help rescue XHTML microformats from their semantic soup, and I think this is a better approach than trying to shovel all the metadata into the XHTML header (Dan Brickley mentions this possibility, but I wonder whether he tends to prefer it to GRDDL). GRDDL has a natural draw for me since I've been working with and writing tools for the XML+XSLT=RDF approach for about four years. But when I'm using markup for markup (e.g. in a personal information space) I'd rather have semantic transparency fitting comfortably within the markup, rather than dangling off it as an afterthought. In a nutshell, I want to use the better markup design of:

<to-do>

rather than the kludge of:

<ul class="to-do">

I think there's little excuse for why we don't have the best of both worlds. People should be able to enjoy the relative semantic cleanliness of RDF, but within the simplest constructs of markup, without having to endure the added layer of abstraction of RDF. That added layer of abstraction should only be for those aggregating models. The fact that people would have to pay the "RDF tax" every time they start to scribble in some markup explains why so many markup types dislike RDF. I'm not sure I've found as clear a case for this point as this discussion of extended uses for outlining formats.

Microformats are generally a semantic mess, from what I've seen of them. They do best when they just borrow the semantics of existing formats, as XOXO does, but I think they're not the solution to lightweight-syntax + clean-semantics that the GRDDL pioneers hope for. GRDDL has too much work to do in bringing the rigor of RDF to microformats, and this work should be part of the format itself, not something like GRDDL. I think the missed opportunity here is that XML schema systems cling so stubbornly to syntax-only-syntax. As I've been exploring in articles such as "Use data dictionary links for XML and Web services schemata" (I have a more in-depth look at this in the upcoming Thinking XML article), one can make almost all the gains of RDF by putting the work into the XML schema, rather than heaping the abstraction directly into the XML format. And the schema is where such sophistication belongs.

But back to outlining and personal information spaces: I've tried the personal Wiki approach, and it doesn't work for me. Again Danny nails it: Wiki nodes and links are untyped. This is actually similar to the problem that I have with XHTML, but Wikis are even more of a semantic shambles. In XHTML there is at least a bit of an escape with class="foo". The difficulty of navigating and managing Wikis increases at a much greater rate than their information content, in my experience. My Akara project was in effect an attempt at a more semantically transparent Wiki, but since I wrote that paper I've had almost no time for Akara, unfortunately. I do plan to make it the showcase application for my vision of 4Suite 2.0, and in doing so I have an ally in Luis Miguel Morillas, so there is still hope for Akara, perhaps even more so if I am able to build on Rhizome, which might help eliminate some wheel reinvention.

[Uche Ogbuji]

via Copia

YiJing SVG Plotter

About a year or more ago I had an idea that a simple Python/SVG library could be written to aid the drawing of the very rudimentary components of the yijing in modular fashion, upon which the more complex diagrams could very easily be drawn (programmatically). Philosophically, it can be thought of as extending the concepts within the text into a program that represents the ideas in it. A little beatnik-ish? Well, using SVG, binary numerics, and an understanding of the more fundamental arrangements of the trigrams, I was able to write such a library: YiJingPlotter.py. It takes advantage of the translation of the trigrams to their binary values (see earlier post) in order to draw them in two-dimensional coordinate space (leveraging SVG for this purpose). And in 218 lines of code I was able to write the library as well as two utility functions that produced the two (arguably) most fundamental/useful arrangements of the trigrams in SVG:

FuXi's circular arrangement

Shao Yung's square diagram

Once again I would embed the SVG diagrams, but alas there is still (apparently) no browser-agnostic way to do this (someone inform me if there is).

The library (written in Python) relies on:

I tried to comment as heavily as possible for anyone interested in using the library to generate other diagrams. Comments from the second of the two utility functions are below:

Another demonstration of a classic arrangement drawn using the gua/trigram plotting functions. This is ShaoYong's Square. Probably the most useful (in my opinion) arrangement for observing the relationships between the fully developed 64 gua. Within each row, the lower trigrams are all of the same kind (he referred to them as the 'palace' of earth, mountain, etc.) and within each column the upper trigrams are also of the same kind. So, essentially it is a 2 dimensional plot of the 64 gua where the X coordinate is the upper gua and the Y coordinate is the lower gua. This incredible numeric symmetry comes from simply drawing the gua in ascending binary order from 0 - 63, 8 per line! I've added the English names of the corresponding coordinates so a student can match up the lower/upper gua (by name) to find the gua formed.
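Just to make the arithmetic concrete, here is a scratch sketch of the coordinate math (this is not YiJingPlotter.py, and which three bits you read as the lower versus the upper trigram is a matter of convention):

def gua_trigrams(n):
    #A gua number 0 - 63 in binary order splits into two trigram values 0 - 7:
    #one from the high three bits, one from the low three bits
    return n // 8, n % 8

#Laid out 8 per line in ascending order (the square arrangement), the first
#value is constant along each row and the second along each column
for n in range(64):
    high, low = gua_trigrams(n)
    print '%2d:(%d,%d)' % (n, high, low),
    if n % 8 == 7:
        print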

Note: I'm still unsure of the proper spelling of Shao Yung's name (Wikipedia has it as Shao Yung, however I've seen various references to Shao Yong)

[Chimezie Ogbuji]

via Copia

Wow! XML formats for outlining are complete rubbish

There's no other way to put it other than the title. After having heard a lot about OPML (and having used it as a blind RSS feed exchange format), Ian Forrester's comment that he is using OPML for all his personal note-taking finally pushed me to look seriously at the format. It is just complete and utter garbage. OPML might possibly be the best example of the worst abuses of XML markup. It's really hard to fully express how horrible OPML is, and there is no way in the world that I'll ever be dealing with it directly. The language that I use as a hub for my personal information space needn't be perfect, but it can't make me gag at every other tag.

I looked a bit further and found OML. It's a reaction to the ugliness of OPML and so I expected it would be the ticket for me. It does partially fix perhaps the most immediate and visceral abomination of OPML: the abuse of attributes for prosaic textual content (although why it doesn't completely eliminate the text attribute in favor of a title child element is beyond me). But it leaves a lot of nastiness and introduces some of its own (the idea of item as generic extensibility element is hugely ill-begotten). OML isn't even widely-used enough to just compromise and deal with its flaws. I think I'll consider creating my own language. I can export to OPML via XSLT when I really have to. But I think I can use some of the fixes in OML as a starting point.

"Sharing, the web way", by Danny Ayers is a good outline [n/m] of the horrors of OPML. He does get into the politics as well, which I think are less important to me than the technical flaws. He does state what has always been my reaction to the OPML hype:

But more and more I'm thinking things like blogrolls or whatever are much better handled using something more specific - XBEL or simply (X)HTML (like Netscape bookmarks).

Of course I'd plump for XBEL, but this expresses my general viewpoint. I wanted to look at outlining formats because so many people go on about outline editors and formats as productivity tools. I want to be sure I'm not missing anything. Based on what I've found so far, I'm really confused at what people are gaining in this space.

Danny goes on to say:

If what you want is versatility and be able to combine material from different sources and of potentially different data types, then you really do need something like RDF - the glue of OPML isn't strong enough.

I think that O*ML is just one of those examples that illustrate the limits of RDF. RDF is not the best model for content with a significant prose quotient, and RDF/XML not the best syntax. I think that a format I came up with would have the following characteristics:

  • Lightweight overall framework using other formats such as XBEL and XHTML 2.0 to do the heavy lifting
  • Sound XML design overall
  • Metadata sections that can readily be mapped to RDF model without using RDF/XML directly

Where would that take me that no one else has already charted? Maybe nowhere. I certainly wouldn't plan to spend much time on it (and I might not even bother at all). It's just a bit of exploration suggested to me by what I've found poking around what's there already.

[Uche Ogbuji]

via Copia