Windows prebuilt binary package for Amara

Several Amara users have mentioned trying to build for Windows but running into problems, or a requirement for .NET. Sylvain Hellegouarch comes to the rescue with a binary package he built using Amara 1.0, latest 4Suite CVS, Python 2.4.1 and .NET 1.1 (latest patches). Windows users can install this package without having to worry about compilers or any other such hassles. Sylvain posted it under the default name generated by distutils, but I renamed it as appropriate and I'm hosting it on the Amara home page, and in the Amara contrib FTP area. Please let Sylvain and me know (by posting to the 4Suite mailing list with "[Amara]" in the subject) if you have any trouble with this package.

Meanwhile work on the 1.2 branch continues. I've got up to 30% speed and memory usage improvement over Amara Bindery 1.0, in large part by moving from PySax to 4Suite's undocumented Saxlette library (all in highly optimized C). Jeremy is also working on suspend/resume for 4Suite's parser, which would allow for a huge performance boost for pushbind. I'll try to start working on RELAX NG features if I get a chance this week.

[Uche Ogbuji]

via Copia

Python/XML column #36: Should Python and XML Coexist?

"Python and XML: Should Python and XML Coexist?"

In his latest Python and XML column, Uche Ogbuji claims that the costs of using XML as a little language in a Python application may outweigh the benefits of doing so. [Aug. 25, 2005]

In this article I discuss some of the recent round of complaints about XML in the Python community, trying to give perspective that Python and XML should serve very different domains. Treating them in competition for any particular task is often a more general problem of misunderstanding the basic nature of one technology or the other, and it often leads to overstated complaints.

A correspondent already asked one good follow-up question:

My question is simply: do you have a recommendation for an alternative language (or other protocol) that is more suitable for expressing data structures, preferably one that can be coded for reasonably quickly in Python?

YAML seems to be the front-runner for a cross-language data structure format, although JSON is hot these days, courtesy the AJAX hype. I tend to point to Paul Tchistopolskii's, "Alternatives to XML" for a more comprehensive list.

[Uche Ogbuji]

via Copia

2006 conferences, part 1

"Dallas PyCon bid accepted"

Yay! PyCon moves out of the grey D.C. area. And it moves for 2006 to one of my favorite cities, Dallas, where I lived (Irving) from 1994 to 1996. PyCon comes at what lately has been a tough time of year for me to attend, but I think I'll just have to scrape together the odds for 2006. After all, it will be a chance to spend some time back at my favorite martial arts school (note to Dallas folk. If you're considering martial arts training, don't even think twice: get yourself to Tim Bulot's academy.)

w.r.t Andrew's disclaimer. Never fear. All signed, sealed and delivered, now.

[Uche Ogbuji]

via Copia

Amara goes 1.0, gets simpler to install

I released Amara 1.0 today. It's been properly cooked for a while, but life always has its small interruptions. The big change in 1.0 is that I now offer it in a couple of package options.

For those who just need some XML processing, and don't really care about RDF, all you need is Python and Amara-1.0-allinone (grab it from the FTP site). It has a trimmed down subset of 4Suite bundled for one-step install.

For those who want the full complement of what 4Suite has to offer, or for those who've already installed 4Suite anyway, there is the stand alone Amara-1.0 package.

Here's the thing, though. Right now it feels to me that I should be pushing Amara-allinone and not Amara+4Suite. Amara-allinone contains all the 1.0 quality components of 4Suite right now. 4Suite is a combination of a rock solid core XML library, a somewhat out of date RDF library, and a quite rickety server framework. This has been bad for 4Suite. People miss out on the power of its core XML facilities because of the size and uneven quality of the rest of the package. In fact, 4Suite has been stuck in the 1.0 alpha and beta stages for ever, not because the core libraries aren't 1.0 quality (heck, they're 4.x quality), but because we keep hoping to bring the rest of the package up to scratch. It's been on my mind for a while that we should just split the package up to solve this problem. This is what I've done, in effect, with this Amara release.

As for the rest of 4Suite, the RDF engine just needs a parser update for the most recent RDF specifications. The XML/RDF repository probably doesn't need all that much more work before it's ready for 1.0 release. As for the protocol server, as I've said several times before, I think we should just chuck it. Better to use another server package, such as CherryPy.

As for Amara, I'll continue bug fixes on the 1.x branch, but the real fun will be on the 2.x branch where I'll be refactoring a few things.

[Uche Ogbuji]

via Copia

alt.unicode.kvetch.kvetch.kvetch

Recently there has been a spate of kvetching about Unicode, and in some cases Python's Unicode implementation. Some people feel that it's all too complex, or too arbitrary. This just boggles my mind. Have people chosen to forget how many billions of people there are on the planet? How many languages, cultures and writing systems are represented in this number? What the heck do you expect besides complex and arbitrary? Or do people think it would be better to exclude huge sections of the populace from information technology just to make their programmer lives that marginal bit easier? Or maybe people think it would be easier to deal separately with each of the hundreds of local encoding text schemes that were incorporated into Unicode? What the hell?

Listen. Unicode is hard. Unicode will melt your brain at times, it's the worst system out there, except for all the alternatives. (Yeah, yeah, the old chestnut). But if you want to produce software that is ready for a global user base (actually, you often can't escape the need for character i18n even if you're just writing for your hamlet), you have little choice but to learn it, and you'll be a much better developer once you've learned it properly. Read it from the well-regarded Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". And remember, as Joel says, "There Ain't No Such Thing As Plain Text".

As for Python, that language has an excellent Unicode implementation. A group of brilliant contributors tackled a barge load of pitfalls and complexities involved in shunting full Unicode support into the workings of a language that was already in heavy use. I think they nailed most of the compromises, and again a lot of the things that people rail at as arbitrary and obstructing are so precisely because other (in some cases superficially attractive) alternatives are even worse. Almost all languages that support Unicode have snags of their own (see above: Unicode is hard). It's instructive to read Norbert Lindenberg's valediction in World Views. Norbert is the technical lead for Java internationalization at Sun and in this entry he summarizes the painful process of i18n in Java, including Unicode support. Considering the vast resources available for Java development you have to give Python's developers a lot of credit for doing so well with so few resources.

The most important thing for Pythoneers to remember is that if you go through the discipline of using Unicode properly in all your APIs, as I urge over and over again in my Python XML column, you will not even notice most of these snags. My biggest gripe with Python's implementation: confusion between code points and storage units in several of the "string" APIs, is addressed in practice (through theoretical fudge, to be precise) by always using Python compiled for UCS4 character storage, as I discussed at the beginning of "More Unicode Secrets". See that article's sidebar for more background on this issue.

Because Python currently has both legacy strings and Unicode objects and because the APIs for these overlap, there is some room for confusion (again, there was little practical choice considering reasonable requirements for backwards compatibility). The solution is discipline. We long ago went through and adopted a coherent Unicode policy for 4Suite's core libraries. It was painful, but eliminated a great number of issues. Martijn Faassen mentions similar experience in one of his projects.

It is worth a word on the most common gripe: ASCII for the default site encoding rather than UTF-8. Marc-Andre has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python. I recommend it overall, but in particular see pages 23 through 26 for discussion of this trade-off. One thing that would be nice is if print could be smarter about the output encoding. Right now, if you're trying to do:

print <unicode>, <int>, <unicode>, <string>

Then to be safe you have to do:

print <unicode>.encode("utf-8"), <int>, <unicode>.encode("utf-8"),

or:

out = codecs.getwriter(mylocale_encoding)(sys.stdout)
print >> out, <unicode>, <int>, <unicode>, <string>

or one of the other variations on this theme. The thing is that in Python using encoding assumed from locale would be a bit of an all-or-nothing problem, which means that the most straightforward solution to this need would unravel all the important compromises of Python's Unicode implementation, and cause portability problems. Just to be clear on what this specious solution is, here is the relevant quote from the Unicode PEP: # 100:

Note that the default site.py startup module contains disabled optional code which can set the according to the encoding defined by the current locale. The locale module is used to extract the encoding from the locale default settings defined by the OS environment (see locale.py). If the encoding cannot be determined, is unkown or unsupported, the code defaults to setting the to 'ascii'. To enable this code, edit the site.py file or place the appropriate code into the sitecustomize.py module of your Python installation.

Don't to this. I admit the wart, but I don't see it as fundamental. I just see it as a gap in the API. If we had a cleaner way of writing "print-according-to-locale", I think we could close that docket.

But the bottom line is that glibly saying that "Unicode sucks" or that "Python's Unicode sucks" because of these inevitable pitfalls is understandable as vented frustration when tackling programming problems (heck, integers suck, pixels suck, binary logic sucks, the Internet sucks, and so on: you get the point). But it's important for the rant readers not to get the wrong impression. Pay attention for just a minute:

  • Unicode solves a complex and hard problem, trying to make things as simple as possible, but no simpler
  • Python's Unicode implementation reflects the necessary complexity of Unicode, with the addition of compatibility and portability concerns
  • You can avoid almost all the common pitfalls of Python and Unicode by applying a discipline consisting of a handful of rules
  • If you insist on the quickest hack out of a Unicode-related problem, you're probably asking for a dozen other problems to eventually take its place

My recent articles on Unicode include a modest XML bent, but there's more than enough on plain old Python and Unicode for me to recommend them:

At the bottom of the first two are resources and references that cover just about everything you ever wanted or needed to know about Unicode in Python.

[Uche Ogbuji]

via Copia

Another small 4Suite MarkupWriter example: XHTML 1.1

I was writing code to emit XHTML 1.1 using 4Suite and just to double-check the doc types I looked at the spec. I thought it might be useful to write up a small MarkupWriter example for emitting the example in the spec.

from Ft.Xml.MarkupWriter import MarkupWriter
from xml.dom import XHTML_NAMESPACE, XML_NAMESPACE

XHTML_NS = unicode(XHTML_NAMESPACE)
XML_NS = unicode(XML_NAMESPACE)
XHTML11_SYSID = u"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
XHTML11_PUBID = u"-//W3C//DTD XHTML 1.1//EN"

writer = MarkupWriter(indent=u"yes", doctypeSystem=XHTML11_SYSID,
                      doctypePublic=XHTML11_PUBID)
writer.startDocument()
writer.startElement(u'html', XHTML_NS, attributes={(u'xml:lang', XML_NS): u'en'})
writer.startElement(u'head', XHTML_NS)
writer.simpleElement(u'title', XHTML_NS, content=u'Virtual Library')
writer.endElement(u'head', XHTML_NS)
writer.startElement(u'body', XHTML_NS)
writer.startElement(u'p', XHTML_NS)
writer.text(u'Moved to ')
writer.simpleElement(u'a', XHTML_NS,
                     attributes={u'href': u'http://vlib.org/'},
                     content=u'vlib.org')
writer.text(u'.')
writer.endElement(u'p', XHTML_NS)
writer.endElement(u'body', XHTML_NS)
writer.endElement(u'html', XHTML_NS)
writer.endDocument()

It's worth mentioning that this example would be even simpler with template output facilities I've proposed for Amara.

[Uche Ogbuji]

via Copia

Python/XML column #35: EaseXML and more on Unicode

"EaseXML: A Python Data-Binding Tool"

In this month's Python and XML column, Uche Ogbuji examines a new XML data-binding tool for Python: EaseXML. [Jul. 27, 2005]

The main focus of this article is EaseXML, another option for XML data binding. I found the package rather rough around the edges. I also included a section with a bit more on Unicode, which was the topic of the last two articles "Unicode Secrets" and "More Unicode Secrets". This time I introduced the unicodedata module, which provides useful information about characters from the Unicode standard database.

[Uche Ogbuji]

via Copia

XML data bindings, static languages, dynamic languages

A discussion about the brokenness of W3C XML Schema (WXS) on XML-DEV turned interestingly to the topic of the limitations of XML data bindings. This thread crystallized into a truly bizarre subthread where we had Mike Champion and Paul Downey actually trying to argue that the silly WXS wart xsi:nil might be more important in XML than mixed content (honestly the arrogance of some of the XML gentry just takes my breath away). As usual it was Eric van der Vlist and Elliotte Harold patiently arguing common sense, and at one point Pete Cordell asked them:

How do you think a data binding app should handle mixed content? We lump a complex types mixed content into a string and stop there, which I don't think is ideal (although it is a common approach). Another approach could be to have strings in your language binding classes (in our case C++) interleaved with the data elements that would store the CDATA parts. Would this be better? Is there a need for both?

Of course as author of Amara Bindery, a Python data binding, my response to this is "it's easy to handle mixed content." Moving on in the thread he elaborates:

Being guilty of being a code-head (and a binding one at that - can it get worse!), I'm keen to know how you'd like us to make a better fist of it. One way of binding the example of "<p>This is <strong>very</strong> important</p>" might be to have a class structure that (with any unused elements ignored) looks like:-

class p
{
    string cdata1;        // = "This is "
    class strong strong;
    string cdata2;        // = " important"
};

class strong
{
    string cdata1;        // = "very"
};

as opposed to (ignoring the CDATA):

class p
{
    class strong strong;
};

class strong
{
};

or (lumping all the mixed text together):

class p
{
    string mixedContent;    // = "<p>This is <strong>very</strong> important</p>"
};

Or do you just decide that binding isn't the right solution in this case, or a hybrid is required?

It looks to me like a problem with poor expressiveness in a statically, strongly typed language. Of course, static versus dynamic is a hot topic these days, and has been since the "scripting language" diss has started to wear thin. But the simple fact is that Amara doesn't even blink at this, and needs a lot less superstructure:

>>> from amara.binderytools import bind_string
>>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
>>> doc.p
<amara.bindery.p object at 0xb7bab0ec>
>>> doc.p.xml()
'<p>This is <strong>very</strong>  important</p>'
>>> doc.p.strong
<amara.bindery.strong object at 0xb7bab14c>
>>> doc.p.strong.xml()
'<strong>very</strong>'
>>> doc.p.xml_children
[u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']

There's the magic. All the XML data is there; it uses the vocabulary of the XML itself in the object model (as expected for a data binding); it maintains the full structure of the mixed content in a very easy way for the user to process. And if we ever decide we just want to content, unmixed, we can just use the usual XPath technique:

>>> doc.p.xml_xpath(u"string(.)")
u'This is very  important'

So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and even Elliotte Harold:

Personally I'd stay away from data binding for use cases like this. Dealing with mixed content is hardly the only problem. You also have to deal with repeated elements, omitted elements, and order. Child elements just don't work well as fields. You can of course fix all this, but then you end up with something about as complicated as DOM.

Data binding is a plausible solution for going from objects and classes to XML documents and schemas; but it's a one-way ride. Going the other direction: from documents and schemas to objects and classes is much more complicated and generally not worth the hassle.

As I hope my Amara example shows, you do not need to end up with anything nearly as complex as DOM, and it's hardly a one-way ride. I think it should be made clear that a lot of the difficulties that seem to stem from Java's own limitations are not general XML processing problems, and thus I do not think they should properly inform a problem such as the emphasis of an XML schema language. In fact, I've [always argued]() that it's the very marrying of XML technology to the limitations of other technologies such as statically-typed OO languages and relational DBMSes that results in horrors such as WXS and XQuery. When designers focus on XML qua XML, as the RELAX NG folks did and the XPath folks did, for example, the results tend to be quite superior.

Eric did point out Amara in the thread.

An interesting side note—a question about non-XHTML use cases of mixed content (one even needs to ask?!) led once again to mention of the most widely underestimated XML modeling problem of all time: the structure of personal names. Peter Gerstbach provided the reminder this time. I've done my bit in the past.

[Uche Ogbuji]

via Copia

Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn Amara XML toolkit into a super duper HTML slurper creating XHTML data binding objects. Tidy was the weapon. Well, ya'll readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use Tagsoup, if available. Making it available on my Linux box was a simple matter of:

wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.

Next up Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having a fully cleaned up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
for incident in soup('a', {'href' : re.compile('\\..*mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl, soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" (What's your man gotta do with me?..."

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia

Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I wanted to do some Web scraping to do recently while doing the day job thingie. As with most people who do what I do these days it's a common task, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best known Python wrapper uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In the most recent task I, however, had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) to use the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara bindery objects.

I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is some Python script I wrote to list all the mp3s links from a given Web page, (for easy download with wget):

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara even in this simple example is how I was am to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See tidy.py for more sample code.

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia