Misalignments with the planets

John Clark alerted me that Copia has been missing from Planet XML. I noticed it was also missing from Planet Python. The Planet XML problem turned out to be because the XSLT used to convert the Atom feed to RSS 1.0 was written for Atom 0.3, and so stopped working when I upgraded Copia to Atom 1.0 late last year. I updated the XSLT and that's sorted out (as an unintended result I pwned that planet for a day or two). Planet Python uses a category feed in Atom from Copia, and I think the problem is that the version of Planet used in this aggregator does not yet support Atom 1.0. Planet XML uses its own aggregation software and has supported Atom 1.0 for a while.

There have been moves to update the Planet software (see this message, for example). Now that FeedParser 4.0 is out with Atom 1.0 support, I expect most planets will start to correct their Atom deficiencies.
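
If you want to check how a given feed is being interpreted, Universal Feed Parser makes it easy to see which version it detects. The Copia feed URL below is just for illustration:

import feedparser
d = feedparser.parse("http://copia.ogbuji.net/blog/index.atom")
print d.version    # "atom10" for an Atom 1.0 feed, "atom03" for Atom 0.3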

Meanwhile, John and I have been working with Sylvain Hellegouarch on yet another planet, using our own aggregation software. More on that later.

[Uche Ogbuji]

via Copia

Confusion over Python storage form for Unicode

I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build will either store characters as UCS-2 or UCS-4. Option one is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which would be required if it were UTF-16. Option two is UCS-4, not UTF-32, although the difference in this case truly is academic and probably would only affect people using certain chunks of Private Use Areas.

You can't neatly categorize Python Unicode storage format by platform, either. True, Jython is presently limited to UTF-16 storage, but you can compile CPython to use either UCS-2 or UCS-4 on any platform. To do so, pass --enable-unicode=ucs4 to Python's configure script. To check whether your Python is a UCS-4 build, check that `sys.maxunicode > 65536`.

I would love to say that you don't have to worry whether your Python uses UCS-2 or UCS-4. If you're communicating between Python tools you should be using abstract Unicode objects, which would be seamlessly portable. The problem is that, as I warn at every opportunity, there are serious problems with how Python core libraries handle certain characters in UCS-2 builds, because of the lack of respect for surrogate pairs. It is for this reason that I advise CPython users to always use UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32) is almost always a waste of space, but wasting space is better than munging characters.
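
A quick sketch of the check, along with the surrogate-pair effect I'm talking about (the G clef character below is just an arbitrary example of a character outside the Basic Multilingual Plane):

import sys
if sys.maxunicode > 65536:
    print "UCS-4 (wide) build"
else:
    print "UCS-2 (narrow) build"

astral = u"\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the BMP
print len(astral)         # 1 on a UCS-4 build; 2 (a surrogate pair) on UCS-2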

For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.

Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode, and of its Python implementation, so I bet someone will jump in with some correction, but I think I have the core, practical details right, whereas I think Henri's characterization was confusing.

[Uche Ogbuji]

via Copia

Recipe: fast scan of an XML file for one field

If you have a huge XML file and you need to grab the first instance of a particular field in a fast, memory-efficient manner, a simple one-liner with Amara's pushbind does the trick.

val = unicode(amara.pushbind("book.xml", "title").next())

This returns the text of the first title element in book.xml (which could be DocBook or any other format with a title element), loading hardly any of the file into memory. It also doesn't parse the file beyond the target element. It would be a shade slower to get such an element at the end of a file. For example, the following line gets the title of a DocBook index.

val = unicode(amara.pushbind("book.xml", "index/title").next())

Even when finding an element near the end of a file it's very fast. All my use cases were pretty much instantaneous working with a 150MB file (I'm working on convincing the client to avoid such huge files).

If the target element is not found, you'll get a StopIteration exception.
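
If you'd rather get None back than an exception, a simple guard around the same call does it:

import amara
try:
    val = unicode(amara.pushbind("book.xml", "title").next())
except StopIteration:
    val = None    # no title element anywhere in the document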

[Uche Ogbuji]

via Copia

Thinking XML #34: Search engine enhancement using the XML WordNet server system

Updated—Fixed link to "Serving up WordNet as XML"

"Thinking XML: Search engine enhancement using the XML WordNet server system"

Subtitle: Also, use XSLT to create an RDF/XML representation of the WordNet data
Synopsis: In previous installments of this column, Uche Ogbuji introduced the WordNet natural language database, and showed how to represent database nodes as XML and serve this XML through the Web. In this article, he shows how to convert this XML to an RDF representation, and how to use the WordNet XML server to enrich search engine technology.

This is the final part of a mini-series within the column, following the earlier installments that introduced WordNet and showed how to serve it up as XML.

In this article I write my own flavor of RDF schema for WordNet, a transform for conversion from the XML format presented previously, and a little demo app that shows how you can use WordNet to enhance search with synonym capabilities (and this time it's a much faster approach).

I hope to publicly host the WordNet server I've developed in this series once I get my home page's CherryPy setup updated for 2.2.

See other articles in the column. Comments are welcome here on Copia or on the column's official discussion forum. Next up in Thinking XML: RDF equivalents for the WordNet/XML.

[Uche Ogbuji]

via Copia

4Suite XML 1.0b3

I posted the 4Suite XML 1.0b3 announcement today. This was supposed to be 1.0rc1 but then Jeremy went and added this little feature. Yeah, 4Suite now has full DTD validation, written in C. Just use the ValidatingReader. PyXML is no longer necessary for any 4Suite feature. I just need to figure out whether Jeremy ever sleeps. I hope to move quickly on a 1.0rc1. Perhaps even in January. We'll see.
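
If memory serves, using the validating reader looks something like the following (this is a sketch; check the updated manual for the exact API):

from Ft.Lib import Uri
from Ft.Xml.Domlette import ValidatingReader

# Parse book.xml against its DTD; validity errors raise an exception
doc = ValidatingReader.parseUri(Uri.OsPathToUri("book.xml"))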

I've updated my on-line manual.

[Uche Ogbuji]

via Copia

CVS log since tag?

My usual trick for creating a "What's changed" summary in my projects is to check CVS for commits since the previous release. So if the previous release was 24 October 2005 I run

cvs log -NSd ">2005/10/24"

It would be nice if I could do the same thing while specifying the last revision, rather than a date. I wish I could do:

cvs log -NSr<last-rev>::HEAD

but that seems to work only for numerical revisions rather than tags. Does anyone know of any neat hacks to achieve this? Note: if you prefer to advocate Subversion, that's OK, but at least be sure to specify the precise command to do this with SVN so that others can benefit from the example.
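
For the record, the Subversion incantation I've seen suggested for this goes roughly as follows (the repository URL and revision number are placeholders):

# Find the revision at which the release tag was created
svn log -v --stop-on-copy http://svn.example.org/repo/tags/release-1_0b2

# Then list everything committed on the trunk since that revision
svn log -r 1234:HEAD http://svn.example.org/repo/trunk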

Note: this is coming up for me now because I'm wrapping up the packaging for 4Suite 1.0b3 release. One huge new feature: Full DTD support for all the parsers (written in C by the indefatigable Jeremy). One big fix: build support for 64 bit Intel architecture machines.

[Uche Ogbuji]

via Copia

XML Bookmark Exchange Language (XBEL) gets a proper home

XML Bookmark Exchange Language (XBEL)

The Python XML SIG has had some really great times in its history. One of the highlights was the development of the XML Bookmark Exchange Language (XBEL). In September of 1998, just as I was joining the group, they were developing this bookmark exchange language, which is still used in more browsers and bookmark management projects than any other such format. The XML-SIG has fallen on quiet times, and one of the side effects is that additional work on XBEL has been neglected.

Earlier this year the SIG agreed to give XBEL its own home on SourceForge, but no one stepped up to make it happen until John L. Clark got to it last week (thanks, John).

XBEL's new home is http://sourceforge.net/projects/xbel/. The old home is still up, but I think we should move it to http://xbel.sourceforge.net/, with some updates and perhaps a design refresh (maybe make the page XHTML). We'll be discussing such things on the new XBEL mailing list, so please come join us. The main goal is to add more features to XBEL needed for its original role in browser bookmarks exchange, but I'm also interested in making it a useful format for general Web resource lists such as feed lists (e.g. a superior alternative to OPML).

John wrote up a good summary of recent discussions of XBEL.

I'll have more on our efforts summarized here on Copia as we progress.

[Uche Ogbuji]

via Copia

Follow-up Copia housecleaning

Of course chores lead to more chores. After last week's round of tweaks to Copia I got a suggestion from Aristotle to rearrange the entry-specific titles, and I've done so. I got a bit more info from Tom Passin about possible encoding problems that has only deepened my bafflement.

I also noticed there has been some confusion over last week's birth announcement. It came from Chimezie, not me (congrats, brother!). On Copia the author is specified for each entry, but previously there wasn't any such useful distinction being made in the Atom 0.3 (I'll be working on an Atom 1.0 flavor for PyBlosxom soon) or RSS 1.0 feeds. I've fixed that, but I've done so in a way I'm not sure all feed sinks will process correctly. In the Atom feed there is a top-level

<author>
  <name>Uche and Chimezie Ogbuji</name>
  <url>http://copia.ogbuji.net/blog/</url>
  <email>uche@ogbuji.net</email>
</author>

And then each entry gets a more specific author, for example:

<title>Chikaora Zion Credell Ogbuji</title>
...
<author>
  <name>chimezie</name>
</author>

I hope that helps. I made some other tweaks to the feeds, and this does seem to have had the unfortunate side-effect of pushing everything back onto the front page of Planet XML. My apologies to Planet XML readers (including me: I'd hoped to catch up after the holidays and found only Copia entries).

Copia already tells you the author of each entry, in the info line at the end of the entry.

[Uche Ogbuji]

via Copia