Misalignments with the planets

John Clark alerted me that Copia has been missing from Planet XML. I noticed it was also missing from Planet Python. The Planet XML problem turned out to be that the XSLT used to convert the Atom feed to RSS 1.0 was written for Atom 0.3, and so stopped working when I upgraded Copia to Atom 1.0 late last year. I updated the XSLT and that's sorted out (as an unintended result I pwned that planet for a day or two). Planet Python uses a category feed in Atom from Copia, and I think the problem is that the version of Planet used in this aggregator does not yet support Atom 1.0. Planet XML uses its own aggregation software and has supported Atom 1.0 for a while.

There have been moves to update Planet's Atom support (see this message, for example). Now that FeedParser 4.0 is out with Atom 1.0 support, I expect most planets will start to correct their Atom deficiencies.

Meanwhile, John and I have been working with Sylvain Hellegouarch on yet another planet, using our own aggregation software. More on that later.

[Uche Ogbuji]

via Copia

Finding xforms.xpi for Firefox 1.5

The maintainer of the official extensions page for Mozilla XForms has been very slow to update the main link to point to an xforms.xpi that works with the Firefox 1.5 release. A comment on this page as well as the XForms project page point to a nightly FTP location as the up-to-date source for the XPI. I've used the XPI from that link on one Windows box and two Ubuntu boxes. It worked on all but one Ubuntu box. Today I tried a reinstall for the problem case, but when I tried the above link I got an FTP 550 error ("Failed to change directory"). The directory is still in the index for its parent, so I'm not sure what's up. Indeed, I'm not able to change to any of the children of ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/ with Firefox or ncftp.

I did some searching and found this Australian mirror. The nightly date is different, so I can't be sure it's the same XPI (I don't have the other xforms.xpi to checksum), but perhaps it will help someone else. It didn't help me; that XPI doesn't seem to work any more than the earlier one I tried. XForms content simply doesn't render.

I do hope Mozilla gets its XForms act together. There is some hope now that lead developer Beaufour has found a temporary sponsor (woohoo!). He says:

I guess we should release a new version of the XPI soon too, to include some of the stuff that has been in the trunk for a while now, and I should try to get back to my weekly/bi-weekly “XForms Status Updates” too.

Yes to all that, but do please also make sure people can actually find the extension for use.

[Uche Ogbuji]

via Copia

Confusion over Python storage form for Unicode

I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build will store characters as either UCS-2 or UCS-4. Option one is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which would be required if it were UTF-16. Option two is UCS-4, not UTF-32, although the difference in this case truly is academic and would probably only affect people using certain chunks of the Private Use Areas.

You can't neatly categorize Python's Unicode storage format by platform, either. True, Jython is presently limited to UTF-16 storage, but you can compile CPython to use either UCS-2 or UCS-4 on any platform. To do so, pass `--enable-unicode=ucs4` to the configure script. To check whether your Python is a UCS-4 build, check that `sys.maxunicode > 65535`. I would love to say that you don't have to worry whether your Python uses UCS-2 or UCS-4: if you're communicating between Python tools, you should be using abstract Unicode objects, which would be seamlessly portable. The problem is that, as I warn at every opportunity, there are serious problems with how Python core libraries handle certain characters in UCS-2 builds, because of the lack of respect for surrogate pairs. It is for this reason that I advise CPython users to use UCS-4 builds whenever possible. It's unfortunate that UCS-4 (and even UTF-32) is almost always a waste of space, but wasting space is better than munging characters.
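
For a quick sanity check, here's a minimal Python 2 snippet (the non-BMP character is just an arbitrary illustration):

import sys

# sys.maxunicode is 0xFFFF (65535) on UCS-2 builds
# and 0x10FFFF (1114111) on UCS-4 builds
print hex(sys.maxunicode)

# A character outside the Basic Multilingual Plane shows the difference:
# a UCS-2 build stores it as a surrogate pair (length 2), while a
# UCS-4 build stores it as a single code point (length 1)
c = u'\U00010000'
print len(c)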

For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.

Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode and of its Python implementation, so I bet someone will jump in with some correction. Still, I think I have the core practical details right, whereas Henri's characterization was confusing.

[Uche Ogbuji]

via Copia

Recipe: fast scan of an XML file for one field

If you have a huge XML file and you need to grab the first instance of a particular field in a fast and memory efficient manner, a simple one-liner with Amara's pushbind does the trick.

import amara
val = unicode(amara.pushbind("book.xml", "title").next())

This returns the text of the first title element in book.xml (which could be DocBook or any other format with a title element), loading hardly any of the file into memory. It also doesn't parse the file beyond the target element, which means it would be a shade slower to get such an element at the end of a file. For example, the following line gets the title of a DocBook index.

val = unicode(amara.pushbind("book.xml", "index/title").next())

Even when finding an element near the end of a file it's very fast. All my use cases were pretty much instantaneous working with a 150MB file (I'm working on convincing the client to avoid such huge files).

If the target element is not found, you'll get a StopIteration exception.
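
If you'd rather have a default value than an exception, a minimal guard might look like this (same book.xml example as above):

import amara

try:
    val = unicode(amara.pushbind("book.xml", "index/title").next())
except StopIteration:
    val = None  # no matching element anywhere in the file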

[Uche Ogbuji]

via Copia

Today's XML Wot He Said

Yaay! Starting 2006 with a WHS rather than a WTF. An auspicious sign. OK, OK, the original does not actually mention XML, but it's very relevant. Via Bill de hÓra comes a good summary of why modeling people's names and related constructs is always trickier than you think.

In Thinking XML #31 I discussed this matter in the third section "Naming names", jumping off from a comment from John Cowan on the OpenDocument mailing list.

IMHO (and I've worked on the problem for some years), all attempts to structure names so that they work correctly across cultures (and with scholarship being international now, the problem comes up repeatedly) just don't work....

[Uche Ogbuji]

via Copia

Using .signature in Evolution

I'm just about fed up with Evolution by now, and I'm looking for ways to loosen my dependency on the monster (see upcoming entry). It tries to be way too smart with your data and turns some basic things into hair-pulling matters. One example concerns e-mail signatures. I have a ~/.signature, as has been the convention since the beginning of days. I've found it ridiculously hard to get Evolution to work with that. Evolution 2.x allows you to paste in a signature, or set up a script to generate signatures, but not to use an existing file. I tried hacking up a script along the lines of cat ~/.signature, but that didn't work because Evolution expects you to use <br> in your script output rather than newlines. WTF! I'm not using HTML mail, damn it! Yeah, I could hack around that using sed and all that, but I shouldn't have to. After some googling and experimentation I finally figured out a workaround recipe.

1) Create a signature using Edit -> Preferences, then Composer Preferences -> Signatures -> Add. Give it the name `symlink .sig` and put some easily spotted but inconsequential text in the body. Save it.

2) `grep foo ~/.evolution/signatures/*`, where you replace "foo" with your dummy body text from the last step. This will tell you the file in which Evolution saved the dummy signature.

3) Clobber the dummy signature with a symlink to your .signature file, e.g. `ln -sf ~/.signature ~/.evolution/signatures/signature-5`.

4) Go to the Mail Account preferences and set `symlink .sig` as the signature for each account on the Identity tab.

5) No Evolution restart required (small mercies, folks).
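
If you find yourself repeating steps 2 and 3, here's a small Python sketch that automates them; the dummy text and paths are just the assumptions from the recipe above, nothing Evolution itself defines:

import os, glob

dummy = "foo"  # the easily spotted text from step 1
sigdir = os.path.expanduser("~/.evolution/signatures")
target = os.path.expanduser("~/.signature")

for path in glob.glob(os.path.join(sigdir, "*")):
    if dummy in open(path).read():
        # Clobber the dummy signature with a symlink to ~/.signature
        os.remove(path)
        os.symlink(target, path)
        print "linked", path, "->", target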

[Uche Ogbuji]

via Copia

Ubuntu across the board

My experiments with Ubuntu on two other computers have gone very well, so I bit the bullet over the New Year's holiday (while recuperating from my intercontinental trip) and migrated to Ubuntu on my right-arm machine, my Dell Inspiron 8600.

Setting up the laptop proved trickier than the desktops, but I don't think any of my troubles were Ubuntu's fault. The first pass seemed to go well enough. It was really cool when the greeting chime came on too loud: I reached on a lark for the Dell's special software volume buttons (which had never worked under Fedora), and saw a little volume widget dutifully cranking down the volume for me. That's the famous Ubuntu "it just works" experience.

Then the trouble began. I started to restore key files from the full backup I'd made to an external USB disk. All of a sudden I started getting "unable to write" errors. My file system was read-only. dmesg revealed the dreaded:

[4302337.674000] EXT3-fs error (device hda3): ext3_new_block: Allocating block in system zone - block = 3244032
[4302337.674000] Remounting filesystem read-only

I rebooted a few times, but the file system always found a serious error and remounted read-only after a while. I'm not alone in this. I decided a reinstall was in order. I started the reinstall, but now I started to have a problem where the install disk would boot, but then the installer would claim it could not mount the CD. Again I'm not alone (1), (2). I followed the hints that this might mean a bad install CD and burned a new one. Success this time. It occurred to me that a corrupt CD might have been the reason for the file system problems before, as well.

I resumed the backup, but I now found that the backup tar/bz2 on the USB disk was corrupted. I suspect this might have been related to the file system blow up, because I'd sanity checked the backup before nuking my laptop drive. Oh well, luckily I had a secondary backup of my most important files (my home directory) via rsync to another computer. I restored from that. My Wifi connection kept falling over as I copied Gigs of files via scp, but I got through it. I think this is just persistence of my old Intel Centrino kernel driver problems.

This time I left the install of things such as multimedia and Firefox 1.5 to Automatix, which rocks, and is one of Ubuntu's secret weapons. I've learned that it's much more predictable to install things from Automatix a small group at a time, rather than clicking every checkbox of interest and doing a mass install. The problem with the latter is that the checkboxes recording what's been done by Automatix are not maintained between sessions. You can keep track of what you've done by checking ~/automatix.log.

We'll just have to see over time whether I ever even cast a glance back towards the Red Hat/Fedora family. So far, not a bit of that. Ubuntu impresses my socks off, and I'm glad I've made the switch.

[Uche Ogbuji]

via Copia

Moving FuXi onto the Development Track

[by Chimezie Ogbuji]

I was recently prompted to consider updating FuXi to use the more recent CVS versions of both Pychinko and rdflib. In particular, I've been itching to get Pychinko working with the new rdflib API, which (as I've mentioned) has had its API updated significantly to support (amongst other things) Notation 3 persistence.

Currently, FuXi works with frozen versions of cwm, rdflib, and Pychinko.

I personally find it more effective to work with reasoning capabilities within the context of a query language than as a third-party software library. This was the original motivation for creating FuXi. Specifically, the process of adding inferred statements, dispatching a prospective query, and returning the knowledge base to its original state is a perfect compromise between classic backward and forward chaining.

It frees up both the query processor and persistence layer from the drudgery of logical inference – a daunting software requirement in its own right. Of course, the price paid in this case is the cumbersome software requirements.

It's well worth noting that such on-demand reasoning also provides a practical way to combat the scalability limitations of RDF persistence.

To these ends, I've updated FuXi to work with the current (CVS) versions of rdflib, 4Suite RDF, and Pychinko. It's essentially a rewrite and provides three major modules:

  • FuXi.py (the core component – a means to fire the Pychinko interpreter with facts and rules from rdflib graphs)
  • AgentTools.py (provides utility functions for the parsing and scuttering of remote graphs)
  • VersaFuXiExtensions.py (defines Versa extension functions which provide scutter / reasoning capabilities)

Versa Functions:

reason(expr)

This function takes a Versa expression as a string and evaluates it after executing FuXi using any rules associated with the current graph (via a fuxi:ruleBase property). FuXi (and Pychinko, consequently) use the current graph (and any graphs associated by rdfs:isDefinedBy or rdfs:seeAlso) as the set of facts against which the rules are fired.
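
As a rough sketch of that association in the rdflib API (the namespace and file names here are placeholders, not the actual terms from the bundled ontology):

from rdflib.Graph import Graph
from rdflib import Namespace, URIRef

FUXI = Namespace('http://example.org/fuxi#')  # placeholder namespace

g = Graph()
g.load('facts.rdf')  # the current graph of facts (file name assumed)
# Point the graph at a Notation 3 rule base so reason() can find it
g.add((URIRef('http://example.org/graphs/facts'),
       FUXI['ruleBase'],
       URIRef('http://example.org/rules/my-rules.n3')))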

class(instances)

This function returns the class(es) – rdfs:Class or owl:Class – of the given list of resources. If the current graph has already been extended to include inferred statements (via the reason function, perhaps), it simply returns the objects of all rdf:type statements made against the resources. Otherwise, it registers, compiles, and fires a set of OWL/RDFS rules (a reasonable subset of owl-rules.n3 and rdfs-rules.n3 bundled with Euler) against the current graph (and any associated graphs) before matching classes to return.

type(klasses)

This essentially overrides the default 4Suite RDF implementation of this 'built-in' Versa function which attempts to apply RDFS entailment rules in brute force fashion. It behaves just like class with the exception that it returns instances of the given classes instead (essentially it performs the reverse operation).

scutter(url,expr,steps=5)

This function attempts to apply some best practices in the interpretation of a network of remote RDF graphs. In particular, it uses content negotiation and Scutter principles to parse linked RDF graphs (expressed in either RDF/XML or Notation 3). The main use case for this function (and the primary motivation for writing it) is identity reasoning within a remotely hosted set of RDF graphs (FOAF smushing, for example).

The FuXi software bundle includes a short ontology documenting the two RDF terms: one is used to manage the automated association of a rule base with a graph and the other identifies a graph that has been expanded by inference.

I have yet to write documentation, so this piece essentially attempts to serve that purpose. However, the bundle includes some unittest cases for each of the above functions, which work off a small set of initial facts.

Unfortunately, a majority of the aforementioned software-requirement liability has to do with Pychinko's reliance on the SWAP code base. Initially I began looking for a functional subset to bundle, but later decided that was against the spirit of the combined body of work. So, until a better solution surfaces, the SWAP code can be checked out from CVS like so:

$ cvs -d:pserver:anonymous@dev.w3.org:/sources/public login
password? anonymous
$ cvs -d:pserver:anonymous@dev.w3.org:/sources/public get 2000/10/swap

The latest 4Suite CVS snapshot can be downloaded from ftp://ftp.4suite.org/pub/cvs-snapshots/4Suite-CVS.tar.gz, Pychinko can be retrieved from the Mindswap svn repository, and rdflib from its own svn repository.

Chimezie Ogbuji

via Copia

Back from Nigeria

The Boulder Super Shuttle dropped us off at home at almost 2 a.m. this morning, after a marathon 50-hour trip from Calabar through Abuja, Amsterdam, and Minneapolis (we ended up driving rather than flying back to the airport in Abuja for our return flight).

We're all exhausted but otherwise very happy. It was such a wonderful trip. Lori and the kids loved it. Lori admits that it certainly defied all her preconceived notions of Africa, and even for me there was so much that I found changed after 15 years of absence. I'll write more about it later, but for now I wanted to mention that uche.ogbuji.net was down for the past couple of weeks. The server had a maintenance reboot while I was gone and I hadn't left directions for re-launching the new CherryPy set-up I'd made for the site. I've restarted it now, and will look to add it to the startup scripts.

Next I'll address the backlog of comments.

[Uche Ogbuji]

via Copia

Store-agnostic REGEX Matching and Thread-safe Transactional Support in rdflib

[by Chimezie Ogbuji]

rdflib now has (checked into svn trunk) support for REGEX matching of RDF terms and thread-safe transactional support. The transactional wrapper provides Atomicity and Isolation, but not Durability (a list of reversal RDF operations is stored on the live instance, so they won't survive a system failure). The store implementation is responsible for Consistency.

The REGEX wrapper provides a REGEXTerm, which can be used in any of the RDF term 'slots'. It replaces any REGEX term with a wildcard (None) and performs the REGEX match after the query invocation is dispatched to the store implementation it is wrapping.

Both are meant to work with a live instance of an RDF store, but behave as proxies for it (providing REGEX and/or transactional support).

For example:

from rdflib.Graph import ConjunctiveGraph, Graph
from rdflib.store.REGEXMatching import REGEXTerm, REGEXMatching
from rdflib.store.AuditableStorage import AuditableStorage
from rdflib.store import Store
from rdflib import plugin, URIRef, Literal, BNode, RDF

# Layer the REGEX wrapper, then the transactional wrapper, over a
# live IOMemory store
store = plugin.get('IOMemory', Store)()
regexStorage = REGEXMatching(store)
txRegex = AuditableStorage(regexStorage)

# Load an RSS 1.0 feed into the wrapped graph
g = Graph(txRegex, identifier=URIRef('http://del.icio.us/rss/chimezie'))
g.load("http://del.icio.us/rss/chimezie")

# Match triples whose subject ends in 'zie', then roll back the load
print len(g), [t for t in g.triples((REGEXTerm('.*zie$'), None, None))]
g.rollback()
print len(g), [t for t in g]

Results in:

492 [(u'http://del.icio.us/chimezie', u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', u'http://purl.org/rss/1.0/channel'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/link', u'http://del.icio.us/chimezie'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/items', u'QQxcRclE1'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/description', u''), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/title', u'del.icio.us/chimezie')] 0 []

[Chimezie Ogbuji]

via Copia