Extracting RDF from XML in 'Closed' vs 'Open Systems'

For some time I had wanted to write a bit about 4Suite's Document Definitions, especially after first reading about the concept of Gleaning Resource Descriptions from Dialects of Languages (GRDDL). You see, the idea isn't so novel to me, since I've been involved in 4Suite development for some time and am familiar with the concept of a Document Definition. Unfortunately, 4Suite's Achilles heel is documentation (no pun intended), but I've managed to find a representative thread on the subject within the mailing list archives. I've also included a decent definition (by Mike Brown) from his overview of the repository:

A DocumentDefinition is a resource that describes how to derive RDF statements from the XML -- deserialization guidelines, basically. Its content can either be XML or XSLT that follows certain guidelines. When the XmlDocument that is associated with this docdef is created, updated, or deleted, RDF statements will be updated automatically in the user model. This is really powerful, and is described in more detail here (free registration required). As an example, if the XML doc is XHTML, then you could write a docdef to generate a Dublin Core 'title' RDF statement from the /html/head/title element. Anytime the XML doc is updated, the RDF statements derived from it via the docdef will also be updated. These statements, being automatically managed, are stored in the "system" model, but there has been some discussion as to whether that is appropriate and how it might change in the future. Only one docdef can be associated with a document, but docdefs can import definitions from one another, if needed

The primary difference between GRDDL (as I understand the principle) and Document Definitions is that GRDDL is an attempt to provide a mechanism for extracting RDF from microformats (subsets of XHTML) 'in the wild.' The XML content transformed (via XSLT) is often embedded within presentation markup and perhaps constructed with little regard to validity (with respect to a governing schema). The value is in being able to harvest RDF content from sources designed with more human readability than machine readability in mind. The sheer number of such documents multiplies how much useful information can be extracted.
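
To make the mechanics concrete, here is a minimal sketch of that sort of harvest using 4Suite's XSLT processor. The file names and the transform are placeholders, and in a real GRDDL scenario the transform would be discovered from the document's own profile or link metadata rather than hard-coded.

from Ft.Xml.Xslt import Processor
from Ft.Xml import InputSource

# Placeholder names: extract-dc.xslt stands in for a GRDDL transform,
# some-page.xhtml for an XHTML document found 'in the wild'
processor = Processor.Processor()
transform = InputSource.DefaultFactory.fromUri("extract-dc.xslt")
source = InputSource.DefaultFactory.fromUri("some-page.xhtml")

processor.appendStylesheet(transform)
rdf_xml = processor.run(source)  # RDF/XML serialization of the harvested statements
print rdf_xml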

Document Definitions on the other hand are meant to work in a closed system where the XML vocabulary is self-contained and most often valid (with respect to a well known format) as well as well-formed (the requirement common to both scenarios). The different contexts are very significant and describe two completely divergent approaches to applying RDF to solve Knowledge Management problems.

There are some well-known advantages to writing XML->RDF transforms for closed vocabularies / systems (portability, easing the RDF/XML serialization learning curve, etc.), and there are some that are not as well known (IMHO). In particular, writing transforms for closed vocabularies essentially allows the XML vocabulary to behave as a communication medium between systems that 'speak XML' and an RDF datastore.

Consider Bill de hOra's issues with binding forms (HTML in his case) to RDF via the RDF/XML syntax. This is an irresolvable disaster and the culprit is the violent impedance mismatch between the XML and RDF data structures that manifests itself in the well documented horrors of RDF/XML as a persistent representation of an RDF graph.

Consider a more elegant architecture: an XForms UI built on top of XML instances (associated with, but not necessarily validated by, a schema) that are automatically transposed, by a transform written once, to a corresponding RDF graph. The strengths of both data formats are emphasized in this scenario, and the impedance mismatch is completely resolved by pushing the onus from forms authoring onto a single well-designed transform.
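
For illustration, here is the back half of that pipeline sketched in a few lines: the transform's RDF/XML output loaded into a graph and walked as data. rdflib is assumed here purely as an example store (it is not part of 4Suite), and the RDF/XML snippet is made up.

from rdflib import Graph

# Made-up RDF/XML, standing in for the output of the transform
rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>Example document</dc:title>
  </rdf:Description>
</rdf:RDF>"""

graph = Graph()
graph.parse(data=rdf_xml, format="xml")

# Each statement is now available as a (subject, predicate, object) triple
for subject, predicate, obj in graph:
    print subject, predicate, obj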

[Uche Ogbuji]

via Copia

2006 conferences, part 3

Semantic Technology Conference 2006

Copia wasn't around for me to post about what a great time I had at STC05, which was in downtown San Francisco, but I did type up some notes in my developerWorks column. It doesn't hurt that my presentation was packed (pretty big room, too) with very attentive folks who provided excellent feedback throughout the rest of the conference. I'm wondering whether to propose the same talk again (updated, of course). I'm definitely looking forward to the next one (March 6-9 in San Jose). STC05 was quite visibly a big success, and if this one is as well organized, I expect it will become a fixture in my calendar year to year.

Three is probably my conference quotient for next year, but who knows? Things may work out for me to wander yet more. Extreme is the conference that I regret missing most year after year, but August is almost always an impossible time for me. We'll see about 2006.

[Uche Ogbuji]

via Copia

2006 conferences, part 2

"Save the date: XTech 2006"

The foundations are in place for my favorite XML conference. Have I mentioned how hard XTech 2005 rocked? Hmm? Haven't I? Ah. Indeed I have. The reviews were pretty much overwhelmingly bright.

And perhaps the best news of all is that the conference is still in Amsterdam, but it's moving from the sterile RAI to the Krasnapolsky Hotel on Dam Square. More time to frolic in that fly city center.

As for XML 2005, Atlanta, I'd thought briefly about going (submitting a late-breaking talk and all that), but I'm through my conference quota for the year, and looking through the program I honestly didn't find enough to inspire me to make the stretch. I do hope they follow Edd's lead and bring in some fresh blood (topics-wise) for future conferences.

[Uche Ogbuji]

via Copia

2006 conferences, part 1

"Dallas PyCon bid accepted"

Yay! PyCon moves out of the grey D.C. area, and for 2006 it moves to one of my favorite cities, Dallas, where I lived (in Irving) from 1994 to 1996. PyCon comes at what lately has been a tough time of year for me to attend, but I think I'll just have to scrape together the odds for 2006. After all, it will be a chance to spend some time back at my favorite martial arts school. (Note to Dallas folk: if you're considering martial arts training, don't even think twice: get yourself to Tim Bulot's academy.)

w.r.t. Andrew's disclaimer: never fear. All signed, sealed and delivered now.

[Uche Ogbuji]

via Copia

"The Triumph of Bullshit"

"Bullshit: invented by T.S. Eliot in 1910?"—Mark Liberman, Language Log

This entry discusses one of the conjectures about the origin of the word "bullshit", including discussion of a characteristically phlegmatic poem by T.S. Eliot. Eliot has always been a very nasty sort, and you can perceive that from far less than a reading of "Burbank with a Baedeker: Bleistein with a Cigar" or accounts of his treatment of his first wife, Vivienne. As with most student poets, I'm in awe of his genius, and I intend to learn as much from him as possible in a literary sense, but I find him in many ways a personally despicable figure. Even Ezra Pound, who paid dearly for his own egotistic sense of mores, is a far more sympathetic figure. Pound's punishment was excessive (especially considering the general hypocrisy of his prosecution), and he did repent much of his petty bigotry late in life.

I don't remember ever having seen the Eliot poem quoted in the above article, though I've found a lot of Eliot rarities. It's likely that if I did, I shrugged it out of my memory. It uses classic Ballade structure, three stanzas and an envoi, with an unconventional rhyme scheme (for the classic overall effect, see, for example, Villon's "L'Épitaphe (Ballade des pendus)"). Eliot translates the passion of the Ballade into plain spite.

Ladies, who find my intentions ridiculous
Awkward insipid and horribly gauche
Pompous, pretentious, ineptly meticulous
Dull as the heart of an unbaked brioche
Floundering versicles feebly versiculous
Often attenuate, frequently crass
Attempts at emotions that turn isiculous,
For Christ's sake stick it up your ass.

Eliot—second stanza of "The Triumph of Bullshit"

Horrid genius. Eliot attaches several senses to "ladies", including (and this is the sense that does find best concord with the poem), the society matrons who influenced popular, and hence critical, taste. But Eliot is also a bit of a coward here. What is it that he did finally offer the "ladies", that made his fortune?

Time for you and time for me.
And time yet for a hundred indecisions,
And for a hundred visions and revisions,
Before the taking of a toast and tea.

In the room the women come and go
Talking of Michelangelo.

And indeed there will be time
To wonder, "Do I dare?" and, "Do I dare?"
Time to turn back and descend the stair,
With a bald spot in the middle of my hair—

Sure, he's still lampooning the Society of Taste, but in public he doesn't dare not put himself under the glass as well, and he seeks indulgence and sympathy as an object of ridicule.

There is also his extraction of Ophelia from Vivienne (or was that Viv doing herself?) from "A Game of Chess": Good night, ladies, good night, sweet ladies, good night, good night. Spoken with a knowing wink.

No, when it's time for brave, open sally, Eliot prefers weak targets. My thanks to Mark, though, for finding a poem that is as interesting as a badge of character and illustration of craft as it is an etymological marker.

[Uche Ogbuji]

via Copia

Amara goes 1.0, gets simpler to install

I released Amara 1.0 today. It's been properly cooked for a while, but life always has its small interruptions. The big change in 1.0 is that I now offer it in a couple of package options.

For those who just need some XML processing, and don't really care about RDF, all you need is Python and Amara-1.0-allinone (grab it from the FTP site). It has a trimmed down subset of 4Suite bundled for one-step install.
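
To give a sense of what that buys you, here is a minimal bindery sketch; the file name and element names are made up.

import amara

doc = amara.parse("labels.xml")  # made-up file name
# Repeated child elements iterate naturally as Python attributes
for label in doc.labels.label:
    print unicode(label.name), unicode(label.address.city)
print doc.xml()  # re-serialize the whole document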

For those who want the full complement of what 4Suite has to offer, or for those who've already installed 4Suite anyway, there is the stand alone Amara-1.0 package.

Here's the thing, though. Right now it feels to me that I should be pushing Amara-allinone and not Amara+4Suite. Amara-allinone contains all the 1.0-quality components of 4Suite right now. 4Suite is a combination of a rock-solid core XML library, a somewhat out-of-date RDF library, and a quite rickety server framework. This has been bad for 4Suite. People miss out on the power of its core XML facilities because of the size and uneven quality of the rest of the package. In fact, 4Suite has been stuck in the 1.0 alpha and beta stages forever, not because the core libraries aren't 1.0 quality (heck, they're 4.x quality), but because we keep hoping to bring the rest of the package up to scratch. It's been on my mind for a while that we should just split the package up to solve this problem. This is what I've done, in effect, with this Amara release.

As for the rest of 4Suite, the RDF engine just needs a parser update for the most recent RDF specifications. The XML/RDF repository probably doesn't need all that much more work before it's ready for 1.0 release. As for the protocol server, as I've said several times before, I think we should just chuck it. Better to use another server package, such as CherryPy.

As for Amara, I'll continue bug fixes on the 1.x branch, but the real fun will be on the 2.x branch where I'll be refactoring a few things.

[Uche Ogbuji]

via Copia

alt.unicode.kvetch.kvetch.kvetch

Recently there has been a spate of kvetching about Unicode, and in some cases Python's Unicode implementation. Some people feel that it's all too complex, or too arbitrary. This just boggles my mind. Have people chosen to forget how many billions of people there are on the planet? How many languages, cultures and writing systems are represented in this number? What the heck do you expect besides complex and arbitrary? Or do people think it would be better to exclude huge sections of the populace from information technology just to make their programmer lives that marginal bit easier? Or maybe people think it would be easier to deal separately with each of the hundreds of local encoding text schemes that were incorporated into Unicode? What the hell?

Listen. Unicode is hard. Unicode will melt your brain at times; it's the worst system out there, except for all the alternatives (yeah, yeah, the old chestnut). But if you want to produce software that is ready for a global user base (actually, you often can't escape the need for character i18n even if you're just writing for your hamlet), you have little choice but to learn it, and you'll be a much better developer once you've learned it properly. Read it from the well-regarded Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". And remember, as Joel says, "There Ain't No Such Thing As Plain Text".

As for Python, that language has an excellent Unicode implementation. A group of brilliant contributors tackled a barge load of pitfalls and complexities involved in shunting full Unicode support into the workings of a language that was already in heavy use. I think they nailed most of the compromises, and again a lot of the things that people rail at as arbitrary and obstructing are so precisely because other (in some cases superficially attractive) alternatives are even worse. Almost all languages that support Unicode have snags of their own (see above: Unicode is hard). It's instructive to read Norbert Lindenberg's valediction in World Views. Norbert is the technical lead for Java internationalization at Sun and in this entry he summarizes the painful process of i18n in Java, including Unicode support. Considering the vast resources available for Java development you have to give Python's developers a lot of credit for doing so well with so few resources.

The most important thing for Pythoneers to remember is that if you go through the discipline of using Unicode properly in all your APIs, as I urge over and over again in my Python XML column, you will not even notice most of these snags. My biggest gripe with Python's implementation, the confusion between code points and storage units in several of the "string" APIs, is addressed in practice (through theoretical fudge, to be precise) by always using Python compiled for UCS4 character storage, as I discussed at the beginning of "More Unicode Secrets". See that article's sidebar for more background on this issue.
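
A quick way to see the gripe in action is to take a character outside the Basic Multilingual Plane and ask for its length. On a UCS2 (narrow) Python 2 build len() reports two storage units; on a UCS4 (wide) build it reports one code point.

# MUSICAL SYMBOL G CLEF, a non-BMP character
c = u"\U0001D11E"
print len(c)    # 2 on a UCS2 (narrow) build, 1 on a UCS4 (wide) build
print repr(c)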

Because Python currently has both legacy strings and Unicode objects and because the APIs for these overlap, there is some room for confusion (again, there was little practical choice considering reasonable requirements for backwards compatibility). The solution is discipline. We long ago went through and adopted a coherent Unicode policy for 4Suite's core libraries. It was painful, but eliminated a great number of issues. Martijn Faassen mentions similar experience in one of his projects.
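
The discipline boils down to a handful of rules that fit in a few lines: decode bytes to Unicode at the input boundary, keep everything Unicode inside, and encode only at the output boundary. The file names and encodings below are just for illustration.

import codecs

# Decode at the input boundary (assuming the file is known to be UTF-8)
f = codecs.open("input.txt", "r", encoding="utf-8")
text = f.read()    # a unicode object from here on
f.close()

# Work with unicode objects internally
text = text.upper()

# Encode only at the output boundary
out = codecs.open("output.txt", "w", encoding="utf-8")
out.write(text)
out.close()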

The most common gripe is worth a word: ASCII as the default site encoding rather than UTF-8. Marc-Andre has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python. I recommend it overall, but in particular see pages 23 through 26 for discussion of this trade-off. One thing that would be nice is if print could be smarter about the output encoding. Right now, if you're trying to do:

print <unicode>, <int>, <unicode>, <string>

Then to be safe you have to do:

print <unicode>.encode("utf-8"), <int>, <unicode>.encode("utf-8"), <string>

or:

out = codecs.getwriter(mylocale_encoding)(sys.stdout)
print >> out, <unicode>, <int>, <unicode>, <string>

or one of the other variations on this theme. The thing is that in Python, an output encoding assumed from the locale would be a bit of an all-or-nothing affair, which means that the most straightforward solution to this need would unravel all the important compromises of Python's Unicode implementation and cause portability problems. Just to be clear on what this specious solution is, here is the relevant quote from the Unicode PEP (PEP 100):

Note that the default site.py startup module contains disabled optional code which can set the <default encoding> according to the encoding defined by the current locale. The locale module is used to extract the encoding from the locale default settings defined by the OS environment (see locale.py). If the encoding cannot be determined, is unknown or unsupported, the code defaults to setting the <default encoding> to 'ascii'. To enable this code, edit the site.py file or place the appropriate code into the sitecustomize.py module of your Python installation.

Don't do this. I admit the wart, but I don't see it as fundamental. I just see it as a gap in the API. If we had a cleaner way of writing "print-according-to-locale", I think we could close that docket.
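
Just to be clear about what such a gap-filler might look like, here is a hypothetical helper, not part of the standard library, that wraps sys.stdout according to the locale. locale.getpreferredencoding and codecs.getwriter are real; the lprint name is made up.

import sys, codecs, locale

def lprint(*objects):
    # Hypothetical "print-according-to-locale" helper (not stdlib)
    encoding = locale.getpreferredencoding() or "ascii"
    out = codecs.getwriter(encoding)(sys.stdout, errors="replace")
    out.write(u" ".join(unicode(obj) for obj in objects) + u"\n")

lprint(u"caf\u00e9", 42, "plain string")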

But the bottom line is that glibly saying that "Unicode sucks" or that "Python's Unicode sucks" because of these inevitable pitfalls is understandable as vented frustration when tackling programming problems (heck, integers suck, pixels suck, binary logic sucks, the Internet sucks, and so on: you get the point). But it's important for the rant readers not to get the wrong impression. Pay attention for just a minute:

  • Unicode solves a complex and hard problem, trying to make things as simple as possible, but no simpler
  • Python's Unicode implementation reflects the necessary complexity of Unicode, with the addition of compatibility and portability concerns
  • You can avoid almost all the common pitfalls of Python and Unicode by applying a discipline consisting of a handful of rules
  • If you insist on the quickest hack out of a Unicode-related problem, you're probably asking for a dozen other problems to eventually take its place

My recent articles on Unicode include a modest XML bent, but there's more than enough on plain old Python and Unicode for me to recommend them:

At the bottom of the first two are resources and references that cover just about everything you ever wanted or needed to know about Unicode in Python.

[Uche Ogbuji]

via Copia

J-Horror week, or cultural roots of tastes in horror movies

This past weekend I took the opportunity of vacation time and irregular sleeping schedule (due to our newest family member) to catch up on some Japanese horror flicks I've been meaning to watch. I only got as far as Ringu and Ju-on: The Grudge. I thought they were brilliant, and will have to seek out more J-Horror. I have not watched the American versions of either, and I've found the inevitable comparisons I read on them to be a very interesting exemplar of cultural differences. Warning: Spoilers follow.

I've always had a love for mythologies, and Japan has a marvelous mythic tradition, so I've read plenty of Japanese mythology. Maybe that's why I don't have a problem with the disjointed nature of J-Horror. The unraveling threads, the lack of moral simplicity, and the indirect methods for the scares are all quite palatable to me. It's not unlike the West African mythic tradition, and in fact, it's a lot like the Greek mythic tradition. The Romans began the process of rationalizing the Greek myths, and since then Western tastes have been for clean story lines and moral certitude.

One preference that seems more specifically American than generally Western is for literal representation, even in horror. I've heard a lot of criticism from Americans of the look of the monsters in "Ju-on: The Grudge": they're plied with white makeup and then blued up, and how is this supposed to be scary when we've seen the twisted monsters of "Evil Dead" and "Night of the Living Dead"? The interesting thing is that, though I love both of those movies, they have always represented comedy as much as horror to me.

Jack of diamonds! Jack of spades! Whhhhyyyyy dooooo yyyyoooouuu distuuuurb my sluuumber

That shit has me rolling, yo. Anyway, complaints about the less obviously grotesque J-Horror demons seem ludicrous to me. Do people really claim that they're more terrified of the fully revealed than of the unformed and unknown? Who cares if Toshio looks like an overgrown blue baby when he clearly establishes himself as the omen of his murderous parents (you know that the big spooks are coming, but you're not sure just how, and that distills a real dread)? Who cares that Sadako is all black, wet hair when her slow, purposeful stagger and the one briefly exposed eye are so thoroughly menacing? I find that in my reaction to J-Horror, the chills come far more from the menace of the characters than from the effects of the characters. I use the word "effects" not only to mean the special effects that bring the characters to life, but also the effects of their actions on people. Sadako stops people's hearts. She doesn't find creative ways to maul their bodies, as is the general formula in American horror. Much of the fright comes from the lack of a clear moral to her vengeance. I've heard people ridicule "Ringu" because they have no idea why Sadako cursed a tape for the general public rather than seeking out the specific ones who wronged her, but that arbitrariness is exactly what is so horrible to me.

When the monsters do appear in J-Horror, it's not with flawless CGI that you're thrilled: it is the mannerisms. Sadako's walk. Kayako's bulging eyes. When Sadako crawls out of the TV, and when Kayako does that awful staircase descent, I found myself more thrilled than in any other horror moment I could remember. I don't carry horror movies into my dreams very often, but these ones left me pretty jumpy for a couple of days.

I do have to watch out for unfair generalization. There have been examples of successful American horror that prefers subtler methods (and is thus to me much scarier). But I think that it's quite typical of Hollywood horror to prefer the explicit and gory to the understated and symbolic.

BTW, it seems to me Ringu draws on the famous Japanese ghost story of Okiku's well. There are several variations on the story, but in the one I have in mind, after Okiku was killed and thrown into the well by the Samurai, she climbed out of the well every night as a yūrei, an avenging spirit, tormenting the Samurai until he went mad, thus exacting her revenge. Of course Sadako is no yūrei, which I've never heard described as poltergeists (to use the Western term). She was telekinetic in life and retained her action-at-a-distance capabilities to a terrifying degree in the afterlife. Tomoko, on the other hand, who inexplicably urges her cousin to watch the same tape that killed her, seems more of a usual yūrei: seen and heard, but not physically threatening.

One way or another, if you want a little different take on the horror genre, give J-Horror a try, and keep your mind open. For my part, I'll have to catch some more. I already have the much-praised A Tale of Two Sisters in the Netflix queue. If you know of other such flicks that I should make a beeline for (I might go for the Japanese Dark Water, next), drop me a line.

[Uche Ogbuji]

via Copia

XSLT 2.0 and push/pull

I just finally got a chance to read Bob DuCharme's article "Push, Pull, Next!", which starts by referring to my "Push vs pull XSLT". It shows how one might use XSLT 2.0's xsl:next-match to stay with push in some instances where pull becomes attractive. This instruction is similar in idea to XSLT 1.0's xsl:apply-imports, except that it doesn't require you to organize templates into separate, imported files. It also supports xsl:with-param, which is also available in XSLT 2.0's version of xsl:apply-imports. Bob wasn't clear enough in his article that XSLT 1.0 also has xsl:apply-imports, but that's clarified in the comments. One important aspect of the use of these instructions in XSLT 2.0 is that xsl:with-param becomes much more useful in the new version, now that default template rules no longer discard parameters. XSLT 2.0 did manage here to squash one of the bigger gotchas in XSLT 1.0.

[Uche Ogbuji]

via Copia