"Tip: Use the Unicode database to find characters for XML documents"

The Unicode consortium is dedicated to maintaining a character set that allows computers to deal with the vast array of human writing systems. When you think of computers that manage such a large and complex data set, you think databases, and this is precisely what the consortium provides for computer access to versions of the Unicode standard. The Unicode Character Database comprises files that present detailed information for each character and class of character. The strong tie between XML and Unicode means this database is very valuable to XML developers and authors. In this article Uche Ogbuji introduces the Unicode Character Database and shows how XML developers can put it to use.

The summary says it all, really.
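The database is also baked into many language runtimes. As a quick sketch (not from the article itself), Python's unicodedata module, which is built from the Unicode Character Database, lets you look up a character by its Unicode name and turn it into an XML numeric character reference:

```python
import unicodedata

def xml_char_ref(name):
    """Turn a Unicode character name into an XML numeric character reference."""
    ch = unicodedata.lookup(name)       # raises KeyError for unknown names
    return "&#x{:X};".format(ord(ch))

print(xml_char_ref("EM DASH"))          # &#x2014;
print(unicodedata.name("\u2260"))       # NOT EQUAL TO
```

Going the other way, unicodedata.name() recovers the database name for any character you run across in a document.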

[Uche Ogbuji]

via Copia

Semantic hairball, y'all

I'm in San Jose and the Semantic Technology Conference 2006 has just wrapped up. A good time, as always, and very well attended (attendance was up even from last year; this is an extraordinarily well organized conference). But I did want to throw out one impression I got from one of the first talks I went to.

The talk discussed an effort in "convergence" of MDA/UML, RDF/OWL, Web Services and Topic Maps. Apparently all the big committees are involved, from OMG, W3C, ISO, etc. Having been an enthusiastic early adopter of the first three technologies, I was violently struck by the casually side-stepped enormousness of this undertaking. In my view, all four projects had promising roots and were all eventually buried under the weight of their own complexity. And yet the convergence effort that's being touted seems little more sophisticated than balling all these behemoths together. I wonder what the purpose is. I can't imagine the result will be greater adoption for these technologies taken together. Many potential users already ignore them because of the barrier of impenetrable mumbo-jumbo. I can't imagine there would be much cross-pollination within these technologies because, without brutal simplification and profiling, model mismatches would make it impractical for an application to efficiently cross the bridge from one semantic modeling technology to another.

I came to this conference to talk about how Microformats might present a slender opportunity for semantic folks to harness the volume of raw material being generated in the Web 2.0 craze. The trade-off is that the Web 2.0 craze produces a huge amount of crap metadata, and someone will have to clean up the mess in the resulting RDF models even if GRDDL is ever deployed widely enough to generate models worth the effort. And let's not even start on the inevitable meltdown of "folksonomies" (I predict formation of a black hole of fundamental crapitational force). I replaced my previous year's talk about how managers of controlled information systems could harness XML schemata for semantic transparency. I think next year I should go back to that. It's quite practical, as I've determined in my consulting experience. I'm not sure hitching information pipelines to Web 2.0 is the least bit practical.

I'm struck by the appearance of two extremes in popular fields of distributed information management (and all you Semantic Technology pooh-pooh-ers would be gob-smacked if you had any idea how deadly seriously Big Business is taking this stuff: it's popular in terms of dollars and cents, even if it's not the gleam in your favorite blogger's eye). On one hand we have the Daedalus committee fastening labyrinth to labyrinth. On the other hand we have the tower of Web 2.0 Babel. We need a mob in the middle to burn 80% of the AI-one-more-time-for-your-mind-magic off of RDF, 80% of the chicago-cluster-consultant-diesel off of MDA, 80% of the toolkit-vendor-flypaper off of Web services. Once the ashes clear, we need folks to build lightweight tools that actually would help with extracting value from distributed information systems without scaring off the non-Ph.D.s. I still think XML is the key, and that XML schema systems should have been addressing semantic transparency from the start, rather than getting tied up in static typing bondage and discipline.

I have no idea whether I can do anything about the cluster-fuck besides ranting, but I'll be squeezing neurons hard until XTech, which does have the eminent advantage of being an in-person meeting of the semantic, XML and Web 2.0 crowds.

Let's dance in Amsterdam, potnas.

[Uche Ogbuji]

via Copia

Closed World Assumptions, Conjunctive Querying, and Oracle 10g

I promised myself I would write at least one entry related to my experience at the 2006 Semantic Technology Conference here in San Jose, which has been an incredibly well attended and organized conference. I've found myself wanting to do less talking and more problem solving lately, but I came across an issue that has generated the appropriate amount of motivation.

For some time I've been (eagerly) monitoring Oracle's recent advances with their latest release (10g R2) which (amongst other things) introduced (in my estimation) what will turn out to be a major step in bridging the gap between the academic dream of the Semantic Web and the reality of the day-to-day problems that are relevant to technologies in that sphere of influence.

But first things first (as the saying goes). Basically, the Oracle 10g R2 RDF implementation supports the logical separation of RDF triples into named Models as well as the ability to query across explicit sets of Models. However, the querying mechanism (implemented as an extension to SQL – SDO_RDF_MATCH) doesn't support the ability to query across the entire fact / knowledge base – i.e., the aggregation of all the named Models contained within.

I like to refer to this kind of a query as a Conjunctive Query. The term isn't mine, but it has stuck, and has made its way into the rdflib Store API. In fact, the rdflib API now has the concept of a Conjunctive Graph which behaves like a named graph with the exception that the query space is the set of all named graphs in the knowledge base.
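To make the distinction concrete, here is a toy sketch in plain Python (not the actual rdflib API; all graph names and triples are invented) of querying one named graph versus the conjunctive union of all of them:

```python
# Toy model: a "knowledge base" as a dict of named graphs of (s, p, o) triples.
graphs = {
    "urn:graph:one": {("sky", "color", "blue")},
    "urn:graph:two": {("grass", "color", "green")},
}

def query_graph(name, pattern):
    """Match an (s, p, o) pattern (None = wildcard) within one named graph."""
    return {t for t in graphs[name]
            if all(p is None or p == v for p, v in zip(pattern, t))}

def conjunctive_query(pattern):
    """Match the pattern across the union of all named graphs."""
    return {t for name in graphs for t in query_graph(name, pattern)}

print(query_graph("urn:graph:one", (None, "color", None)))
print(sorted(conjunctive_query((None, "color", None))))
# The single-graph query sees one triple; the conjunctive query sees both.
```

The conjunctive query is exactly what the current Oracle mechanism does not offer: a query whose space is the whole knowledge base rather than one explicitly named Model.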

Now, it would be an easy nitpick to suggest that since the formal RDF Model doesn't provide any guidance on the separation of RDF triples into addressable graphs, implementors can not be held at fault for deciding not to support such a separation. However, the large body of literature on Named Graphs as well as the support for querying within named sets of graphs in the more contemporary RDF querying languages does suggest that there is real value in separating raw triples this way and in being able to query across these logical separations transparently.

I think the value is twofold: Closed World Assumptions and query performance. Now, the notion of a boundary of known facts will probably raise a red flag amongst semantic web purists, and some may suggest that closed world assumptions cut against the grain of a vision of a massively distributed expert system. For the uninitiated, open world assumptions are where the absence of an assertion in your fact base does not necessarily suggest that the assertion (or statement) isn't true. That is, if the statement 'the sky is blue' is not in the knowledge base, you cannot automatically assume that the sky is not blue.
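The difference fits in a few lines of Python (a toy formulation, of course, with an invented fact base):

```python
facts = {("sky", "is", "blue")}   # everything our knowledge base asserts

def holds_closed_world(stmt):
    # Closed world: absence from the fact base means false.
    return stmt in facts

def holds_open_world(stmt):
    # Open world: absence means unknown, not false.
    return True if stmt in facts else None

print(holds_closed_world(("sky", "is", "green")))   # False
print(holds_open_world(("sky", "is", "green")))     # None (unknown)
```

Under the open world reading, the best a query processor can honestly say about a missing statement is "I don't know."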

This limitation makes sense where the emphasis is on the distribution of data (a key component of the semantic web vision); however, it essentially abandons the value in applying the formal semantics of RDF (and knowledge representation in general) to closed systems – systems where the data is complete to a certain extent and it makes sense for it to live in a query silo.

The most practical example I can think of is the one I work with daily: medical research data that is often subjected to statistical analysis for deducing trends. You can't make suggestions derived from statistical trends in your data if you don't have some minimal confidence that the set of data you are working with is 'complete' enough to answer the questions you set out to ask.

Closed world assumptions also open the door to other heuristic optimizations that are directly relevant to query processors.

Finally, where RDF databases are built on top of SQL stores, being able to partition your query space by an additional indexable constraint (I say additional, because there are other partitioning techniques that impact scalability and response) makes a world of difference in a framework that has already been rigorously tuned to take maximal advantage of such rectangular partitioning. To a SQL store implementing an RDF model, the name (or names) of a graph is a low-cardinality, indexable constraint (there will always be fewer graphs than total triples) that can make a difference of several orders of magnitude in overall query response time.

Named contexts lend themselves quite well to two-part queries, where the first part identifies a set of named graphs (within a conjunctive graph or known universe) that match certain criteria, and the second part queries only within those matching graphs. Once the query resolver has identified the named graphs, the second part of the query can be dispatched in a very targeted fashion. Any RDF knowledge base that takes advantage of the logical separation that named graphs provide will inevitably find itself being asked such questions.
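Sketched in toy Python (graph names, predicates and triples all invented for illustration), such a two-part query looks like this:

```python
# Toy named graphs standing in for an RDF store's contexts.
graphs = {
    "urn:g:patients": {("p1", "rdf:type", "Patient"), ("p1", "age", "42")},
    "urn:g:labs":     {("l1", "rdf:type", "LabResult")},
}

def match(triples, pattern):
    """Match an (s, p, o) pattern against a triple set; None is a wildcard."""
    return {t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))}

# Part one: identify the named graphs satisfying a criterion.
chosen = [name for name, triples in graphs.items()
          if match(triples, (None, "rdf:type", "Patient"))]

# Part two: dispatch the real query only within those graphs.
results = {t for name in chosen for t in match(graphs[name], (None, "age", None))}

print(chosen)    # ['urn:g:patients']
print(results)   # {('p1', 'age', '42')}
```

The second phase never touches graphs eliminated in the first, which is exactly where the low-cardinality graph-name index pays off.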

Now I've said all this not to berate the current incarnation of Oracle's RDF solution but to take the opportunity to underline the value in a perspective that is often shoved aside by the vigor of semantic web evangelism. To be fair, the inability to dispatch conjunctive queries is pretty much the only criticism of the Oracle 10g R2 RDF model. I've been aware of it for some time, but didn't want to speak to the point directly until it was 'public knowledge.'

The Oracle 10g R2 RDF implementation demonstrates amongst other things:

  • Namespace management
  • Interned identifiers
  • Reification
  • Collections / Containers
  • Forward-chained rule firing (with a default ruleset for RDFS entailment)
  • Off-the-chart volume capability (0.5-5 second response time on 80 million triples - impressive regardless of the circumstances of the benchmark)
  • Native query format (SPARQL-ish SDO_RDF_MATCH function)

You can check out the DBA manual for the infrastructure and the uniprot benchmarks for performance.

I've too long been frustrated by the inability of 'industry leaders' to put their money where their mouth is when it comes to adoption of 'unproved' technologies. Too often, the biggest impediment to progress from the academic realm to the 'enterprise' realm is politics. People who simply want to solve difficult problems with intelligent and appropriate technologies have their work cut out for them against the inevitable collisions with politics and technological camp warfare (you say microformats, I say architectural forms, you say SOA, I say REST). So for that reason, it makes me somewhat optimistic that a company that truly has everything to lose in doing so decided to make such a remarkable first step. Their recent purchase of Berkeley DB XML (the most widely supported open-source Native XML datastore) is yet another example of a bold step towards ubiquitous semi-structured persistence. But please, top it off with support for conjunctive queries.

[Uche Ogbuji]

via Copia

Four Mozilla/XML bugs to vote on (or to help with)

In a recent conversation with colleagues some of the limitations of XML processing in Mozilla came up. I think some of these are really holding Mozilla and Firefox back from being a great platform for XML processing, and so I wanted to highlight them here. Remember that the key to bringing attention to an important bug/request is to vote for it in the tracker, so please consider doing so. I already have done.

18333: "XML Content Sink should be incremental". The description says it all:

Large XML documents, such as the W3C's XSLT spec, take an incredibly long time to load into view source. The browser freezes/blocks (is "not responding" according to Windows) while it processes, and finally unlocks after the entire source of the document is ready for display.

Firefox will never really be a friendly platform for XML processing until this is addressed. There is no real problem addressing this using Mozilla's underlying parser, Expat. Worst case, one could use that parser's suspend/resume facility (we recently took advantage of this to allow Python-generator-based access to 4Suite Saxlette parsing). The real issue is the amount of work that would need to be done across the Mozilla code base. Unfortunately, Mozilla insiders have been predicting a fix for this problem for a while, and unless there's a sudden boost in votes, or better yet resources to help fix the problem, I'm not feeling very optimistic.
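For what it's worth, Expat's chunked, incremental mode is trivially accessible from Python's binding; the browser-side difficulty is plumbing, not the parser. A minimal illustration:

```python
import xml.parsers.expat

# Incremental (chunked) parsing with Expat via Python's binding:
# feed the document in pieces instead of blocking on the whole thing.
elements = []
p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = lambda name, attrs: elements.append(name)

chunks = ["<root><chi", "ld/></ro", "ot>"]   # arbitrary split points
for chunk in chunks:
    p.Parse(chunk, False)    # not final: parser keeps state between calls
p.Parse("", True)            # signal end of document

print(elements)              # ['root', 'child']
```

The parser happily carries state across chunk boundaries, even mid-tag, so an incremental content sink could hand off events as each network buffer arrives.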

69799: "External entities are not included in XML document". Using Betty Harvey's example,

<!ENTITY extFile SYSTEM "extFile.xml">

A document referencing this entity is rendered as if Mozilla had read nothing in its place: the external file's content is simply not included.

Of course you have to watch out for XSS type attacks, but I imagine Mozilla could handle this the same way it does loaded stylesheets: by restricting to same host domain as the document entity.

193678: "support exslt:common". The node-set extension function is pretty much required for complex XSLT processing, so support from Mozilla would really help open up the landscape of what you can do with XSLT in the browser.

98413: "Implement XML Catalogs". A request to implement OASIS Open XML Catalogs. This could do a lot to encourage support for external entities because some performance problems could be reduced by using a catalog to load a local version of the resource.

A few on my personal would-be-nice-to-fix-but-not-essential list are:

[Uche Ogbuji]

via Copia

Business-grade broadband in Superior, Colorado

Looking into upgrading my home broadband to business grade, I've cast about for options. Comcast is my current provider for digital cable plus cable broadband, and I've been happy with them, so I called to ask what they could do for me.

Their quote for Comcast Workplace Standard was $110 per month for 6.6Mb down and 768Kb up with 5 static IPs, with a promo waiving the installation fee and taking $20 per month off for the first 12 months (requires a 24-month commitment).

Sounded good, and it seems it is. I went hunting around sites such as DSLBroker to find business broadband options for Superior, Colorado, where I live. I found nothing but plans on the order of $200/mo for 1.1Mb symmetric; overall, I didn't find anything that even came close to the value Comcast was offering. Does anyone know whether I'm just missing something? Do you have experience with business-grade broadband at home?

[Uche Ogbuji]

via Copia

The Loss of a Great Author - Octavia Butler

I just found out earlier today that Octavia Butler died at the age of 58. I really cannot fully express how her body of work has affected me on a very personal level. CNN has a short entry about her passing that is probably appropriate for anyone who has never read any of her work - especially in regard to why she was more than just yet another science fiction author.

For me, the thing about her work that impressed me (more than anything else) was her ability to blend standard science fiction components (futuristic settings, supernatural characters and scenarios, underlying science themes, etc.) with incredibly unique, believable, and powerful characters. Not just strong characters, but characters that found themselves in situations that posed very profound questions about issues of race, gender, culture, and the human condition (with a heavy emphasis on the first two).

I had always felt after reading her literature that the issues were addressed and presented much more effectively (in the science fiction context) than if they were works of non-fiction, biographical literature, essays, or other more 'formal' literary forms.

The most poignant of her characters were black women and she had a way of writing black female characters with such authenticity that it often left me expecting the same authenticity from authors in the same genre (or in general). It was this authenticity (that I couldn't find anywhere else) that kept me coming back to her work, hungry for more glimpses of her vivid, anthropological tapestry.

Character development, especially in stories where the plot is more creative than the norm, is so easily overlooked and underappreciated. For my money, there were few authors who could create characters that stuck with you long after the plots (through which they developed) were lost and displaced by recent readings. Octavia Butler was one of them.

My favorite of her books were:

  • Wild Seed (from the trilogy: Wild Seed, Mind of My Mind, and Patternmaster)
  • Parable of the Sower (from the series: Parable of the Sower and Parable of the Talents)

I had (wistfully) hoped to meet her someday and ask her (amongst a long list of other, related questions) whether she had come to conclusions / resolutions of her own regarding the depravity of the human condition that she so eloquently described in her books, but I'll never get that opportunity.

From one of your most enthusiastic fans: Rest In Peace, Miss Butler. You would have made Lao Tse (and his contemporaries - in their time) proud.

Chimezie Ogbuji

via Copia

Schematron creeping on the come-up (again)

Schematron is the boss XML schema language your boss has never heard of. Unfortunately, it's had some slow times of late, but it's surged back with a vengeance thanks to honcho Rick Jelliffe, with logistical support from Betty Harvey. There's now a working mailing list and a Wiki. Rick says that Schematron is slated to become an ISO Standard next month.

The text for the Final Draft International Standard for Schematron has now been approved by multi-national voting. It is copyright ISO, but it is basically identical to the draft at www.schematron.com

The standard is 30 pages. 21 are normative, including schema listings and a characterization of Schematron semantics in predicate logic. Appendixes show how to use other query language bindings (than XSLT1), how to use Schematron as a vocabulary, how to express multi-lingual diagnostics, and a simple description of the design requirements for ISO Schematron.

Congrats to Rick. Here's to the most important schema language of them all (yes, I do mean that). I guess I'll have to check Scimitar, Amara's fast Schematron processor, for compliance with the updated draft standard.

[Uche Ogbuji]

via Copia


Interesting (as always) musings from Rick Jelliffe:

There has been good work on the theoretical classification of schema languages over the last two or three years.

My impression is that as soon as your schema language supports IDREF it is stuffed, from an NP POV: Schematron, XSD, RELAX NG, DTDs, the lot!

Theoretical classification is important for knowing what the characteristics of things are, and what pathological problems implementations may have to deal with. But people who reject one language or another on theoretical grounds alone, without considering their pragmatic value, need to have an alternative; otherwise they are troll-like.

The emphasis is mine, marking the bit that caught my attention. So my guess is that in his theory IDREF==NPC (NP-complete). Interesting. I haven't done any formal analysis, but when I ponder it, I can imagine ID reference checking as similar to problems I know to be in P. I can't really think offhand of a similar, known NPC problem, but that doesn't mean much. It can be fairly subtle factors of a problem that establish its NP profile.
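For instance, plain ID/IDREF integrity checking is linear: one pass to collect the IDs, one pass to verify the references. A sketch, with invented attribute names ('id', 'idref') standing in for DTD-declared ID and IDREF attribute types:

```python
import xml.etree.ElementTree as ET

# ID/IDREF integrity in two linear passes: collect all id attribute
# values, then flag every idref that points at no collected id.
doc = ET.fromstring(
    '<root><a id="x"/><b idref="x"/><c idref="y"/></root>')

ids = {e.get("id") for e in doc.iter() if e.get("id") is not None}
dangling = [e.get("idref") for e in doc.iter()
            if e.get("idref") is not None and e.get("idref") not in ids]

print(dangling)   # ['y'] -- the one reference with no matching id
```

Whatever pushes IDREF-capable languages into NP territory, it isn't this basic integrity check itself; presumably it's the interaction with the rest of the validation problem.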

It is worth noting, as Eric van der Vlist once had to remind me, that ID/IDREF integrity support is not mandated in RELAX NG validators. Another example of the far-sightedness of Mr. Clark, Murata-san and co? From "The Design of RELAX NG":

The RELAX NG TC spent a considerable amount of time considering what support RELAX NG should provide for enforcing identity (uniqueness and cross-reference) constraints. In the end, the conclusion was that identity constraints were better separated out into a separate specification. Accordingly, RELAX NG itself provides no support for identity constraints. RELAX NG DTD Compatibility [12] provides support for traditional XML ID/IDREF attributes.

There were a number of reasons for preferring separation. One reason is the relative difference in maturity. RELAX NG is based on finite tree automata; this is an area of computer science that has been studied for many years and is accordingly mature and well understood. The use of grammars for specifying document structures is based on more than 15 years of practical experience. By contrast, the area of identity constraints (beyond simple ID/IDREF constraints) is much less mature and is still the subject of active research.

Another reason is that it is often desirable to perform grammar processing separately from identity constraint processing. For example, it may be known that a particular document is valid with respect to a grammar but not known that it satisfies identity constraints. The type system of the language that was used to generate a document may well be able to guarantee that it is valid with respect to the grammar; it is unlikely that it will be able to guarantee that it satisfies the identity constraints. A document assembled from a number of components may be guaranteed to be valid with respect to a grammar because of the validity of the components, but this will often not be the case with identity constraints. Even when a document is known to satisfy the identity constraints as well as be valid with respect to the grammar, it may be necessary to perform identity constraint processing in order to allow application programs to follow references.

Another reason is that no single identity constraint language is suitable for all applications. Different applications have identity constraints of vastly different complexity. Some applications have complex constraints that span multiple documents [22]. Other applications need only a modest increment on the XML ID/IDREF mechanism. A solution that is sufficient for those applications with complex requirements is likely to be overkill for those applications with simpler requirements.

Well reasoned, like almost everything in RELAX NG (and Schematron).

[Uche Ogbuji]

via Copia

"Tip: Use data URIs to include media in XML"

There are many ways to link to non-XML content within XML, including binary content. Sometimes you need to roll all such external content directly into the XML. Data scheme URIs are one way to specify a full resource within a URI, which you can then use in XML constructs. In this tip, Uche Ogbuji shows how to use this to bundle related media into a single file.

I also touch a bit on unparsed entities and notations in this brief article.
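A data URI is just the media type plus (typically) the base64 encoding of the bytes, so any language can mint one. A quick Python sketch (the payload and element name are invented for illustration):

```python
import base64

payload = b"hello, world"    # stand-in for real media bytes (image, audio, etc.)
uri = "data:text/plain;base64," + base64.b64encode(payload).decode("ascii")

print(uri)   # data:text/plain;base64,aGVsbG8sIHdvcmxk

# The result can then go into an XML attribute like any other URI:
xml = '<media src="%s"/>' % uri
```

The obvious trade-off is size: base64 inflates the payload by about a third, which is the price of rolling external content into a single file.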

Side note: Of course URLs are a subset of URIs, but I did want to mention that I prefer to use the term "URI" for the data scheme because it feels to me much more like an identifier-by-value than a locator. (I suppose it could be considered a trivial locator.)

[Uche Ogbuji]

via Copia

Mystery of Google index drop solved?

Update. Corrected Christian's surname. Sorry, man.

A while ago I complained that uche.ogbuji.net disappeared from Google search results soon after I went to a CherryPy-based server. I'm back to say that I'm a goof, but I hope that admitting my silly error might save someone else some head-scratching (maybe even this gentleman).

I'm at least not alone in my error. The clue came from this message by the very smart Christian Wyglendowski. In my case I was getting 404s for most things, but I did have a bug that was causing a 500 error on requests to robots.txt. Apparently the Google bot shuns sites with that problem. I can understand that, but it's interesting that Yahoo doesn't seem to do the same thing, since my ranking didn't drop much there. I fixed the bug and then submitted a reinclusion request to Google, following the suggestions in this article (I guess SEO advice isn't a completely parasitic endeavor). The body of my message was as follows:

I had a bug causing 500 error on robots.txt request, and I think that's why I got dropped from your index. I've fixed that bug, and would like to request reinclusion to your index. Thanks.

We'll see if that does the trick.
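If you want to guard against a regression like this, it's easy to smoke-test robots.txt. Here's a self-contained sketch against a throwaway local server (a stand-in for your real site; the point is simply asserting a 200 rather than a 500):

```python
import http.server
import threading
import urllib.request

# Minimal sanity check: serve a robots.txt and confirm it answers 200,
# not a server error -- the failure mode described above.
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            body = b"User-agent: *\nDisallow:\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):   # keep the check quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/robots.txt" % server.server_port
status = urllib.request.urlopen(url).status
print(status)   # 200
server.shutdown()
```

Pointing the same two-line fetch at your live site's robots.txt after every deployment would have caught my bug immediately.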

[Uche Ogbuji]

via Copia