Mi...cro...for...mats...sis...boom...BLAH!

Mike Linksvayer had a nice comment on my recent talk at the Semantic Technology Conference.

I think Uche Ogbuji's Microformats: Partial Bridge from XML to the Semantic Web is the first talk I've heard on the topic from a non-cheerleader, and it was a pretty good introduction to the upsides and downsides of microformats and how one can leverage microformats for officious Semantic Web purposes. My opinion is that the value in microformats hype is in encouraging people to take advantage of XHTML semantics in however conventional and non-rigorous a fashion they may. It is a pipe dream to think that most pages containing microformats will include the correct profile references to allow a spec-following crawler to extract much useful data via GRDDL. Add some convention-following heuristics and a crawler may get lots of interesting data from microformatted pages. The big search engines are great at tolerating ambiguity and non-conformance, as they must.

Yeah, I'm no cheerleader (or even follower) for Microformats. Certainly I've been skeptical of Microformats here on Copia (1, 2, 3). I think the problem with Microformats is that their value is tied very closely to hype. As long as they're a hot technology they can be a useful technology. I do think, however, that they have very little intrinsic technological value. I guess one could say this about many technologies, but Microformats perhaps annoy me a bit more because, given XML as a base, we could do so much better.

Mike is also right to be skeptical that GRDDL will succeed if, as it presently does, it relies on people putting profile information into Web documents that use Microformats.

My experience at the conference, some very trenchant questions from the audience, a very interesting talk by Ben Adida right after my own, and other matters have got me thinking a lot about Microformats and what those of us whose information consolidation goals are more ambitious might be able to make of them. Watch this space. More to come.

[Uche Ogbuji]

via Copia

"XML in Firefox 1.5, Part 2: Basic XML processing"

"XML in Firefox 1.5, Part 2: Basic XML processing"

Subtitle: Do a lot with XML in Firefox, but watch out for some basic limitations

Synopsis: This second article in the series, "XML in Firefox 1.5," focuses on basic XML processing. Firefox supports XML parsing, Cascading Style Sheets (CSS), and XSLT stylesheets. You also want to be aware of some limitations. In the first article of this series, "XML in Firefox 1.5, Part 1: Overview of XML features," Uche Ogbuji looked briefly at the different XML-related facilities in Firefox.

I also updated part 1 to reflect the Firefox 1.5 final release.

This article is written at an introductory level. The next articles in the series will be more technically in-depth, as I move from plain old generic XML to fancy stuff such as SVG and E4X.

[Uche Ogbuji]

via Copia

"Tip: Use the Unicode database to find characters for XML documents"

"Tip: Use the Unicode database to find characters for XML documents"

The Unicode consortium is dedicated to maintaining a character set that allows computers to deal with the vast array of human writing systems. When you think of computers that manage such a large and complex data set, you think databases, and this is precisely what the consortium provides for computer access to versions of the Unicode standard. The Unicode Character Database comprises files that present detailed information for each character and class of character. The strong tie between XML and Unicode means this database is very valuable to XML developers and authors. In this article Uche Ogbuji introduces the Unicode Character Database and shows how XML developers can put it to use.

The summary says it all, really.
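As a quick illustration of putting the database to work: Python's standard unicodedata module exposes a slice of the same data. A minimal sketch, where BULLET is just an arbitrary character name chosen for the example:

import unicodedata

# Find a character by its Unicode name...
ch = unicodedata.lookup('BULLET')

# ...or recover the name from a character
print(unicodedata.name(ch))     # BULLET

# Emit a numeric character reference usable in any XML document
print('&#x%04X;' % ord(ch))     # &#x2022;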

[Uche Ogbuji]

via Copia

Semantic hairball, y'all

I'm in San Jose, and the Semantic Technology Conference 2006 has just wrapped up. A good time, as always, and very well attended (way up from even last year; this is an extraordinarily well-organized conference). But I did want to throw out one impression I got from one of the first talks I went to.

The talk discussed an effort at "convergence" of MDA/UML, RDF/OWL, Web services and Topic Maps. Apparently all the big committees are involved: OMG, W3C, ISO, etc. Having been an enthusiastic early adopter of the first three technologies, I was violently struck by the casually side-stepped enormousness of this undertaking. In my view, all four projects had promising roots and were all eventually buried under the weight of their own complexity. And yet the convergence effort being touted seems little more sophisticated than balling all these behemoths together. I wonder what the purpose is. I can't imagine the result will be greater adoption of these technologies taken together; many potential users already ignore them because of the barrier of impenetrable mumbo-jumbo. Nor can I imagine much cross-pollination among these technologies, because without brutal simplification and profiling, model mismatches would make it impractical for an application to efficiently cross the bridge from one semantic modeling technology to another.

I came to this conference to talk about how Microformats might present a slender opportunity for semantic folks to harness the volume of raw material being generated in the Web 2.0 craze. The trade-off is that the Web 2.0 craze produces a huge amount of crap metadata, and someone will have to clean up the mess in the resulting RDF models even if GRDDL is ever deployed widely enough to generate models worth the effort. And let's not even start on the inevitable meltdown of "folksonomies" (I predict formation of a black hole of fundamental crapitational force). This talk replaced my previous year's, about how managers of controlled information systems can harness XML schemata for semantic transparency. I think next year I should go back to that. It's quite practical, as I've determined in my consulting experience. I'm not sure hitching information pipelines to Web 2.0 is the least bit practical.

I'm struck by the appearance of two extremes in popular fields of distributed information management (and all you Semantic Technology pooh-pooh-ers would be gob-smacked if you had any idea how deadly seriously Big Business is taking this stuff: it's popular in terms of dollars and cents, even if it's not the gleam in your favorite blogger's eye). On one hand we have the Daedalos committee fastening labyrinth to labyrinth. On the other hand we have the tower of Web 2.0 Babel. We need a mob in the middle to burn 80% of the AI-one-more-time-for-your-mind-magic off of RDF, 80% of the chicago-cluster-consultant-diesel off of MDA, 80% of the toolkit-vendor-flypaper off of Web services. Once the ashes clear, we need folks to build lightweight tools that actually would help with extracting value from distributed information systems without scaring off the non-Ph.D.s. I still think XML is the key, and that XML schema systems should have been addressing semantic transparency from the start, rather than getting tied up in static typing bondage and discipline.

I have no idea whether I can do anything about the cluster-fuck besides ranting, but I'll be squeezing neurons hard until XTech, which does have the eminent advantage of being an in-person meeting of the semantic, XML and Web 2.0 crowds.

Let's dance in Amsterdam, potnas.


[Uche Ogbuji]

via Copia

Four Mozilla/XML bugs to vote on (or to help with)

In a recent conversation with colleagues, some of the limitations of XML processing in Mozilla came up. I think some of these are really holding Mozilla and Firefox back from being a great platform for XML processing, so I wanted to highlight them here. Remember that the key to bringing attention to an important bug/request is to vote for it in the tracker, so please consider doing so. I already have.

18333: "XML Content Sink should be incremental". The description says it all:

Large XML documents, such as the W3C's XSLT spec, take an incredibly long time to load into view source. The browser freezes/blocks (is "not responding" according to Windows) while it processes, and finally unlocks after the entire source of the document is ready for display.

Firefox will never really be a friendly platform for XML processing until this is addressed. There is not really a problem addressing this in Mozilla's underlying parser, Expat. Worst case, one could use that parser's suspend/resume facility (we recently took advantage of this to allow Python-generator-based access to 4Suite Saxlette parsing). The real issue is the amount of work that would need to be done across the Mozilla code base. Unfortunately, Mozilla insiders have been predicting a fix for this problem for a while, and unless there's a sudden boost in votes, or better yet resources to help fix the problem, I'm not feeling very optimistic.
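To make the incremental style concrete, here's a minimal sketch using Python's standard Expat binding rather than anything Mozilla-specific; the file name and handler are invented for illustration:

import xml.parsers.expat

def handle_start(name, attrs):
    # Fires as soon as each start tag is parsed, long before the
    # rest of the document has arrived
    print('start: %s' % name)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = handle_start

f = open('huge.xml', 'rb')          # hypothetical large document
while True:
    chunk = f.read(16384)
    parser.Parse(chunk, not chunk)  # is_final=True on the empty read
    if not chunk:
        break
f.close()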

69799: "External entities are not included in XML document". Using Betty Harvey's example,

<!DOCTYPE myXML[
<!ENTITY extFile SYSTEM "extFile.xml">
]>
<myXML>&extFile;</myXML>

is rendered as if Mozilla read

<myXML></myXML>

Of course you have to watch out for XSS-type attacks, but I imagine Mozilla could handle this the same way it does loaded stylesheets: by restricting external entities to the same host domain as the document entity.
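To sketch that policy (in Python's xml.sax rather than Mozilla code, purely as an illustration; the resolver class and URLs are invented):

import xml.sax, xml.sax.handler, xml.sax.xmlreader
from StringIO import StringIO
from urlparse import urlparse

class SameHostResolver(xml.sax.handler.EntityResolver):
    # Load external entities only from the host that served the
    # document entity; anything else resolves to an empty stream.
    def __init__(self, doc_uri):
        self.host = urlparse(doc_uri)[1]
    def resolveEntity(self, publicId, systemId):
        if urlparse(systemId)[1] in ('', self.host):
            return systemId  # relative or same-host: allow it
        empty = xml.sax.xmlreader.InputSource(systemId)
        empty.setByteStream(StringIO(''))
        return empty

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_external_ges, True)
parser.setEntityResolver(SameHostResolver('http://example.org/doc.xml'))
parser.parse('http://example.org/doc.xml')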

193678: "support exslt:common". The node-set extension function is pretty much required for complex XSLT processing, so support from Mozilla would really help open up the landscape of what you can do with XSLT in the browser.
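For a taste of what's missing, here's the sort of stylesheet in question, run through lxml (which gets EXSLT from libxslt) since Mozilla can't yet oblige; the stylesheet itself is a contrived example:

from lxml import etree

XSLT = '''<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl">
  <xsl:variable name="rtf">
    <item>one</item><item>two</item>
  </xsl:variable>
  <xsl:template match="/">
    <!-- node-set() turns the result tree fragment into a
         node-set that XPath can actually operate on -->
    <count><xsl:value-of select="count(exsl:node-set($rtf)/item)"/></count>
  </xsl:template>
</xsl:stylesheet>'''

transform = etree.XSLT(etree.XML(XSLT))
print(transform(etree.XML('<doc/>')))  # result document is <count>2</count>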

98413: "Implement XML Catalogs". A request to implement OASIS Open XML Catalogs. This could do a lot to encourage support for external entities because some performance problems could be reduced by using a catalog to load a local version of the resource.
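As a sketch of the payoff outside Mozilla: libxml2 (used here via lxml) already honors OASIS catalogs through the XML_CATALOG_FILES environment variable. Every path and URL below is invented for illustration:

import os

CATALOG = '''<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Serve a local copy rather than fetching over the network -->
  <system systemId="http://www.example.org/dtds/doc.dtd"
          uri="file:///usr/local/share/dtds/doc.dtd"/>
</catalog>'''

open('catalog.xml', 'w').write(CATALOG)
# Set this before libxml2 initializes its catalog machinery
os.environ['XML_CATALOG_FILES'] = os.path.abspath('catalog.xml')

from lxml import etree
# load_dtd makes the parser fetch the DOCTYPE's DTD; with the
# catalog in place, the local copy is used instead of the network
doc = etree.parse('doc.xml', etree.XMLParser(load_dtd=True))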


[Uche Ogbuji]

via Copia

Schematron creeping on the come-up (again)

Schematron is the boss XML schema language your boss has never heard of. Unfortunately it's had some slow times of late, but it's surged back with a vengeance thanks to honcho Rick Jelliffe, with logistical support from Betty Harvey. There's now a working mailing list and a Wiki. Rick says that Schematron is slated to become an ISO Standard next month.

The text for the Final Draft International Standard for Schematron has now been approved by multi-national voting. It is copyright ISO, but it is basically identical to the draft at www.schematron.com.

The standard is 30 pages. 21 are normative, including schema listings and a characterization of Schematron semantics in predicate logic. Appendixes show how to use other query language bindings (than XSLT1), how to use Schematron as a vocabulary, how to express multi-lingual diagnostics, and a simple description of the design requirements for ISO Schematron.

Congrats to Rick. Here's to the most important schema language of them all (yes, I do mean that). I guess I'll have to check Scimitar, Amara's fast Schematron processor, for compliance with the updated draft standard.
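For anyone who hasn't seen Schematron at all, a small taste of why I rate it so highly; this sketch validates with lxml's ISO Schematron support (Scimitar would be the 4Suite-flavored route), and the schema and document are contrived:

from lxml import etree, isoschematron

SCHEMA = etree.XML('''<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="invoice">
      <!-- Schematron rules assert arbitrary XPath conditions -->
      <assert test="@total = sum(item/@price)">
        Invoice total must equal the sum of the item prices.
      </assert>
    </rule>
  </pattern>
</schema>''')

schematron = isoschematron.Schematron(SCHEMA)
doc = etree.XML('<invoice total="3"><item price="1"/><item price="2"/></invoice>')
print(schematron.validate(doc))  # True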

[Uche Ogbuji]

via Copia

IDREF=NPC?

Interesting (as always) musings from Rick Jelliffe:

There has been good work on the theoretical classification of schema languages over the last two or three years.

My impression is that as soon as your schema language supports IDREF it is stuffed, from an NP POV: Schematron, XSD, RELAX NG, DTDs, the lot!

Theoretical classification is important for knowing what the characteristics of things are, and what pathological problems implementations may have to deal with. But people who reject one language or another on theoretical grounds alone, without considering their pragmatic value, need to have an alternative; otherwise they are troll-like.

The emphasis is mine, marking the bit that caught my attention. So my guess is that in his theory IDREF==NPC (NP-complete). Interesting. I haven't done any formal analysis, but when I ponder it, I can imagine ID reference checking as similar to problems I know to be in P. I can't really think, off the top of my head, of a similar, known NP-complete problem, but that doesn't mean much. It can be fairly subtle factors of a problem that establish its NP profile.
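To make that intuition concrete, bare ID/IDREF integrity checking certainly looks linear: one pass to collect IDs, one pass to check references. A toy sketch, assuming the ID-typed and IDREF-typed values have already been pulled out of a document:

def check_id_integrity(ids, idrefs):
    # IDs must be unique...
    seen = set(ids)
    if len(seen) != len(ids):
        return False
    # ...and every IDREF must point at a declared ID
    for ref in idrefs:
        if ref not in seen:
            return False
    return True

print(check_id_integrity(['a', 'b'], ['b', 'a', 'a']))  # True
print(check_id_integrity(['a'], ['b']))                 # False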

It is worth noting, as Eric van der Vlist once had to remind me, that ID/IDREF integrity support is not mandated in RELAX NG validators. Another example of the far-sightedness of Mr. Clark, Murata-san and co? From "The Design of RELAX NG":

The RELAX NG TC spent a considerable amount of time considering what support RELAX NG should provide for enforcing identity (uniqueness and cross-reference) constraints. In the end, the conclusion was that identity constraints were better separated out into a separate specification. Accordingly, RELAX NG itself provides no support for identity constraints. RELAX NG DTD Compatibility [12] provides support for traditional XML ID/IDREF attributes.

There were a number of reasons for preferring separation. One reason is the relative difference in maturity. RELAX NG is based on finite tree automata; this is an area of computer science that has been studied for many years and is accordingly mature and well understood. The use of grammars for specifying document structures is based on more than 15 years of practical experience. By contrast, the area of identity constraints (beyond simple ID/IDREF constraints) is much less mature and is still the subject of active research.

Another reason is that it is often desirable to perform grammar processing separately from identity constraint processing. For example, it may be known that a particular document is valid with respect to a grammar but not known that it satisfies identity constraints. The type system of the language that was used to generate a document may well be able to guarantee that it is valid with respect to the grammar; it is unlikely that it will be able to guarantee that it satisfies the identity constraints. A document assembled from a number of components may be guaranteed to be valid with respect to a grammar because of the validity of the components, but this will often not be the case with identity constraints. Even when a document is known to satisfy the identity constraints as well as be valid with respect to the grammar, it may be necessary to perform identity constraint processing in order to allow application programs to follow references.

Another reason is that no single identity constraint language is suitable for all applications. Different applications have identity constraints of vastly different complexity. Some applications have complex constraints that span multiple documents [22]. Other applications need only a modest increment on the XML ID/IDREF mechanism. A solution that is sufficient for those applications with complex requirements is likely to be overkill for those applications with simpler requirements.

Well reasoned, like almost everything in RELAX NG (and Schematron).

[Uche Ogbuji]

via Copia

"Tip: Use data URIs to include media in XML"

"Tip: Use data URIs to include media in XML"

There are many ways to link to non-XML content within XML, including binary content. Sometimes you need to roll all such external content directly into the XML. Data scheme URIs are one way to specify a full resource within a URI, which you can then use in XML constructs. In this tip, Uche Ogbuji shows how to use this to bundle related media into a single file.

I also touch a bit on unparsed entities and notations in this brief article.

Side note: Of course URLs are a subset of URIs, but I did want to mention that I prefer the term "URI" for the data scheme because it feels to me much more like an identifier-by-value than a locator. (I suppose it could be considered a trivial locator.)
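The mechanics fit in a few lines of Python; the image path here is made up, and the img element is just one place such a URI could land:

import base64

png = open('logo.png', 'rb').read()  # hypothetical image file
uri = 'data:image/png;base64,' + base64.b64encode(png).decode('ascii')

# The entire image now travels inside the XML document itself
doc = '<div xmlns="http://www.w3.org/1999/xhtml"><img src="%s" alt="logo"/></div>' % uri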

[Uche Ogbuji]

via Copia

Small fix to atom.rnc, and what about xml:space?

RobertBachmann stopped by #atom to mention that he'd tried to validate an Atom file against the non-normative RELAX NG schema for the Atom RFC draft (I haven't seen an RNC for the final RFC itself). It failed because he used xml:lang on an atom:name child of atom:author. This contradicts the Atom spec, which says:

Any element defined by this specification MAY have an xml:lang attribute, whose content indicates the natural language for the element and its descendents.

The RNC did not specify this attribute in a couple of cases. The RNC is non-normative, but in this case there is no reason for divergence from the spec. I whipped up an atom.rnc that fixes the bug. Here's the diff from the version I found on-line.

This did set up a discussion between Anne van Kesteren and me. I feel that xml:lang only makes sense for some Atom elements, and that perhaps allowing it on all of them could be confusing. What, for example, does it mean to have xml:lang on the atom:uri child of atom:author? I suppose an outlandish (pun intended) interpretation could be references to localized sites, but that's really the province of the likes of XHTML's hreflang attribute. Moreover, I'm a bit puzzled given the bit from the Atom spec that seems to support my leaning:

The language context is only significant for elements and attributes declared to be "Language-Sensitive" by this specification.

So if it's not significant, why allow it? I think maybe there should have been a split in attribute sets between atomCommonAttributes and an atomCommonLanguageSensitiveAttributes, where the former would omit xml:lang.

Also, I'm used to the convention where xml:lang is used with content models that allow a language-sensitive element to be repeated, providing for multiple language versions in the same document. There are many cases in Atom where this would not be possible. For example, you could not have an English atom:title and a French one within the same atom:entry element. You could get tricky by using a single atom:entry with type="xhtml" and multiple language versions within the xhtml:div, but this feels a bit constricting.

Anne doesn't mind xml:lang everywhere, and pointed out that xml:lang="" is an option for specifying no language context (rather than language context inherited from parent). I think in the end I could go either way on xml:lang everywhere.

This discussion also made me think of xml:space. This special attribute may get its mention right in the XML spec, but that doesn't mean it doesn't have to be addressed in XML applications. Even in the case of DTDs, the spec says

In valid documents, this attribute, like any other, must be declared if it is used.

The same goes for RELAX NG, the conventional schema language for Atom. There is no xml:space to be found in either the normative RFC or the non-normative schema, but the rules for Atom's undefinedAttribute do allow for this attribute (as well as xml:id and just about any other XML or 'global' attribute). I assume that the intention is for applications to treat this attribute using the semantics suggested in the XML 1.0 spec. I do wish Atom had been explicit about this, as, for example, the XSLT 1.0 spec is.

[Uche Ogbuji]

via Copia

Recipe for freezing 4Suite or Amara apps (cross-platform)

Updated based on user experience.

Recently a user mentioned having trouble freezing an Amara app. This question seems to come up every six months or so, so I decided to make sure I have a recipe for easy reference. I also wanted to make sure that successful freezing would not require any changes in 4Suite before the next release. I started with the most recent success report I could find, by Roman Yakovenko. Actually, his recipe ran perfectly well as is; all I'm doing here is expanding on it.

Recipe: freezing 4Suite or Amara apps

Grab cxFreeze. (I used the 3.0.1 release, which I built from source on Fedora Core 4 Linux with Python 2.4.1.) Updated: I've updated freezehack.py to work with cxFreeze 3.0.2, thanks to Luis Miguel Morillas.

Grab freezehack.py, which was originally put together by Roman. Add it to your PYTHONPATH.

Add import freezehack to your main Python module for the executable to be created. Update: actually, based on Mike Powers' experience, you might have to add this import to every module that imports amara or Ft.

Freeze your program as usual. Run FreezePython.exe (or FreezePython on UNIX).

See the following sample session:

$ cat main.py
import freezehack
import amara
diggfeed = amara.parse("http://www.digg.com/rss/index.xml")
print diggfeed.rss.channel.item.title

$ FreezePython --install-dir dist --target-name testexe main.py
[SNIP]
Frozen binary dist/testexe created.

$ ./dist/testexe
Guess-the-Google - Simple but addictive game

In order to share the executable you have to copy the whole dist directory to the target machine, but that's all you should need to do. Python, 4Suite, Amara and any other such dependencies are bundled automatically.

Now back to the release.

[Uche Ogbuji]

via Copia