Extracting RDF from XML in 'Closed' vs 'Open Systems'

For some time, I had wanted to write a bit about 4Suite's Document Definitions - especially after first reading about the concept of Gleaning Resource Descriptions from Dialects of Languages (GRDDL). You see, the idea isn't so novel to me since I've been involved in 4Suite development for some time and familiar with the concept of a Document Definition. Unfortunately, 4Suite's Achilles heel is documentation (no pun intended), but I've managed to find a representative thread on the subject within the mailing list archives. In addition, I also included a decent definition (by Mike Brown) from his overview of the repository:

A DocumentDefinition is a resource that describes how to derive RDF statements from the XML -- deserialization guidelines, basically. Its content can either be XML or XSLT that follows certain guidelines. When the XmlDocument that is associated with this docdef is created, updated, or deleted, RDF statements will be updated automatically in the user model. This is really powerful, and is described in more detail here (free registration required). As an example, if the XML doc is XHTML, then you could write a docdef to generate a Dublin Core 'title' RDF statement from the /html/head/title element. Anytime the XML doc is updated, the RDF statements derived from it via the docdef will also be updated. These statements, being automatically managed, are stored in the "system" model, but there has been some discussion as to whether that is appropriate and how it might change in the future. Only one docdef can be associated with a document, but docdefs can import definitions from one another, if needed

The primary difference between GRDDL (as I understand the principle) and Document Definitions is that GRDDL is an attempt to provide a mechanism for extracting RDF from microformats (subsets of XHTML) 'in the wild.' The XML content transformed (via XSLT) is often embedded within presentation markup and perhaps constructed w/ little regard to validity (with respect to a governing schema). The value is in being able to harvest RDF content from sources designed with more human readability than machine readability in mind. The sheer number of such documents is a multiplicative factor to how much useful information can be extracted.

Document Definitions on the other hand are meant to work in a closed system where the XML vocabulary is self-contained and most often valid (with respect to a well known format) as well as well-formed (the requirement common to both scenarios). The different contexts are very significant and describe two completely divergent approaches to applying RDF to solve Knowledge Management problems.

There are some well known advantages to writing XML->RDF transforms for closed vocabularies / systems (portability, easing the RDF/XML serialization learning curve,etc..) and there are some that not as well known (IMHO). In particular, writing transforms for closed vocabularies essentially allows the XML vocabulary to behave as a communication medium between systems that 'speak XML' and an RDF datastore.

Consider Bill de hOra's issues with binding forms (HTML in his case) to RDF via the RDF/XML syntax. This is an irresolvable disaster and the culprit is the violent impedance mismatch between the XML and RDF data structures that manifests itself in the well documented horrors of RDF/XML as a persistent representation of an RDF graph.

Consider a more elegant architecture: Building an XForms UI on top of XML instances (associated with - but not necessarily validated by - a schema) and automatically transposed (by a transform written once) to a corresponding RDF graph. The strengths of both data formats are emphasized in this scenario and the impedance mismatch is completely resolved by pushing the onus from forms authoring to a well designed transform (written once only).

[Uche Ogbuji]

via Copia