Hooking up an IRC Agent to a Query Interface

Uche gave an excellent suggestion to augment Emeka to work with Triclops. After finishing Triclops, I had realized that most of the functionality Emeka provided was now redundant, since it could be performed using Triclops (with the added advantage of being able to diagram/navigate RDF graphs). Unfortunately, Triclops URIs are very long for queries submitted through HTTP GET. This is mostly unavoidable because the parameters to the query service are Versa queries (which have their own syntax, including characters that are not very friendly to URIs) and URIs. The prospect of submitting URIs to a service identified by a URI will probably always be problematic, and might be as fundamental a problem as submitting XML content through an XML-based messaging format. I'm digressing, but the solution that worked for me (which Uche suggested) was to make use of a URI 'shortening' service (such as MakeAShorterLink, which I used) to generate a 'hashed' or 'compacted' form of the URI from the parts submitted from, say, IRC. For example, let's say I was interested in Daniel Krech's redfoot framework and wanted to navigate the ontology (available from the URL: http://redfoot.net/2005/redfoot) as an RDF diagram in order to get an idea of the components and how they relate. In particular, let's say I knew to start with the redfoot:Code class. I could interactively ask Emeka to execute the query:

redfoot:Code

against the URL which returns the RDF/XML serialization of the redfoot ontology. I could also ask Emeka to have the result rendered as a JPEG diagram with an associated image map that allows me to navigate to other aspects of the ontology to learn more about redfoot. Below is how I would do so with the new .triclops service:

<chimezie> .triclops "redfoot:Code" http://redfoot.net/2005/redfoot jpeg
<Emeka> Shortened link to Triclops request: http://makeashorterlink.com/?I1F7262EB

Below is the help associated with the triclops service.

<chimezie> .help triclops
<Emeka> .triclops "..Versa Query.." <source graph> <triples|jpeg|svg|versa>
<Emeka> The first argument is quoted Versa query to execute.  The second is a URL which points to the RDF graph (N3 or RDF/XML) to query against
<Emeka> The third argument is one of "triples", "jpeg", or "svg" and specifies how to return the query result
<Emeka> "triples" - raw triples in a tabled-grid, "jpeg" or "svg" - as navigable RDF graphs, and "versa" - raw Versa datatypes (rendered as html)
<Emeka> The result is a uri (courtesy of http://makeashorterlink.com) which redirects to the appropriate Triclops request
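
To make the URI-length problem concrete, here is a minimal Python sketch of how a .triclops command line might be split into its three arguments and percent-encoded into a single GET request URI, which is what would then be handed to a shortening service. The base URI and the parameter names (query, source, render) are placeholders invented for illustration; they are not the actual Triclops interface.

```python
import shlex
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
TRICLOPS_BASE = "http://example.org/triclops"
RESULT_FORMATS = {"triples", "jpeg", "svg", "versa"}


def parse_triclops_command(line):
    """Split '.triclops "<query>" <source graph> <format>' into its parts."""
    parts = shlex.split(line)  # honours the quotes around the Versa query
    if len(parts) != 4 or parts[0] != ".triclops":
        raise ValueError('usage: .triclops "query" <source graph> <format>')
    _, query, source, fmt = parts
    if fmt not in RESULT_FORMATS:
        raise ValueError("unknown result format: %s" % fmt)
    return query, source, fmt


def build_request_uri(query, source, fmt):
    """Percent-encode the Versa query and the source URL into one GET URI."""
    params = urlencode({"query": query, "source": source, "render": fmt})
    return TRICLOPS_BASE + "?" + params


command = '.triclops "redfoot:Code" http://redfoot.net/2005/redfoot jpeg'
uri = build_request_uri(*parse_triclops_command(command))
print(uri)
```

Even for this one-term query, the colon and slashes in both arguments are escaped, so the request URI grows quickly once the Versa query becomes non-trivial.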

[Uche Ogbuji]

via Copia

BNode Drama for your Mama

You know you are a geek when it's 5am and you are wrestling with existential quantification and its value in querying. This was triggered originally by the ongoing effort to extend an already expressive pattern-based RDF querying language to cover more use cases. The motivation is that such patterns should be expressive beyond just the level of triple-matching, since the core RDF model has a level of granularity below statements (you have literals, resources, bnodes, ...). I asked myself whether there was a justifiable reason why Versa at its core does not include BNodes:

Blank nodes are treated as simply indicating the existence of a thing, without using, or saying anything about, the name of that thing. (This is not the same as assuming that the blank node indicates an 'unknown' URI reference; for example, it does not assume that there is any URI reference which refers to the thing. The discussion of Skolemization in appendix A is relevant to this point.)

I don't remember the original motivation for leaving BNodes out of the core query data types, but in retrospect I think it was a good decision, and not only because the SPARQL specification does something similar (in interpreting BNodes as an open-ended variable). But it's worth noting that the section on blank nodes appearing in a query, as opposed to appearing in a query result (or existing in the underlying knowledge base), is quite short:

A blank node can appear in a query pattern. It behaves as a variable; a blank node in a query pattern may match any RDF term.

Anyway, at the time I noticed this lack of BNodes in query languages, I had a misconception about BNodes. I thought they represented individual things we want to make statements about, but whose identity we don't know or don't want to worry about assigning (this is probably 90% of the way BNodes are used in reality). This confusion came from the practical way BNodes are almost always handled by RDF data stores (Skolemization):

Skolemization is a syntactic transformation routinely used in automatic inference systems in which existential variables are replaced by 'new' functions - function names not used elsewhere - applied to any enclosing universal variables. In RDF, Skolemization amounts to replacing every blank node in a graph by a 'new' name, i.e. a URI reference which is guaranteed to not occur anywhere else. In effect, it gives 'arbitrary' names to the anonymous entities whose existence was asserted by the use of blank nodes: the arbitrariness of the names ensures that nothing can be inferred that would not follow from the bare assertion of existence represented by the blank node.
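
The transformation described in that passage can be sketched in a few lines of Python. This is only an illustration over plain tuples, assuming blank nodes are written as strings starting with "_:"; the URI prefix is a made-up placeholder, and a real store would have to guarantee that the generated names occur nowhere else.

```python
import itertools

# Triples are plain (subject, predicate, object) tuples.
# Blank nodes are assumed to be strings that start with "_:".
SKOLEM_PREFIX = "http://example.org/genid/"  # placeholder prefix


def is_bnode(term):
    return isinstance(term, str) and term.startswith("_:")


def skolemize(triples, prefix=SKOLEM_PREFIX):
    """Replace every blank node with a new URI, consistently per label."""
    counter = itertools.count(1)
    names = {}

    def rename(term):
        if not is_bnode(term):
            return term
        if term not in names:
            names[term] = "%s%d" % (prefix, next(counter))
        return names[term]

    return [tuple(rename(term) for term in triple) for triple in triples]


graph = [
    ("http://example.org/chime", "foaf:knows", "_:someone"),
    ("_:someone", "rdf:type", "foaf:Person"),
]
skolemized = skolemize(graph)
```

After the transformation the graph contains only names, which is why a store that works this way makes BNodes look like ordinary (if arbitrary) identifiers.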

This misconception was clarified when Bijan Parsia ({scope(PyChinko)} => {scope(FuXi)}) took issue with my assertion(s) that there are some compromising redundancies with BNodes, Literals, and simple entailment with regards to building programmatic APIs for them.

Then the light bulb went off: the semantics of BNodes are (as he put it) much stronger than the way they are most often used. Most people who use BNodes don't mean to state that there is a class of things which have the asserted set of statements made about them. Consider the difference between:

  1. Who are all the people Chime knows?
  2. There is someone Chime knows, but I just don't know his/her name right now
  3. Chime knows someone! (dudn't madder who)

The first scenario is the basic use case for variable resolution in an RDF query and is asking for the resolution of variable ?knownByChime in:

<http://metacognition.info/profile/webwho.xrdf#chime> foaf:knows ?knownByChime.

Which can be [expressed] in Versa (currently) as:

resource('http://metacognition.info/profile/webwho.xrdf#chime')-foaf:knows->*

Or eventually (hopefully) as:

foaf:knows(<http://metacognition.info/profile/webwho.xrdf#chime>)

And in SPARQL as:

select 
  ?knownByChime 
where  
{
  <http://metacognition.info/profile/webwho.xrdf#chime> foaf:knows ?knownByChime
}

The second case is the most common way people use BNodes. You want to say Chime knows someone but don't know a permanent identifier for this person or care to at the time you make the assertion:

<http://metacognition.info/profile/webwho.xrdf#chime> foaf:knows _:knownByChime.

But RDF-MT specifically states that BNodes are not meant to be interpreted in this way only. Their semantics are much stronger. In fact, as Bijan pointed out to me, the proper use for BNodes is as scoped existentials within ontological assertions. For example, owl:Restrictions allow you to say things like: the named class KnowsChime consists of everybody who knows Chime:

@prefix : <#>.
@prefix mc: <http://metacognition.info/profile/webwho.xrdf#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

:KnowsChime a owl:Class;
    rdfs:subClassOf
    [
      a owl:Restriction;
      owl:onProperty foaf:knows;
      owl:hasValue mc:chime
    ];
    rdfs:label "KnowsChime";
    rdfs:comment "Everybody who knows Chime".

The fact that BNodes aren't meant to be used in the way they often are has led to some suggested modifications to allow BNodes to be used as 'temporary identifiers' in order to simplify query resolution. But as clarified in the same thread, BNodes in a query don't make much sense, which is the conclusion I'm coming around to: there is no use case for asserting an existential quantification while querying for information against a knowledge base. Using a variable (in the way SPARQL does) should be sufficient. In fact, all RDF querying use cases (and languages) seem to be reducible to variable resolution.

This last part is worth noting because it suggests that if you have a library that handles variable resolution (such as rdflib's most recent addition), you can map any query language (Versa/SPARQL/RDFQueryLanguage_X) to it by reducing each query to a set of triple patterns with the variables you wish to resolve.
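
As a rough illustration of what "reducing to triple patterns" means, the Python sketch below resolves variables (strings starting with "?") against a list of plain tuples. It is not rdflib's API or any particular engine, just a naive nested-loop matcher, and the ex: names in the sample data are invented.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    result = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it against its earlier binding.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def resolve(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            for extended in [match_pattern(pattern, triple, bindings)]
            if extended is not None
        ]
    return solutions


CHIME = "http://metacognition.info/profile/webwho.xrdf#chime"
triples = [
    (CHIME, "foaf:knows", "ex:alice"),
    (CHIME, "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", CHIME),
]

# Scenario 1: who are all the people Chime knows?
known = resolve([(CHIME, "foaf:knows", "?knownByChime")], triples)

# Two patterns sharing a variable: people Chime knows who also know Chime.
mutual = resolve(
    [(CHIME, "foaf:knows", "?x"), ("?x", "foaf:knows", CHIME)], triples
)
```

A front end for Versa, SPARQL or any other language would only need to produce the list of patterns; the resolution step stays the same.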

So my conclusions?:

  • Blank Nodes are a necessary component in the model (and any persistence API) that unfortunately have much stronger semantics (existential quantification) than their most common use (as temporary identifiers)
  • The distinction between the way BNodes are most often used (as a syntactic shorthand for a single resource for which there is no known identity - at the time) and the formal definition of BNodes is very important to note - especially to those who are very much wed to their BNodes as Shelley Powers has shown to be :).
  • Finally, BNodes emphatically do not make sense in the context of a query - since they become infinitely resolvable variables: which is not very useful. This confusion is further proof that (once again), for the sake of minimizing said confusion and misinterpretation of some very complicated axioms, there is plenty of value in parenthetically (if not logically) divorcing (pun intended) RDF model theoretics from the nuts and bolts of the underlying model.

Chimezie Ogbuji

via Copia

Triclops - Resurrected

Like a phoenix from the flames, I've resurrected an old RDF querying interface called Triclops. It used to be coupled with the 4Suite repository's defunct Dashboard (which I've also tried to resurrect as an XForms interface to a live 4Suite repository - there is more to come on that front, thanks to FormFaces) but I've broken it out into its own standalone application. It's driven by this stylesheet which makes use of two XSLT extensions (http://metacognition.info/extensions/remote-query.dd#query and http://metacognition.info/extensions/remote-query.dd#graph) both of which are defined here.

I've updated the tabled triple result page to include ways to navigate graphs by clicking on subjects (which takes you to a subsequent triple result page for that particular subject), predicates (which takes you to another triple result page with statements which use that predicate), and objects (which allows you to jump from graph to graph along rdfs:seeAlso / rdfs:isDefinedBy relationships). Note that it assumes the objects of rdfs:seeAlso and rdfs:isDefinedBy are live URIs that return an RDF graph (the most common use for these is for FOAF networks and relationships between ontologies).

I've also included buttons for common, 'canned' queries that can be executed on any graph, such as:

  • All classes: type(list(rdfs:Class,owl:Class))
  • rss:items: type(rss:item)
  • Dated Resources: all()|-dc:date->*
  • Today's Items: all()|-dc:date->contains(.,".. today's date ..")
  • Annotated Resources: all()|-list(rdfs:label,rdfs:comment,dc:title,dc:description,rss:title,rss:description)->*
  • Ontologies: type(owl:Ontology)
  • People: type(foaf:Person)
  • Everything: all()
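
For anyone wanting to reuse these canned queries outside the Triclops UI, a minimal Python sketch might keep them in a lookup table and fill in today's date on demand. The query strings are copied from the list above; the `{today}` placeholder and the ISO 8601 date format are assumptions about how dc:date values are written in the graph being queried.

```python
import datetime

# Canned Versa queries keyed by button label. "{today}" is a placeholder
# (an assumption of this sketch) substituted when the query is requested.
CANNED_QUERIES = {
    "All classes": "type(list(rdfs:Class,owl:Class))",
    "rss:items": "type(rss:item)",
    "Dated Resources": "all()|-dc:date->*",
    "Today's Items": 'all()|-dc:date->contains(.,"{today}")',
    "Annotated Resources": (
        "all()|-list(rdfs:label,rdfs:comment,dc:title,"
        "dc:description,rss:title,rss:description)->*"
    ),
    "Ontologies": "type(owl:Ontology)",
    "People": "type(foaf:Person)",
    "Everything": "all()",
}


def canned_query(name, today=None):
    """Return the Versa query for a label, with today's date filled in."""
    today = today or datetime.date.today()
    return CANNED_QUERIES[name].replace("{today}", today.isoformat())


query = canned_query("Today's Items", datetime.date(2005, 10, 1))
```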

In addition, I've added some documentation.

P.S.: The nodes in the RDF diagrams generated by Triclops (as an alternative to raw triples) are live links. The JPEG diagrams are associated with image maps (generated by graphviz) that allow you to click on the nodes, and the SVG diagrams are rendered with links as well (depending on the level of maturity of your SVG viewer, this might or might not be the preferred diagram format). The SVG diagrams have a lot of potential for things such as styling and the other possibilities that are standard to SVG.

In all, I think it provides a very user-friendly way to quickly traverse, whittle, and circumnavigate the "Semantic Web" - whoops, there is that phrase again :)

4Suite Repository and 4Suite RDF have become sort of bastard children of recent 4Suite development and I've been focusing my efforts lately in moving the latter along - The former (the Repository) only lacks documentation IMHO as Metacognition is run entirely as a 4Suite Repository instance.

Chimezie Ogbuji

via Copia

Is RDF moving beyond the desperate hacker? And what of Microformats?

I've always taken a desperate hacker approach to RDF. I became a convert to the XML way of expressing documents right away, in 1997. As I started building systems that managed collections of XML documents I was missing a good, declarative means for binding such documents together. I came across RDF, and I was sold. I was never really a Semantic Web head. I used RDF more as a desperate hacker with problems in a fairly well-contained domain. At that time the Sem Web aspirations behind RDF didn't get in the way too badly, so all was well for me. My desperate hacker mindset is probably best summarized in this XML-DEV message from May 2001.

I see RDF as an excellent modeling tool for closed systems. In my practice, most of the real "knowledge" is in the XML documents at the nodes, but RDF can provide important indexing and relationship expression between these nodes.

I go on in that message to expand on where RDF fits into the architecture of apps for my needs. I also mention a bit of wariness about how RDF's extravagant ambition (i.e. Sem Web) could affect my simple, practical needs.

I quickly found out on www-rdf-logic that in the discussion there, the assumption appear to be that in the semantic Web the RDF statements would carry a heavy burden of the "knowledge" in the system. I've started to think that this idea is a straw man set up by folks who would like RDF to be a fully-blown knowledge-representation language, but if "strong RDF" is indeed a cog in the SW wheel, I fear I must excuse myself from contributing to that discussion because It places me immediately out of my depth.

I've spent a lot of time with RDF, and for a while it was a big part of our consulting practice, but recently applications architecture and schema design (RELAX NG mostly, thank goodness) have been the biggest part of the day job. Honestly, I started to lose touch with where RDF was going. I knew there were some common-sense fixes to bugs in the 1999 specs, but I also knew there were some worrying injections of Sem Web think into the model core. Recently I've had some opportunity to catch up. SPARQL just doesn't fit my head, so a few of us in the Versa 1.0 gang, including Mike Olson and Chimezie, have started work towards Versa 2.0. Mike and Chime have kept up with the state of RDF, and in several discussions, I expressed what I felt was a simple view of the RDF model and got in response what I thought were overblown claims about how the RDF model's semantics have been updated. In all cases when I checked the relevant parts of the latest RDF specs I found that Mike and Chime were right and it was rather the RDF model itself that was overblown.

I've developed an overall impression of dismay at the latest RDF model semantics specs. I've always had a problem with Topic Maps because I think that they complicate things in search of an unnecessary level of ontological purity. Well, it seems to me that RDF has done the same thing. I get the feeling that in trying to achieve the ontological purity needed for the Semantic Web, it's starting to leave the desperate hacker behind. I used to be confident I could instruct people on almost all of RDF's core model in an hour. I'm no longer so confident, and the reality is that any technology that takes longer than that to encompass is doomed to failure on the Web. If they think that Web punters will be willing to make sense of the baroque thicket of lemmas (yes, "lemmas", mi amici docte) that now lie at the heart of RDF, or to get their heads around such bizarre concepts as assigning identity to literal values, they are sorely mistaken. Now I hear the argument that one does not need to know hedge automata to use RELAX NG, and all that, but I don't think it applies in the case of RDF. In RDF, the model semantics are the primary reason for coming to the party. I don't see it as an optional formalization. Maybe I'm wrong about that and it's the need to write a query language for RDF (hardly typical for the Web punter) that is causing me to gurgle in the muck.

Assuming it were time for a desperate hacker such as me to move on (and I'm not necessarily saying that I am moving on), where would he go from here? I hear the chorus: microformats. But I see nothing but nasty pricklies down that road. IMO microformats are now where RDF was back in 1999 (actually more like 1998) in terms of practical use to the Web, but with their specification being nothing but a few notes scribbled in a Wiki, they are purely syntactic, and offer no semantic anchor. As such, I'm not sure why it makes sense to think of microformats as different from XML ca. 1997. What's the news there? They certainly don't solve my desperate hacker need for indexing and expressing relationships across XML documents. I don't need the level of grounding that RDF seems to so slavishly be aiming for these days, but I need more than scattered Wiki notes.

GRDDL is the RDF community's bid to fix microformats up with some grounding. Funny thing is that in GRDDL they are re-discovering what the desperate hackers at Fourthought devised almost four years ago in "document definitions" to map XML syntax to RDF statements using XPath and XSLT. The desperate hacker in me feels at the same time vindicated, and left in the weeds. Sure GRDDL gets RDF some of what I've thought it's needed for ages, but it still does wed microformats to the present-day RDF model, which is just what I'm becoming uneasy about.

I'm more wandering around than getting anywhere in this entry, I freely admit. Working the grounding layer for XML is still what I consider to be my work of primary career interest. Lately, this work has led me more in the direction of schema annotations, as you can see in some of my recent articles on IBM developerWorks. Architectural forms are the closest thing the SGML old-heads gave us to syntax-semantic grounding (grounded to HyTime, of course), and AF were a creature of the schema. Perhaps it's high time we went back to learn that old-head lesson and quit fiddling around with brittle post-schema transformations.

As for the modeling system to use as the basis for grounding XML syntax, I don't know. I stick to RDF for now, but I'll have to see if it's possible to use it interoperably while still ignoring the more esoteric flourishes it's picked up lately. The Versa discussions at first gave me the impression that these flourishes are inevitable, but more recent threads have been a bit more encouraging.

I certainly hope that it doesn't take another rewind to RDF circa 2000 to satisfy the desperate hacker.

[Uche Ogbuji]

via Copia

Itinerant Binds - Better Software Documentation

It was brought to my attention that my recent entry about Sparta/Versa/rdflib possibilities was a little vague/unclear. This tends to happen when I get caught up in an interest. Anyway, I renamed the module to Itinerant Binds (I liked the term) and created a page on Metacognition for the recent rdflib/4Suite RDF work I've been doing, with some more details on how the components work. I added an example that better demonstrates isolating RDF resources through targeted Versa queries and using the bound python result objects to modify / extend the underlying graph.

Chimezie Ogbuji

via Copia

RDF-API: Reconciling the redundancy in pythonic RDF store implementations

I just wrapped up the second of two rdflib-related libraries I wrote with the aim of bridging the gap between rdflib and 4Suite RDF. The latter (BoundVersaResult.py) is a little more interesting than the former in that it uses Sparta to allow the distinct components of a Versa query result to each be bound to appropriate python objects. 4Suite RDF's Versa implementation already provides such a binding:

  • String -> Python unicode
  • Number -> Python float
  • Boolean -> Python boolean
  • List -> Python list
  • Set -> Python Sets
  • Resource/BlankNodes -> Python unicode

The bindings for all the datatypes except Resource/BlankNodes are straightforward. This library extends the datatype binding to include the ability to bind Sparta Things to Versa Resources and BlankNodes. Since Sparta only works with rdflib Graphs, the FtRdfBackend.py module is used to wrap an rdflib.Graph around a 4Suite Model.

Sparta takes an RDF Graph and a defining Ontology which dictates the cardinality of properties bound to resource objects (Things). It allows an RDF Graph to be traversed (and extended) via pythonic idiom. The combination of being able to isolate resources by Versa query (or SPARQL queries eventually - as soon as the ongoing rdflib effort in that regard is completed) and bind them to python objects whose properties reflect the properties on the underlying RDF resources they are bound to is very cool, IMHO. The ability to provide an implementation agnostic way to modify an RDF graph, using a host language as expressive as Python is the icing on the cake. For example, check out the following code snippet demonstrating the use of this library:

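#Imports (4Suite module paths are assumed from a stock 4Suite install;
#VersaThingGenerator is assumed to live in the BoundVersaResult.py module described above)
import urllib2
from pprint import pprint
from Ft.Rdf import Model
from Ft.Rdf.Drivers import Memory
from Ft.Rdf.Serializers import Dom
from Ft.Xml import Domlette
from BoundVersaResult import VersaThingGenerator

#Setup FtRDF Model
Memory.InitializeModule()
db = Memory.GetDb('', '')
db.begin()
model = Model.Model(db)

#Parse my del.icio.us rss feed
szr = Dom.Serializer()
delUri="http://del.icio.us/rss/chimezie/academic+rdf"
domStr=urllib2.urlopen(delUri).read()
dom = Domlette.NonvalidatingReader.parseString(domStr,'http://del.icio.us/rss/chimezie')
szr.deserialize(model,dom,scope=delUri)

#Setup rdflib.Graph with FtRDF Model as Backend, using FtRdf driver
generator=VersaThingGenerator(model)
#Each rss:item is bound to a Sparta Thing; print its links, then its title and subjects
for item in generator.query("type(rss:item)"):
    for link in item.rss_link:
        pprint(link)
    print generator.query("distribute(@'%s','.-rss:title->*','.-dc:subject->*')"%item._id)[0]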

Note that (within the loop over the rss:items in the graph), the rss:link property returns an iterator over the possible values (since there is no defining ontology that could have specified that the rss:link property has a cardinality of 1, or is an inverse functional property - which would have caused Sparta to bind the rss_link property to a single object instead of an iterator).

The result of running this code:

u'http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0069'
[[u'More on additional semantic information from Enrico Franconi on 2004-07-12 (public-rdf-dawg@w3.org from 
July to September 2004)'], [u'academic architecture archive community dawg email logic query rdf reference 
semantic']]
u'http://www.w3.org/TR/swbp-specified-values/'
[[u'Representing Specified Values in OWL: "value partitions" and "value sets"'], [u'academic datatypes ontology owl 
rdf semantic standard w3c']]
u'http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0386.html'
[[u'boolean operators and type errors from Jeen Broekstra on 2005-09-07 (public-rdf-dawg@w3.org from July to 
September 2005)'], [u'academic architecture archive community dawg email logic rdf reference semantic w3c']]
u'http://www.w3.org/DesignIssues/Diff'
[[u'RDF Diff, Patch, Update, and Sync -- Design Issues'], [u'academic paper rdf semantic standards tbl w3c']]
u'http://www.w3.org/TR/rdf-dawg-uc/'
[[u'RDF Data Access Use Cases and Requirements'], [u'academic architecture framework query rdf reference semantic 
specification standard w3c']]
u'http://www.w3.org/DesignIssues/RDB-RDF'
[[u'Relational Databases and the Semantic Web (in Design Issues)'], [u'academic architecture framework rdb rdf 
reference semantic tbl w3c']]
u'http://www.w3.org/TR/swbp-n-aryRelations/'
[[u'Defining N-ary Relations on the Semantic Web: Use With Individuals'], [u'academic logic ontology owl predicate 
rdf reference relationships semantic standard w3c']]
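
The cardinality behaviour noted above (an iterator by default, a single object when the ontology says the property can have only one value) can be imitated with a few lines of plain Python. This is a toy stand-in written for illustration, not Sparta's implementation; the `Thing` class, its `functional` argument, and the sample data are all hypothetical.

```python
class Thing:
    """Toy attribute access over triples that share one subject."""

    def __init__(self, triples, subject, functional=()):
        self._triples = triples
        self._id = subject
        # Predicates declared (by some ontology) to have cardinality 1.
        self._functional = set(functional)

    def __getattr__(self, name):
        # Only reached for names not found normally, e.g. "rss_link".
        if name.startswith("_"):
            raise AttributeError(name)
        prefix, _, local = name.partition("_")
        predicate = "%s:%s" % (prefix, local)
        values = [
            o for s, p, o in self._triples
            if s == self._id and p == predicate
        ]
        if predicate in self._functional:
            return values[0] if values else None
        return iter(values)


triples = [
    ("ex:item1", "rss:link", "http://www.w3.org/DesignIssues/Diff"),
    ("ex:item1", "rss:title", "RDF Diff, Patch, Update, and Sync -- Design Issues"),
]

# No ontology: rss_link is an iterator over all values.
links = list(Thing(triples, "ex:item1").rss_link)

# With rss:link declared single-valued: rss_link is the value itself.
single = Thing(triples, "ex:item1", functional={"rss:link"}).rss_link
```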

Chimezie Ogbuji

via Copia

Wrapping rdflib's Graph around a 4RDF Model

Well, for some time I had pondered what it would take to provide SPARQL support in 4Suite RDF. I came upon sparql-p earlier and noticed it was essentially a SPARQL query processor without a parser to drive it. It works over a deprecated rdflib interface: TripleStore. The newly suggested interface is Graph, which is as solid a suggestion for a generic RDF API as any. So, I wrote a 4Suite RDF model backend for rdflib that allows the wrapping of Graph around a live 4Suite RDF model. Finally, I used this backend to execute a sparql-p query over http://del.icio.us/rss/chimezie:

SELECT
  ?title
WHERE {
  ?item rdf:type rss:item;
        dc:subject ?subj;
        rss:title ?title.
        FILTER (REGEX(?subj,".*rdf")).
}

The corresponding python code:

#Setup FtRDF Model
Memory.InitializeModule()   
db = Memory.GetDb('rules', 'test')
db.begin()
model = Model.Model(db)

#Parse my del.icio.us rss feed
szr = Dom.Serializer()
domStr=urllib2.urlopen('http://del.icio.us/rss/chimezie').read()        
dom = Domlette.NonvalidatingReader.parseString(domStr,'http://del.icio.us/rss/chimezie')
szr.deserialize(model,dom,scope='http://del.icio.us/rss/chimezie')

#Setup rdflib.Graph with FtRDF Model as Backend, using FtRdf driver
g=Graph(FtRdf(model))

#Setup sparql-p query processor engine
select = ("?title")

#Setup term
copia = URIRef('http://del.icio.us/chimezie')
rssTitle = URIRef('http://purl.org/rss/1.0/title')
versaWiki = URIRef('http://en.wikipedia.org/wiki/Versa')
dc_subject=URIRef("http://purl.org/dc/elements/1.1/subject")

#Filter on objects of statements (dc:subject values) - keep only those containing the string 'rdf'
def rdfSubFilter(subj,pred,obj):
    return 'rdf' in obj

#Execute query
where = GraphPattern([("?item",rdf_type,URIRef('http://purl.org/rss/1.0/item')),
                       ("?item",dc_subject,"?subj",rdfSubFilter),
                       ("?item",rssTitle,"?title")])    
tStore = myTripleStore(FtRdf(model))
result = tStore.query(select,where)
pprint(result)

The result (which will change daily as my links shift through my del.icio.us channel queue):

[chimezie@Zion RDF-API]$ python FtRdfBackend.py
[u'rdflibUtils',
 u'Representing Specified Values in OWL: "value partitions" and "value sets"',
 u'Sparta',
 u'planner-rdf',
 u'RDF Template Language 1.0',
 u'SIOC Vocabulary Specification',
 u'SPARQL in RDFLib',
 u'MeetingRecords - ESW Wiki',
 u'Enumerated datatypes (OWL)',
 u'Defining N-ary Relations on the Semantic Web: Use With Individuals']
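
One detail worth spelling out: the SPARQL query filters with REGEX(?subj,".*rdf") while the Python callback uses a plain substring test. Because regular-expression matching of this kind is an unanchored search, the two should select the same dc:subject values. The small Python check below illustrates this with the standard re module; the function names and sample strings are made up for the example.

```python
import re


def regex_filter(subject_value):
    """Unanchored regex search, mirroring REGEX(?subj, ".*rdf")."""
    return re.search(".*rdf", subject_value) is not None


def substring_filter(subject_value):
    """Plain substring test, mirroring the Python filter callback."""
    return "rdf" in subject_value


samples = [
    "academic owl rdf semantic",
    "rdflib python",
    "xml xslt schema",
    "",
]
agreement = [regex_filter(s) == substring_filter(s) for s in samples]
```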

Chimezie Ogbuji

via Copia

What Are You Doing, Dave?

I just updated the 4Suite Repository Ontology (as an OWL instance). Specifically, I added appropriate documentation for most of the major components and added rdfs:subPropertyOf/rdfs:subClassOf/rdfs:seeAlso relationships with appropriate / related vocabularies (WordNet/FOAF/Dublin Core/Wikipedia). In addition, where appropriate, I've added links to 4Suite literature (currently scattered between IBM developerWorks articles/tutorials and Uche's Akara sections).

There are some benefits:

  • It can serve as a framework for documenting the 4Suite repository (to augment the very sparse documentation that does exist)
  • It provides a formal model for the underlying RDF Graph that 'drives' the repository

This latter benefit might not be so obvious, but imagine being able to provide rules that cause implications identifying certain repository containers as RSS channels (and their child XML documents / RDF documents as the corresponding RSS items) and associating FOAF metadata with repository users.

Some of the more powerful hooks to the System RDF graph (which the above ontology is a model of) - such as the starting/stopping of servers (currently triggered by the fchema:server.running property on fchema:server instances), purging of resources marked as temporary (by the fchema:time_to_live property), and triggering of an XSLT transform (by the fchema:run_on_strobe property) - can be further augmented by other associations in the graph, resulting in an almost 'sentient' content/application server. A little far-fetched?

[Uche Ogbuji]

via Copia

Extracting RDF from XML in 'Closed' vs 'Open Systems'

For some time, I had wanted to write a bit about 4Suite's Document Definitions - especially after first reading about the concept of Gleaning Resource Descriptions from Dialects of Languages (GRDDL). You see, the idea isn't so novel to me, since I've been involved in 4Suite development for some time and am familiar with the concept of a Document Definition. Unfortunately, 4Suite's Achilles heel is documentation (no pun intended), but I've managed to find a representative thread on the subject within the mailing list archives. In addition, I've included a decent definition (by Mike Brown) from his overview of the repository:

A DocumentDefinition is a resource that describes how to derive RDF statements from the XML -- deserialization guidelines, basically. Its content can either be XML or XSLT that follows certain guidelines. When the XmlDocument that is associated with this docdef is created, updated, or deleted, RDF statements will be updated automatically in the user model. This is really powerful, and is described in more detail here (free registration required). As an example, if the XML doc is XHTML, then you could write a docdef to generate a Dublin Core 'title' RDF statement from the /html/head/title element. Anytime the XML doc is updated, the RDF statements derived from it via the docdef will also be updated. These statements, being automatically managed, are stored in the "system" model, but there has been some discussion as to whether that is appropriate and how it might change in the future. Only one docdef can be associated with a document, but docdefs can import definitions from one another, if needed
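
The XHTML example in that definition is easy to sketch. The Python below is not a 4Suite docdef (those are written as XML or XSLT); it only imitates the effect, deriving a Dublin Core title statement from /html/head/title using the standard library. The function name and the document URI are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"


def derive_statements(xml_text, document_uri):
    """Derive (subject, predicate, object) statements from an XHTML doc."""
    root = ET.fromstring(xml_text)  # root is the html element
    title = root.find("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))
    if title is None or not (title.text or "").strip():
        return []
    return [(document_uri, DC_TITLE, title.text.strip())]


doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Triclops - Resurrected</title></head>
  <body/>
</html>"""

uri = "http://example.org/docs/triclops.html"  # hypothetical document URI
statements = derive_statements(doc, uri)

# When the document changes, the derived statements are simply recomputed.
updated = derive_statements(doc.replace("Resurrected", "Revisited"), uri)
```

In the repository this recomputation happens automatically on create, update and delete; here it is an explicit function call.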

The primary difference between GRDDL (as I understand the principle) and Document Definitions is that GRDDL is an attempt to provide a mechanism for extracting RDF from microformats (subsets of XHTML) 'in the wild.' The XML content transformed (via XSLT) is often embedded within presentation markup and perhaps constructed w/ little regard to validity (with respect to a governing schema). The value is in being able to harvest RDF content from sources designed with more human readability than machine readability in mind. The sheer number of such documents is a multiplicative factor to how much useful information can be extracted.

Document Definitions on the other hand are meant to work in a closed system where the XML vocabulary is self-contained and most often valid (with respect to a well known format) as well as well-formed (the requirement common to both scenarios). The different contexts are very significant and describe two completely divergent approaches to applying RDF to solve Knowledge Management problems.

There are some well known advantages to writing XML->RDF transforms for closed vocabularies / systems (portability, easing the RDF/XML serialization learning curve, etc.) and there are some that are not as well known (IMHO). In particular, writing transforms for closed vocabularies essentially allows the XML vocabulary to behave as a communication medium between systems that 'speak XML' and an RDF datastore.

Consider Bill de hOra's issues with binding forms (HTML in his case) to RDF via the RDF/XML syntax. This is an irresolvable disaster and the culprit is the violent impedance mismatch between the XML and RDF data structures that manifests itself in the well documented horrors of RDF/XML as a persistent representation of an RDF graph.

Consider a more elegant architecture: Building an XForms UI on top of XML instances (associated with - but not necessarily validated by - a schema) and automatically transposed (by a transform written once) to a corresponding RDF graph. The strengths of both data formats are emphasized in this scenario and the impedance mismatch is completely resolved by pushing the onus from forms authoring to a well designed transform (written once only).

[Uche Ogbuji]

via Copia