Comprehensive RDF Query APIs for RDFLib

[by Chimezie Ogbuji]

RDFLib's support for SPARQL has come full circle, and I wasn't planning on blogging on the developments until they had settled some – and they have. In particular, the last piece was finalizing a set of APIs for querying and result processing that fit well within the framework of RDFLib's various Graph APIs. The other issue was for the query APIs to accommodate eventual support for other query languages (Versa, for instance) that are capable of picking up the slack where SPARQL is wanting (transitive closures, for instance – try composing a concise SPARQL query for calculating the transitive closure of a given node along the rdfs:subClassOf property and you'll immediately see what I mean).
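
To make the pain point concrete, here's a minimal sketch of computing that closure procedurally with RDFLib's Graph API (assuming a current RDFLib; the namespace and file name are placeholders):

from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/ns#")   # placeholder namespace

graph = Graph()
graph.parse("ontology.rdf")                # placeholder ontology file

def superclass_closure(graph, node):
    # Transitive closure of `node` along rdfs:subClassOf
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in graph.objects(current, RDFS.subClassOf):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

for ancestor in superclass_closure(graph, EX.SomeClass):
    print(ancestor)

A handful of lines of Python, but not something current SPARQL lets you say declaratively.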

Querying

Every Graph instance now has a query method through which RDF queries can be dispatched:

def query(self, strOrQuery, initBindings={}, initNs={}, DEBUG=False, processor="sparql"):
    """
    Executes a SPARQL query (eventually will support Versa queries with the same method) against this Conjunctive Graph.
    strOrQuery   - Either a string containing the SPARQL query or an instance of rdflib.sparql.bison.Query.Query
    initBindings - A mapping from variable names to RDFLib terms (used as initial bindings for the SPARQL query)
    initNs       - A mapping from namespace prefixes to instances of rdflib.Namespace (used for the SPARQL query)
    DEBUG        - A boolean flag passed on to the SPARQL parser and evaluation engine
    processor    - The kind of RDF query (must be 'sparql' until Versa is ported)
    """

The first argument is either a query string or a pre-compiled query object (compiled using the appropriate BisonGen mechanism for the target query language). Pre-compilation can be useful for avoiding redundant parsing overhead for queries that need to be evaluated repeatedly:

from rdflib.sparql.bison import Parse
queryObject = Parse(sparqlString)
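
Once compiled, the query object can be passed to the query method as many times as needed; a quick sketch (the query string is a placeholder):

from rdflib import Graph
from rdflib.sparql.bison import Parse

queryObject = Parse("SELECT ?s WHERE { ?s ?p ?o }")   # parse once

graph = Graph()
for i in range(10):                                   # evaluate many times
    results = graph.query(queryObject)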

The next argument (initBindings) is a dictionary that maps variables to their values. Though variables are common to both languages, SPARQL variables differ from Versa variables in that they are string terms of the form “?varName”, whereas Versa variables are QNames (as in XPath). For SPARQL queries the dictionary is expected to map variables to RDFLib terms, and it is passed on to the SPARQL processor as the initial variable bindings.

initNs is yet another top-level parameter for the query processor: a namespace mapping from prefixes to namespace URIs.

The DEBUG flag is pretty self-explanatory. When True, it will cause additional print statements to appear for the parsing of the query (triggered by BisonGen) as well as for the patterns and constraints passed on to the processor (for SPARQL queries).

Finally, the processor specifies which kind of processor to use to evaluate the query: 'versa' or 'sparql'. Currently (with emphasis on 'currently'), only SPARQL is supported.
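
Putting the pieces together, a typical invocation looks something like the following sketch (the namespace, data file, and URIs are illustrative):

from rdflib import Graph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

graph = Graph()
graph.parse("people.rdf")   # illustrative data file

results = graph.query(
    "SELECT ?name WHERE { ?person foaf:name ?name }",
    initNs={"foaf": FOAF},
    initBindings={"?person": URIRef("http://example.org/people/chime")},
    processor="sparql")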

Result formats

SPARQL has two result formats (JSON and XML). Thanks to Ivan Herman's recent contribution, the SPARQL processor now supports both. The query method (above) returns instances of QueryResult, a common class for RDF query results which defines the following method:

def serialize(self, format='xml')

The format argument determines which result format to use. For SPARQL queries, the allowable values are: 'graph' – for CONSTRUCT / DESCRIBE queries (in which case a resulting Graph object is returned), 'json', or 'xml'. The resulting object also behaves as an iterator over the bindings, for manipulation in the host language (Python).
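
For example, a SELECT result can be consumed either way (continuing the sketch above):

print(results.serialize(format='xml'))   # the SPARQL Query Results XML format
print(results.serialize(format='json'))  # or the JSON result format

for row in results:                      # or walk the bindings directly in Python
    print(row)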

Versa has its own set of result formats as well. Primarily there is an XML result format (see: Versa by Example), as well as Python classes for the various internal datatypes: strings, resources, lists, sets, and numbers. So the eventual idea is to use the same serialization method signature for Versa query results as for SPARQL results.

SPARQL Limitations

The known SPARQL query forms that aren't supported are:

  • DESCRIBE/CONSTRUCT (very temporary)
  • Nested Group Graph Patterns
  • Graph Patterns can only be used (once) by themselves or with OPTIONAL patterns
  • UNION patterns can only be combined with OPTIONAL patterns
  • Outer FILTERs which refer to variables within an OPTIONAL

A lot of the above limitations can be addressed with formal equivalence axioms for SPARQL semantics, such as those mentioned in the recent paper on the complexity and semantics of SPARQL. Since there is very little guidance on the semantics of SPARQL, I was left with the option of implementing only those equivalences that seemed obvious (in order to support the patterns in the DAWG test suite):

1) { patternA { patternB } } => { patternA. patternB }
2) { basicGraphPatternA OPTIONAL { .. } basicGraphPatternB }
  =>
{ basicGraphPatternA+B OPTIONAL { .. }}

It's still a work in progress but has come quite a long way. CONSTRUCT and DESCRIBE are already supported by the underlying processor; I just need to find some time to hook them up to the query interfaces. Time is something I've been really short of lately.

Chimezie Ogbuji

via Copia

Updating Metacognition Software Stack

Metacognition was down for some time as I updated the software stack that it runs on (namely 4Suite and RDFLib). There were some core changes:

  • 4Suite repository was reinitialized using the more recent persistence drivers.
  • I added a separate section for archived publications (I use Google Reader's label service for Copia archives)
  • I switched to using RDFLib as the primary RDF store (using a recently written mapping between N3/FOL and a highly efficient SQL schema) and the filesystem for everything else
  • I added the SIOC ontology to the core index of ontologies
  • Updated my FOAF graph and Emeka's DOAP graph

Earlier, I wrote up the mapping that RDFLib's FOPLRelationalModel is based on in a formal notation (using MathML, so it will only be viewable in a browser that supports it, like Firefox).

In particular, the site is considerably more responsive and better organized. Generally, I hope for it to serve two purposes: 1) organize my thoughts on, and software related to, applying Semantic Web technologies to 'closed' content management systems, and 2) serve as a (markdown-powered) whiteboard and playground for tools / demos for advocacy on best practices in problem solving with these technologies.

Below is a marked-up diagram of some of these ideas.

Metacognition-Roadmap

The publications are stored in a single XML file and are rendered (at run-time on the server) using a pre-compiled XSLT stylesheet against a cached Domlette. Internally the document is mapped into RDF persistence using an XSLT document definition associated with the document, so all modifications are synced into an RDF/XML equivalent.
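
For what it's worth, the rendering step amounts to something like the sketch below (file names are placeholders, and I'm assuming the stock 4Suite 1.0 Domlette / XSLT APIs):

from Ft.Xml import InputSource
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.Xslt import Processor

# Parse the publications document once and keep the Domlette tree cached
publicationsDoc = NonvalidatingReader.parseUri("file:publications.xml")

# Compile the stylesheet once, up front
processor = Processor.Processor()
processor.appendStylesheet(
    InputSource.DefaultFactory.fromUri("file:publications.xslt"))

# At request time, transform the cached tree
html = processor.runNode(publicationsDoc, "file:publications.xml")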

Mostly as an academic exercise - since the 4Suite repository (currently) doesn't support document definitions that output N3 and the content-type of XSLT transform responses is limited to HTML/XML/Text - I wrote an equivalent publications-to-n3.xslt. The output is here.

Chimezie Ogbuji

via Copia

SPARQL BisonGen Parser Checked in to RDFLib

[by Chimezie Ogbuji]

This is basically an echo of my recent post to the rdflib mailing list (yes, we have one now).

I just checked in the most recent version of what had been an experimental BisonGen SPARQL parser for RDFLib. It parses a SPARQL query into a set of Python objects representing the components of the grammar.

The parser itself is a Python/C extension (but the BisonGen grammar could be extended to incorporate Java callbacks instead), so the setup.py had to be modified in order to compile it into a Python module. The BisonGen files themselves are:

  • SPARQL.bgen (the main file that includes the others)
  • SPARQLTurtleSuperSet.bgen.frag (the second part of the grammar which focuses on the variant of Turtle that SPARQL uses)
  • SPARQLTokens.bgen.frag (Token definitions)
  • SPARQLLiteralLexerPatterns.bgen.frag (Keywords and 'literal' tokens)
  • SPARQLLexerPatterns.bgen.frag (lexer definition)
  • SPARQLLexerDefines.bgen.frag (the lexer patterns themselves)
  • SPARQLParser.c (the generated parser)

Theoretically, the second part of the grammar, dedicated to the Turtle syntax, could be broken out into separate Turtle/N3 parsers which could be built into RDFLib, removing the current dependency on n3p.

I also checked in a test harness that's meant to work with the DAWG test cases.

I'm currently stuck on this particular test case, but working through it. The majority of the grammar is supported, except for mathematical expressions and certain case-insensitive variations on the SPARQL operators.

The test harness only checks parsing; it doesn't evaluate the parsed query against the corresponding set of test data, but it could easily be extended to do so. I'm not sure about the state of those test cases: some have been 'accepted' and some haven't. In addition, I came across a couple that were illegal according to the most recent SPARQL grammar (the bad tests are noted in the test harness). Currently the parser is stand-alone; it doesn't invoke sparql-p for a few reasons:

  • I wanted to get it through parsing the queries in the test case
  • Our integrated version of sparql-p is outdated as there is a more recent version that Ivan has been working on with some improvements that should probably be considered for integration
  • Some of the more complex combinations of Graph Patterns don't seem solvable without re-working / extending the expansion tree solver. I have some ideas about how this could be done (to handle things like nested UNIONS and OPTIONALs) but wanted to get a working parser in first

Using the parser is simple:

from rdflib.sparql.bison import Parse
p = Parse(query, DEBUG)
print p

p is an instance of rdflib.sparql.bison.Query.Query.

Most of the parsed objects implement a __repr__ method which recursively prints a 'meaningful' representation down the hierarchy to the lower-level objects, so tracing how each __repr__ method is implemented is a good way to figure out how to deconstruct the parsed SPARQL query object.

These methods could probably be re-written to echo the SPARQL query right back, as a way to:

  • Test round-tripping of SPARQL queries
  • Create SPARQL queries by instantiating the rdflib.sparql.bison.* objects and converting them to strings

It's still a work in progress, but I think it's far enough through the test cases that it can handle most of the more common syntax.

Working with BisonGen was a good experience for me, as I hadn't done any real work with parser generators since my days at the University of Illinois (class of '99). There are plenty of good references online for the Flex pattern format as well as for Bison itself. I also got some good pointers from AndyS and EricP on #swig.

It also was an excellent way to get familiar with the SPARQL syntax from top to bottom, since every possible nuance of the grammar that may not be evident from the specification had to be addressed. It also generated some comments on inconsistencies in the specification grammar that I've since redirected to public-rdf-dawg-comments.

Chimezie Ogbuji

via Copia

Cleveland Clinic Job Posting for Data Warehouse Specialist

Cleveland Clinic Foundation has recently posted a job position for a mid-level database developer with experience in semi-structured data binding, transformation, modeling, and querying. I happen to know that experience with the following are big pluses:

  • Python
  • XML data binding
  • XML / RDF querying (SPARQL, XPath, XQuery, etc.)
  • XML / RDF modeling
  • Programming database connections
  • *NIX System administration
  • Web application frameworks

You can follow the above link to the posting and submit an application.

Chimezie Ogbuji

via Copia

Practical Temporal Reasoning with Notation 3

Recently, Dan Brickley expressed an interest in the extent to which Bioinformatic research efforts are leveraging RDF for temporal reasoning (and patient healthcare record integration - in general). The thread on the value of modeling temporal relations explicitly versus relying on them being built into core RDF semantics left me feeling like a concrete example was in order.

We have a large (3500+ assertions) OWL Full ontology describing all the data we collect about Cardiothoracic procedures (the primary purpose of our database as currently constituted – in a relational model). There are several high-level classes we use to model concepts that, though core to our model, can be thought of as general enough for a common upper ontology for patient data.

One of the classes is ptrec:TemporalData (from here on out, I'll be using the ptrec prefix to describe vocabulary terms in our ontology) which is the ancestor of all classes that are expressed on an axis of time. We achieve a level of precision in modeling data on a temporal axis that enhances the kind of statistical analysis we perform on a daily basis.

In particular we use three properties:

  • ptrec:startDT (xs:dateTime)
  • ptrec:stopDT (xs:dateTime)
  • ptrec:instantDT (xs:dateTime)

The first two describe an explicit (and 'proper') interval for an event in a patient record; this is often the case where the event in question only has a date (not a time) associated with it. The third is used when the event is instantaneous and the associated date / time is known.

The biggest challenge isn't simply the importance of time in asking questions of our data, but of temporal factors that are keyed off specific, moving points of reference. For example, consider a case study on the effects of administering a medication within X days of a specific procedure. The qualifying procedure is key to the observations we wish to make and behaves as a temporal anchor. Another case study interested in the effects of administering the same medication, but with respect to a different procedure, should be expected to rely on the same temporal logic – but keyed off a different point in time. However, by being explicit about how we place temporal data on a time axis (as instants or intervals) we can outline a logic for general temporal reasoning that can be used by either case study.

Linking into an OWL time ontology we can setup some simple Notation 3 rules for inferring interval relationships to aid such questions:

#Inferring before and after temporal relationships (between instants and intervals alike)
{?a a ptrec:TemporalData;
    ptrec:instantDT ?timeA. 
 ?b a ptrec:TemporalData;
    ptrec:instantDT ?timeB. ?timeA str:greaterThan ?timeB} 

         => {?a time:intAfter ?b.?b time:intBefore ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB. ?startTimeA str:greaterThan ?stopTimeB} 

         => {?a time:intAfter ?b.?b time:intBefore ?a}

#Inferring during and contains temporal relationships (between proper intervals)
#Since there is no str:greaterThanOrEqual CWM function, the various permutations
#are spelled out explicitly
{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:lessThan ?startTimeB. ?stopTimeA str:greaterThan ?stopTimeB} 

         => {?a time:intContains ?b.?b time:intDuring ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:equalIgnoringCase ?startTimeB. ?stopTimeA str:greaterThan ?stopTimeB} 

     => {?a time:intContains ?b.?b time:intDuring ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:lessThan ?startTimeB. ?stopTimeA str:equalIgnoringCase ?stopTimeB} 

     => {?a time:intContains ?b.?b time:intDuring ?a}

Notice the value of xs:dateTime literals being ordered the same way temporally and as Unicode strings. This allows us to rely on str:lessThan and str:greaterThan for determining interval intersection and overlap.
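
A quick illustration of that property (assuming the timestamps share the same timezone designator and formatting, as ours do):

earlier = "2006-03-26T00:35:02Z"
later   = "2006-03-26T00:36:51Z"

# Plain (lexical) string comparison agrees with chronological order
assert earlier < later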

Terms such as 'preoperative' (which refer to events that occurred before a specific procedure / operation) and 'postoperative' (events that occurred after a specific procedure / operation), which are core to general medical research nomenclature, can be tied directly into this logic:

{?a a ptrec:TemporalData.  ?b a ptrec:Operation. ?a time:intBefore ?b}
   => {?a ptrec:preOperativeWRT ?b}

{?a a ptrec:TemporalData.  ?b a ptrec:Operation. ?a time:intAfter ?b}
   => {?a ptrec:postOperativeWRT ?b}

Here we introduce two terms (ptrec:preOperativeWRT and ptrec:postOperativeWRT) which relate temporal data with an operation in the same patient record. Using interval relationships as a foundation, you can link domain-specific temporal vocabulary into your temporal reasoning model and rely on a reasoner to set up a framework for temporal reasoning.

Imagine the value in using a backward-chaining prover (such as Euler) to logically demonstrate exactly why a specific medication (associated with the date when it was administered) is considered to be preoperative with respect to a qualifying procedure. This would complement the statistical analysis of a case study quite nicely with formal logical proof.

Now, it's worth noting that such a framework (as it currently stands) doesn't allow precision of interval relationships beyond simple intersection and overlap. For instance, in most cases you would be interested primarily in medication administered within a specific length of time. This doesn't really impact the above framework, since it amounts to no more than a functional requirement to be able to perform calendar math. Imagine if the built-in properties of CWM were expanded to include functions for performing date math, for instance a time:addDT function that adds an xs:duration to an xs:dateTime (used in the rule below).

With such a function we can expand our logical framework to include more precise temporal relationships.
For example, if we only wanted medications administered within the 30 days prior to an operation to be considered 'preoperative':

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.
 ?b a ptrec:Operation;
    ptrec:startDT ?opStartTime;
    ptrec:stopDT ?opStopTime.
 ?a time:intBefore ?b.
 #time:addDT is the imagined date-math built-in: 30 days before the operation start
 (?opStartTime "-P30D") time:addDT ?preOpMin. ?preOpMin str:lessThan ?stopTimeA}
    => {?a ptrec:preOperativeWRT ?b}

It's worth noting that such an addition (to facilitate calendar math) would be quite useful as a general extension for RDF processors.
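
Outside of an N3 rule engine, the calendar math itself is trivial in the host language; a minimal Python sketch of the check the rule above is after (the dates are illustrative):

from datetime import datetime, timedelta

op_start = datetime(2006, 3, 26)             # operation start
med_stop = datetime(2006, 3, 10)             # when the medication ended

pre_op_min = op_start - timedelta(days=30)   # the "-P30D" offset
is_preoperative = pre_op_min <= med_stop < op_start
print(is_preoperative)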

For the most part, I think the majority of the requirements for temporal reasoning (in any domain) can be accommodated by explicit modeling, because FOPL (the foundation upon which RDF is built) was designed to be expressive enough to represent all human concepts.

Chimezie Ogbuji

via Copia

Extension Functionality and Set Manipulation in RDF Query Languages

A recent bit by Andy Seaborne (on Property Functions in ARQ – Jena's query engine) got me thinking about general extension mechanisms for RDF querying languages.
In particular, he mentions two extensions that provide functionality for processing RDF lists and collections which (ironically) coincide with functions I had requested be considered for later generations of Versa.

The difference, in my case, was that the suggestions were for functions that cast RDF lists into Versa lists (or sets) – which are data structures native to Versa that can be processed with certain built-in functions.

Two other extensions I use quite often in Versa (and mentioned briefly in my XML.com article) are scope and scoped-subquery. These have to do with identifying the context of a resource and limiting the evaluation of a query to within a named graph, respectively. Currently, the scope function returns a list of the names of all graphs in which the resource is asserted as a member of any class (via rdf:type). I could imagine this being expanded to include the names of any graph in which statements about the resource are asserted. scoped-subquery doesn't present much value for a host language that can express queries as limited to a named context.

I also had some thoughts about an extension function mechanism that allowed an undefined function reference (for functions of arity 1 – i.e. functions that take only a single argument) to be interpreted as a request for all the objects of statements where the predicate is the function URI and the subject is the argument.
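
A rough sketch of what that fallback could look like against an RDFLib graph (the function name and data file here are purely illustrative):

from rdflib import Graph, URIRef

def predicate_function(graph, function_uri, argument):
    # Treat an undefined unary function as a property lookup:
    # return all objects of statements (argument, function_uri, ?o)
    return list(graph.objects(argument, URIRef(function_uri)))

graph = Graph()
graph.parse("people.rdf")   # illustrative data
# e.g. interpreting foaf:knows(<http://example.org/people/chime>)
print(predicate_function(graph,
                         "http://xmlns.com/foaf/0.1/knows",
                         URIRef("http://example.org/people/chime")))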

I recently finished writing a SPARQL grammar for BisonGen and hope to conclude that effort (at some point) once I get over a few hurdles. I was pleasantly surprised to find that the grammar for function invocation is pretty much identical for both query languages, which suggests there is room for some thought about a common mechanism (or just a common set of extension functionality, similar to the EXSLT effort) for RDF querying or general processing.

CWM has a rich and well-documented set of RDF extensions. The caveat is that the method signatures are restricted to dual input (subject and object), since the built-ins are expressed as RDF triples where the predicate is the name of the function and the subject and object are its arguments. Nevertheless, it is a good source from which an ERDF specification could be drafted.

My short wish-list of extension functions in such an effort would include:

  • List comprehension (intersection, union, difference, membership, indexing, etc.)
  • Resolving the context for a given node: context(rdfNode) => URI (or BNode?)
  • an is-a(resource) function (equivalent to Versa's type function without the entailment hooks)
  • a class(resource) which is an inverse of is-a
  • Functions for transitive closures and/or traversals (at the very least)
  • A fallback mechanism for undefined functions that allowed them to be interpreted as unary 'predicate functions'

Of course, functions which return lists instead of single resources would be problematic for any host language that can't process lists, but it's just some food for thought.
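
Just to make a couple of the wish-list items concrete, here is how is-a and its inverse might look over an RDFLib graph (a sketch only; no entailment, just the asserted rdf:type statements):

from rdflib import Graph, RDF

def is_a(graph, resource):
    # All classes the resource is asserted to be an instance of
    return set(graph.objects(resource, RDF.type))

def class_(graph, klass):
    # Inverse of is_a: all asserted instances of the class
    return set(graph.subjects(RDF.type, klass))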

Chimezie Ogbuji

via Copia

"Semantic Transparency" by any other name

In response to "Semantic hairball, y'all" Paul Downey responded with approval of my skewering of some of the technologies I see dominating the semantics space, but did say:

..."semantic transparency" in "XML Schema" sounds just a little too scary for my tastes....

This could mean that the names sound scary, or that his interpretation of the idea itself sounds scary. If it's the latter, I'll try to show soon that the idea is very simple and shouldn't be at all scary. If it's the former, the man has a point. "Semantic Transparency" is a very ungainly name. As far as I can tell, it was coined by Robin Cover, and I'm sure it was quite suitable at the time, but for sure right now it's a bit of a liability in the pursuit that most interests me.

The pursuit is of ways to build on the prodigious success of XML to make truly revolutionary changes in data architecture within and across organizations. Not revolutionary in terms of the technology to be used. In fact, as I said in "No one ever got fired for...", the trick is to finally give age-old and well proven Web architecture more than a peripheral role in enterprise architecture. The revolution would be in the effects on corporate culture that could come from the increased openness and collaboration being fostered in current Web trends.

XML ushered in a small revolution by at least codifying a syntactic basis for general purpose information exchange. A common alphabet for sharing. Much is made of the division between data and documents in XML (more accurate terms have been proposed, including records versus narrative, but I think people understand quite well what's meant by the data/documents divide, and those terms are just fine). The key to XML is that even though it's much more suited to documents, it adapts fairly well to data applications. Technologies born in the data world such as relational and object databases have never been nearly as suitable for document applications, despite shrill claims of relational fundamentalists. XML's syntax mini-revolution means that for once those trying to make enterprise information systems more transparent by consolidating departmental databases into massive stores (call them the data warehouse empire), and those trying to consolidate documents into titanic content management stores (call them the CMS empire) can use the same alphabet (note: the latter group is usually closely allied with those who want to make all that intellectual capital extremely easy to exchange under contract terms. Call them the EDI empire). The common alphabet might not be ideal for any one side at the table, but it's a place to start building interoperability, and along with that the next generation of applications.

All over the place I find in my consulting and speaking that people have embraced XML just to run into the inevitable limitations of its syntactic interoperability and scratch their head wondering OK, what's the next stop on this bus route? People who know how to make money have latched onto the suspense, largely as a way of re-emphasizing the relevance of their traditional products and services, rather than as a way to push for further evolution. A few more idealistic visionaries are pushing such further evolution, some rallying under the banner of the "Semantic Web". I think this term is, unfortunately, tainted. Too much of the 70s AI ambition has been cooked into recent iterations of Semantic Web technologies, and these technologies are completely glazing over the eyes of the folks who really matter: the non-Ph.Ds (for the most part) who generate the vast bodies of public and private documents and data that are to drive the burgeoning information economy.

Some people building on XML are looking for a sort of mindshare arbitrage between the sharp vendors and the polyester hippies, touting sloppy, bottom-up initiatives such as microformats and folksonomies. These are cheap, and don't make the head spin to contemplate, but it seems clear to anyone paying attention that they don't scale as a way to consolidate knowledge any more than the original Web does.

I believe all these forces will offer significant juice to next-generation information systems, and that the debate really is just how the success will be apportioned. As such, we still need an umbrella term for what it means to build on a syntactic foundation by sharing context as well. To start sharing glossaries as well as alphabets. The fate (IMO) of the term "Semantic Web" is instructive. I often joke about the curse of the s-word. It's a joke I picked up from elsewhere (I forget where) to the effect that certain words starting with "s", and "semantic" in particular, are doomed to sound imposing yet impossibly vague. My first thought was: rather than "semantic transparency", how about just "transparency"? The problem is that it's a bit too much of a hijack of the generic. A data architect probably will get the right picture from the term, but we need to communicate to the wider world.

Other ideas that occur to me are:

  • "information transparency"
  • "shared context" or "context sharing"
  • "merged context"
  • "context framing"
  • "Web reference"

The last idea comes from my favorite metaphor for these XML++ technologies: that they are the reference section (plus card catalog) of the library (see "Managing XML libraries"). They are what makes it possible to find, cross-reference and understand all the information in the actual books themselves. I'm still fishing for good terms, and would welcome any suggestions.

[Uche Ogbuji]

via Copia

Binary Predicates in FOPL and at Large Volumes

I've wanted to come back to the issue of the scalability of RDF in a relational model for some time. Earlier, I mentioned a Description Logics (DL) representation technique that would dramatically reduce the amount of space needed for most RDF graphs. I only know of one other RDF store (besides rdflib) that does this. At large volumes, query response time is more susceptible to space efficiency than to pure seek time. At some point along the scale, the amount of time it takes to resolve a query is more directly affected by the size of the knowledge base than by anything else. When you consider the URI lexical grammar, skolemization, seek times, BTrees, and hash tables, even interning (by which I mean the general reliance on uniqueness in crafting identifiers) has little effect on the high-volume metrics of FOPL.

Perhaps something more could be said about the efficiency of DL? I've suggested the possibility of semantic compression (or 'forward consumption', if you think of it as analogous to forward chaining) where what can be implied is never added, or is removed by some intelligent process (perhaps periodically). For example, consider a knowledge base that only stored 'knows' relationships (foaf:knows, perhaps) between people. It would be very redundant to state that two individuals are 'People' (foaf:Person) if they know each other (a 66.6% space savings right there). Couldn't the formality of DL be used to enhance both expressiveness and efficiency? In the same way that invariant representations make our neocortex so much more efficient at logical prediction? If not DL, perhaps at least the formality of a local domain ontology and its rules? I was able to apply the same principle (though not in any formal way you could automate) to improve the speed of a content management knowledge base.
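
A minimal RDFLib sketch of that 'forward consumption' idea, using the foaf:knows example above (the data file is a placeholder):

from rdflib import Graph, Namespace, RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def compress(graph):
    # Drop rdf:type foaf:Person statements that a foaf:knows
    # domain/range rule would re-infer on demand
    implied = set()
    for s, o in graph.subject_objects(FOAF.knows):
        implied.add(s)
        implied.add(o)
    for node in implied:
        graph.remove((node, RDF.type, FOAF.Person))

g = Graph()
g.parse("foaf-data.rdf")   # placeholder data
before = len(g)
compress(g)
print("%d triples, down from %d" % (len(g), before))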

[Uche Ogbuji]

via Copia

Optimizing XML to RDF mappings for Content Management Persistence

I recently refactored the 4Suite repository's persistence layer to make it more responsive with large sets of data. The 4Suite repository's persistence stack – which consists of a set of core APIs for the various first-class resources – is the heart and soul of a framework that leverages XML and RDF in tandem as a platform for content management. Essentially, the changes minimized the number of redundant RDF statements mirrored into the system graph (an RDF graph where provenance statements about resources in the virtual filesystem are persisted) from the metadata XML documents associated with every resource in the repository.

The ability to mirror RDF content from XML documents in a controlled manner is core to the repository and the way it manages its virtual filesystem. This mapping is made possible by a mechanism called document definitions. Document definitions are mappings (persisted as documents in the 4Suite repository) of controlled XML vocabularies into corresponding RDF statements. Every resource has a small 'metadata' XML document associated with it that captures ACL data as well as the system-level provenance typically associated with filesystems.

For example, the metadata document for the root container of the 4Suite instance running on my laptop is:

<?xml version="1.0" encoding="utf-8"?>
<ftss:MetaData 
  xmlns:ftss="http://xmlns.4suite.org/reserved" 
  path="/" 
  document-definition="http://schemas.4suite.org/4ss#xmldocument.null_document_definition"   
  type="http://schemas.4suite.org/4ss#container" creation-date="2006-03-26T00:35:02Z">
  <ftss:Acl>
    <ftss:Access ident="owner" type="execute" allowed="1"/>  
    <ftss:Access ident="world" type="execute" allowed="1"/> 
    <ftss:Access ident="super-users" type="execute" allowed="1"/>  
    <ftss:Access ident="owner" type="read" allowed="1"/>
    <ftss:Access ident="world" type="read" allowed="1"/>    
    <ftss:Access ident="super-users" type="read" allowed="1"/>  
    <ftss:Access ident="owner" type="write user model" allowed="1"/>
    <ftss:Access ident="super-users" type="write user model" allowed="1"/>  
    <ftss:Access ident="owner" type="change permissions" allowed="1"/>  
    <ftss:Access ident="super-users" type="change permissions" allowed="1"/>
    <ftss:Access ident="owner" type="write" allowed="1"/> 
    <ftss:Access ident="super-users" type="write" allowed="1"/> 
    <ftss:Access ident="owner" type="change owner" allowed="1"/> 
    <ftss:Access ident="super-users" type="change owner" allowed="1"/>
    <ftss:Access ident="owner" type="delete" allowed="1"/>
    <ftss:Access ident="super-users" type="delete" allowed="1"/>
  </ftss:Acl>
  <ftss:LastModifiedDate>2006-03-26T00:36:51Z</ftss:LastModifiedDate>
  <ftss:Owner>super-users</ftss:Owner>
  <ftss:Imt>text/xml</ftss:Imt>
  <ftss:Size>419</ftss:Size>
</ftss:MetaData>

Each ftss:Access element under ftss:Acl represents an entry in the ACL associated with the resource the metadata document is describing. All the ACL accesses enforced by the persistence layer are documented here.

Certain metadata are not reflected into RDF, either because they are queried more often than others and require prompt response or because they are never queried separately from the resource they describe. In either case, querying a small XML document (associated with an already identified resource) is much more efficient than dispatching a query against an RDF graph in which statements about every resource in the repository are asserted.

ACLs are an example and are persisted only as XML content. The persistence layer interprets and performs ACL operations against XML content via XPath / XUpdate evaluations.

Prior to the change, all of the other properties embedded in the metadata document (listed below) were being reflected into RDF redundantly and inefficiently:

  • @type
  • @creation-date
  • @document-definition
  • ftss:LastModifiedDate
  • ftss:Imt
  • ftss:Size
  • ftss:Owner
  • ftss:TimeToLive

Not too long ago, I hacked up an OWL ontology describing these system-level RDF statements (and wrote a piece on it).

Most of the inefficiency was due to the fact that a pre-parsed Domlette instance of the metadata document for each resource was already being cached by the persistence layer. However, the corresponding APIs for these properties (getLastModifiedDate, for example) were being implemented as queries against the mirrored RDF content. Modifying these methods to evaluate pre-compiled XPaths against the cached DOM instances proved to be several orders of magnitude more efficient, especially against a repository with a large number of resources in the virtual filesystem.
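
The change boils down to something like this sketch (the method and expression names are illustrative, and I'm assuming 4Suite's Ft.Xml.XPath API):

from Ft.Xml.XPath import Compile, Context

FTSS_NSS = {"ftss": "http://xmlns.4suite.org/reserved"}

# Compiled once, up front
LAST_MODIFIED_XPATH = Compile("string(/ftss:MetaData/ftss:LastModifiedDate)")

def getLastModifiedDate(cachedMetadataDoc):
    # Evaluate against the cached Domlette instead of querying the system RDF graph
    context = Context.Context(cachedMetadataDoc, processorNss=FTSS_NSS)
    return LAST_MODIFIED_XPATH.evaluate(context)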

Of all the above 'properties', only @type (which was being mirrored as rdf:type statements in RDF), @document-definition, and ftss:TimeToLive were being queried independently from the resources they are associated with. For example, the repository periodically monitors the system RDF graph for ftss:TimeToLive statements whose values are less than the current date time (which indicates their TTL has expired). Expired resources cannot be determined by XPath evaluations against metadata XML documents, since XPath is scoped to a specific document by design. If the metadata documents were persisted in a native XML store then the same query could be dispatched (as an XQuery) across all the metadata documents in order to identify those whose TTL had expired. But I digress...

The @document-definition attribute associates the resource (an XML document in this case) with a user-defined mapping (expressed as an XSLT transform or a set of XPath to RDF statement templates) which couples its content with corresponding RDF statements. This presents an interesting scenario: if a document definition changes (document definitions are themselves first-class resources in the repository), then all the documents which refer to it must have their RDF statements re-mapped using the new document definition.

Note that such coupling only works in a controlled, closed system; it isn't possible where such mappings from XML to RDF are uncontrolled (à la GRDDL) and operate in a distributed context.

At any rate, the @document-definition property was yet another example of system metadata that had to be mirrored into the system RDF graph since document definitions need to be identified independently from the resources that register them.

In the end, only the metadata properties that had to be queried in this fashion were mirrored into RDF. I found this refactoring very instructive in identifying some caveats to be aware of when modeling large scale data sets as XML and RDF interchangeably. This very small architectural modification yielded quite a significant performance boost for the 4Suite repository, which (as far as I can tell) is the only content-management system that leverages XML and RDF as interchangeable representation formats in such an integrated fashion.

[Uche Ogbuji]

via Copia

Mi...cro...for...mats...sis...boom...BLAH!

Mike Linksvayer had a nice comment on my recent talk at the Semantic Technology Conference.

I think Uche Ogbuji's Microformats: Partial Bridge from XML to the Semantic Web is the first talk I've heard on that I've heard from a non-cheerleader and was a pretty good introduction to the upsides and downsides of microformats and how can leverage microformats for officious Semantic Web purposes. My opinion is that the value in microformats hype is in encouraging people to take advantage of XHTML semantics in however a conventional in non-rigorous fashion they may. It is a pipe dream to think that most pages containing microformats will include the correct profile references to allow a spec-following crawler to extract much useful data via GRDDL. Add some convention-following heuristics a crawler may get lots of interesting data from microformatted pages. The big search engines are great at tolerating ambiguity and non-conformance, as they must.

Yeah, I'm no cheerleader (or even follower) for Microformats. Certainly I've been skeptical of Microformats here on Copia (1, 2, 3). I think the problem with Microformats is that their value is tied very closely to hype. I think that as long as they're a hot technology they can be a useful technology. I do think, however, that they have very little intrinsic technological value. I guess one could say this about many technologies, but Microformats perhaps annoy me a bit more because, given XML as a base, we could do so much better.

Mike is also right to be skeptical that GRDDL will succeed if, as it presently does, it relies on people putting profile information into Web documents that use Microformats.

My experience at the conference, some very trenchant questions from the audience, a very interesting talk by Ben Adida right after my own, and other matters have got me thinking a lot about Microformats and what those of us whose information consolidation goals are more ambitious might be able to make of them. Watch this space. More to come.

[Uche Ogbuji]

via Copia