Optimizing XML to RDF mappings for Content Management Persistence

I recently refactored the 4Suite repository's persistence layer to make it more responsive to large sets of data. The 4Suite repository's persistence stack, which consists of a set of core APIs for the various first-class resources, is the heart and sole of a framework that leverages XML and RDF in tandem as a platform for content management. Essentially, the changes minimized the number of redundant RDF statements mirrored into the system graph (an RDF graph where provenance statements about resources in the virtual filesystem are persisted) from the metadata XML documents associated with every resource in the repository.

The ability to mirror RDF content from XML documents in a controlled manner is core to the repository and the way it manages its virtual filesystem. This mapping is made possible by a mechanism called document definitions. Document definitions are mappings (persisted as documents in the 4Suite repository) of controlled XML vocabularies into corresponding RDF statements. Every resource has a small 'metadata' XML document associated with it that captures ACL data as well as the system-level provenance typically associated with filesystems.

For example, the metadata document for the root container of the 4Suite instance running on my laptop is:

<?xml version="1.0" encoding="utf-8"?>
<ftss:MetaData 
  xmlns:ftss="http://xmlns.4suite.org/reserved" 
  path="/" 
  document-definition="http://schemas.4suite.org/4ss#xmldocument.null_document_definition"   
  type="http://schemas.4suite.org/4ss#container" creation-date="2006-03-26T00:35:02Z">
  <ftss:Acl>
    <ftss:Access ident="owner" type="execute" allowed="1"/>  
    <ftss:Access ident="world" type="execute" allowed="1"/> 
    <ftss:Access ident="super-users" type="execute" allowed="1"/>  
    <ftss:Access ident="owner" type="read" allowed="1"/>
    <ftss:Access ident="world" type="read" allowed="1"/>    
    <ftss:Access ident="super-users" type="read" allowed="1"/>  
    <ftss:Access ident="owner" type="write user model" allowed="1"/>
    <ftss:Access ident="super-users" type="write user model" allowed="1"/>  
    <ftss:Access ident="owner" type="change permissions" allowed="1"/>  
    <ftss:Access ident="super-users" type="change permissions" allowed="1"/>
    <ftss:Access ident="owner" type="write" allowed="1"/> 
    <ftss:Access ident="super-users" type="write" allowed="1"/> 
    <ftss:Access ident="owner" type="change owner" allowed="1"/> 
    <ftss:Access ident="super-users" type="change owner" allowed="1"/>
    <ftss:Access ident="owner" type="delete" allowed="1"/>
    <ftss:Access ident="super-users" type="delete" allowed="1"/>
  </ftss:Acl>
  <ftss:LastModifiedDate>2006-03-26T00:36:51Z</ftss:LastModifiedDate>
  <ftss:Owner>super-users</ftss:Owner>
  <ftss:Imt>text/xml</ftss:Imt>
  <ftss:Size>419</ftss:Size>
</ftss:MetaData>

Each ftss:Access element under ftss:Acl represents an entry in the ACL associated with the resource the metadata document is describing. All the ACL accesses enforced by the persistence layer are documented here.

Certain metadata are not reflected into RDF, either because they are queried more often than others and require a prompt response or because they are never queried separately from the resource they describe. In either case, querying a small XML document (associated with an already identified resource) is much more efficient than dispatching a query against an RDF graph in which statements about every resource in the repository are asserted.

ACLs are one example: they are persisted only as XML content. The persistence layer interprets and performs ACL operations against XML content via XPath / XUpdate evaluations.
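As a rough illustration of an ACL check evaluated against the metadata XML alone, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree and its limited XPath subset as a stand-in for 4Suite's own XPath engine, and the is_allowed helper is a hypothetical name, not part of the 4Suite API.

```python
import xml.etree.ElementTree as ET

FTSS_NS = {"ftss": "http://xmlns.4suite.org/reserved"}

# Trimmed-down copy of the metadata document shown above.
METADATA_XML = """\
<ftss:MetaData xmlns:ftss="http://xmlns.4suite.org/reserved" path="/">
  <ftss:Acl>
    <ftss:Access ident="owner" type="read" allowed="1"/>
    <ftss:Access ident="world" type="read" allowed="1"/>
    <ftss:Access ident="owner" type="write" allowed="1"/>
    <ftss:Access ident="super-users" type="write" allowed="1"/>
  </ftss:Acl>
</ftss:MetaData>
"""


def is_allowed(metadata_root, ident, access_type):
    """Return True if an ACL entry grants access_type to ident.

    Hypothetical helper: the values are interpolated into the path
    directly, so it assumes they contain no single quotes.
    """
    path = "ftss:Acl/ftss:Access[@ident='%s'][@type='%s']" % (ident, access_type)
    return any(
        entry.get("allowed") == "1"
        for entry in metadata_root.findall(path, FTSS_NS)
    )


root = ET.fromstring(METADATA_XML)
print(is_allowed(root, "world", "read"))
print(is_allowed(root, "world", "write"))
```

The point of the sketch is only that the check touches a single small document and never consults the system RDF graph.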

Prior to the change, all of the other properties embedded in the metadata document (listed below) were being reflected into RDF redundantly and inefficiently:

@type
@creation-date
@document-definition
ftss:LastModifiedDate
ftss:Imt
ftss:Size
ftss:Owner
ftss:TimeToLive

Not too long ago, I hacked up an OWL ontology describing these system-level RDF statements (and wrote a piece on it).

Most of the inefficiency stemmed from a mismatch: a pre-parsed Domlette instance of the metadata document for each resource was already being cached by the persistence layer, yet the corresponding APIs for these properties (getLastModifiedDate, for example) were implemented as queries against the mirrored RDF content. Modifying these methods to evaluate pre-compiled XPaths against the cached DOM instances proved to be several orders of magnitude more efficient, especially against a repository with a large number of resources in the virtual filesystem.
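The shape of that change can be sketched as follows. This is an assumption-laden illustration rather than the actual 4Suite code: it uses the standard xml.etree.ElementTree instead of Domlette, the MetadataCache class and get_last_modified_date function are hypothetical names, and because ElementTree has no separate compiled XPath object, a module-level path constant stands in for the pre-compiled expression.

```python
import xml.etree.ElementTree as ET

FTSS_NS = {"ftss": "http://xmlns.4suite.org/reserved"}

# Fixed once at import time; stands in for a pre-compiled XPath.
LAST_MODIFIED_PATH = "ftss:LastModifiedDate"


class MetadataCache:
    """Hypothetical cache of parsed metadata documents, keyed by path."""

    def __init__(self, load_xml):
        self._load_xml = load_xml   # callable: repository path -> XML text
        self._trees = {}
        self.parse_count = 0        # how many times a document was parsed

    def tree_for(self, path):
        # Parse on first access only; later calls reuse the cached tree.
        if path not in self._trees:
            self._trees[path] = ET.fromstring(self._load_xml(path))
            self.parse_count += 1
        return self._trees[path]


def get_last_modified_date(cache, path):
    """Read the property from the cached tree, with no RDF query involved."""
    return cache.tree_for(path).findtext(LAST_MODIFIED_PATH, namespaces=FTSS_NS)


# A one-resource stand-in for the repository's metadata storage.
STORE = {
    "/": """<ftss:MetaData xmlns:ftss="http://xmlns.4suite.org/reserved" path="/">
  <ftss:LastModifiedDate>2006-03-26T00:36:51Z</ftss:LastModifiedDate>
  <ftss:Size>419</ftss:Size>
</ftss:MetaData>""",
}

cache = MetadataCache(STORE.__getitem__)
first = get_last_modified_date(cache, "/")
second = get_last_modified_date(cache, "/")
print(first, cache.parse_count)
```

Both lookups are answered from the same parsed tree, which is the property the refactoring relies on.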

Of all the above 'properties', only @type (which was being mirrored as rdf:type statements in RDF), @document-definition, and ftss:TimeToLive were being queried independently of the resources they are associated with. For example, the repository periodically monitors the system RDF graph for ftss:TimeToLive statements whose values are earlier than the current date and time (which indicates that their TTL has expired). Expired resources cannot be determined by XPath evaluations against metadata XML documents, since XPath is scoped to a specific document by design. If the metadata documents were persisted in a native XML store, then the same queries could be dispatched (as an XQuery) across all the metadata documents in order to identify those whose TTL had expired. But I digress...
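To make the contrast concrete, here is a minimal sketch of the kind of cross-resource query that needs the graph. The graph is modelled as a plain set of (subject, predicate, object) tuples rather than a real RDF store, the resource paths are made up for illustration, and "ftss:TimeToLive" and "rdf:type" are used as shorthand predicate names, not the actual URIs.

```python
from datetime import datetime

TIME_TO_LIVE = "ftss:TimeToLive"  # shorthand predicate name for this sketch

# Toy system graph: statements about every resource live in one place.
SYSTEM_GRAPH = {
    ("/tmp/session-1", TIME_TO_LIVE, "2006-03-26T00:00:00Z"),
    ("/tmp/session-2", TIME_TO_LIVE, "2006-04-01T00:00:00Z"),
    ("/", "rdf:type", "http://schemas.4suite.org/4ss#container"),
}


def parse_utc(stamp):
    """Parse timestamps written like 2006-03-26T00:35:02Z (naive UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def expired_resources(graph, now):
    """Return the subjects whose TimeToLive value is earlier than now."""
    return sorted(
        subject
        for (subject, predicate, value) in graph
        if predicate == TIME_TO_LIVE and parse_utc(value) < now
    )


print(expired_resources(SYSTEM_GRAPH, datetime(2006, 3, 27)))
```

No resource has to be identified up front, which is exactly what a per-document XPath evaluation cannot offer.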

The @document-definition attribute associates the resource (an XML document in this case) with a user-defined mapping (expressed as an XSLT transform or a set of XPath to RDF statement templates) which couples its content with corresponding RDF statements. This presents an interesting scenario: if a document definition changes (document definitions are themselves first-class resources in the repository), then all the documents that refer to it must have their RDF statements re-mapped using the new document definition.

Note, such coupling only works in a controlled, closed system and isn't possible where such mappings from XML to RDF are uncontrolled (à la GRDDL) and work in a distributed context.

At any rate, the @document-definition property was yet another example of system metadata that had to be mirrored into the system RDF graph, since document definitions need to be looked up independently of the resources that register them.
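A minimal sketch of that reverse lookup and re-mapping step, under several assumptions: the system graph is a plain set of tuples, a document definition is reduced to a list of (element path, predicate) templates, and every name here (DOC_DEF, documents_using, apply_definition, remap, the sample paths and predicates) is hypothetical rather than taken from 4Suite.

```python
import xml.etree.ElementTree as ET

DOC_DEF = "document-definition"  # shorthand predicate name for this sketch

# Mirrored system statements: which document registers which definition.
system_graph = {
    ("/docs/a.xml", DOC_DEF, "/defs/article"),
    ("/docs/b.xml", DOC_DEF, "/defs/article"),
    ("/docs/c.xml", DOC_DEF, "/defs/other"),
}

documents = {
    "/docs/a.xml": "<article><title>Alpha</title></article>",
    "/docs/b.xml": "<article><title>Beta</title></article>",
    "/docs/c.xml": "<note><title>Gamma</title></note>",
}


def documents_using(graph, definition):
    """Reverse lookup: find every document that registers the definition."""
    return sorted(s for (s, p, o) in graph if p == DOC_DEF and o == definition)


def apply_definition(templates, path, xml_text):
    """Turn each (element path, predicate) template into RDF-like triples."""
    root = ET.fromstring(xml_text)
    return {
        (path, predicate, node.text)
        for element_path, predicate in templates
        for node in root.findall(element_path)
    }


def remap(definition, templates, mirrored):
    """Regenerate the mirrored statements of every affected document."""
    for path in documents_using(system_graph, definition):
        mirrored[path] = apply_definition(templates, path, documents[path])


mirrored = {}
remap("/defs/article", [("title", "dc:title")], mirrored)
# The definition changes: the same documents are re-mapped in one pass.
remap("/defs/article", [("title", "rdfs:label")], mirrored)
print(sorted(mirrored))
```

Without the mirrored document-definition statements, finding the affected documents would mean opening every metadata document in the repository.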

In the end, only the metadata properties that had to be queried in this fashion were mirrored into RDF. I found this refactoring very instructive in identifying some caveats to be aware of when modeling large-scale data sets as XML and RDF interchangeably. This very small architectural modification yielded quite a significant performance boost for the 4Suite repository, which (as far as I can tell) is the only content-management system that leverages XML and RDF as interchangeable representation formats in such an integrated fashion.

[Uche Ogbuji]

via Copia

2 responses

Hmm. "heart and sole" is either quite the pun or quite the malapropism :-)

More seriously, you mentioned some really cool numbers to me over the phone about the scalability consequences of this patch. I know you're still working to verify and codify numbers, but I do hope you can share soon.

One question: is the para starting with "Note, such coupling" supposed to be a quote? If so, where from? If not, just remove the leading "> ".

--Uche

— uche

Absolutely, I'm currently working on a large data set and hope to get some numbers on scale. I had also promised in an earlier post to do the same about how well the current rdflib SQL framework scales (now that RDF statements are partitioned and simply stored in one monolithic table) - I just haven't been able to find the time.

I wanted to emphasize the "Note, such coupling"... and probably shouldn't have written it as a quote.

— Chimezie Ogbuji