Using Amara's pushtree for heavyweight XML processing in GRDDL and SPARQL querying

I’ve been using Amara to address my high throughput needs for Extract Transform Load (ETL), querying, and processing of large amounts of RDF. In one particular part of the larger process, I needed to be able to stream very large XML documents in a particular dialect into RDF/XML. I sent an email to the akara google group describing the challenges and my thoughts behind wanting to use a streaming XML paradigm rather than XSLT.

I basically want to leverage Amara’s pushtree and its use of coroutines as a minimal-overhead pipeline for dispatching events triggered by elements in the source XML, where the source XML is a GRDDL source document and the pushtree coroutine is the transformation property. That task is still a work in progress, in the interest of expedience I went forward and used XSLT but need to try out some of what Uche suggested in the end.

The other part where I have made much more progress is in streaming results to SPARQL queries (against a SPARQL service) into a CSV file via command-line and with minimal overhead (also using Amara, pushtree, and coroutines). A recent set of changes to layercake-python modified the sparqler command-line to add an —endpoint option which takes a SPARQL service URL. Other changes were made to the remote SPARQL service store to support this.

Also added, was a new sparqlcsv script:

$ sparqlcsv --help
Usage: sparqlcsv [options] [SPARQLXMLFilePath]
 -h, --help            show this help message and exit
                       The quote character to use
 -c, --count           Just count the results, do not serialize to CSV
 -d DELIMITER, --delimiter=DELIMITER
                       The delimiter to use
                       The path where to write the resulting CSV file

This script takes a SPARQL XML file either from the file indicated as the first argument or from STDIN if none is specified and writes out a CSV file to STDOUT or to a file. The general architectural idea is to build a bash pipeline from the SPARQL service to a CSV file (and eventually into a relational database for more sophisticated analysis) or to STDOUT for subsequent processing along the pipeline.

So, now I can run a query against Virtuoso and stream the CSV results into a file (with minimal processing overhead):

$ sparqler --owl=..OWL file.. --ns=..prefix..=..URL.. \
           --endpoint=..SPARQL service URL.. \
"SELECT ... { ... }" | sparqlcsv | .. subsequent processong ..

Where the namespaces in the OWL/RDF file (provided by the —owl option) and those given explicitly via the —ns option are added as namespace prefix definitions at the top of the SPARQL query that is dispatched to the remote SPARQL service located via the URL provided to the —endpoint option. Alternatively, the -o option can be used to specify a filename where the CSV content is written to.

The sparqlcsv script uses a pushtree coroutine to stream XML content into a CSV file in this way:

def produce_csv(doc,csvWriter,justCount):
   def receive_nodes(cnt):
       while True:
           node = yield
           if justCount:
               badChars = False
               for binding in node.binding:
                   except UnicodeEncodeError:
                       rt.append(U(binding).encode('ascii', 'ignore'))
                       badChars = True
                       print >> sys.stderr, "Skipping character", U(binding)
               if badChars:
                   cnt.skipCounter += 1
   target = receive_nodes(cnt)
   pushtree(doc, u'result', target.send, entity_factory=entity_base)
   return cnt

Where doc is an XML document (as a string), csvWriter is an instance of the Writer Object, and the last parameter indicates whether or not only the size of the solution sequence is returned rather than the resulting CSV.

GRDDL: A Knowledge Rrepresentation Architectural Style? - part one

Yay. GRDDL is now a W3C Recommendation!

I'm very proud to be a part of that and there is alot about this particular architectural style that I have always wanted to write about. I recently came upon the opportunity to consider one particular facet.

This is why it seems the same as with GRDDL. There are tranformations you can make, but they are not entailments in a logic, they just go from one graph to a different graph.

Yes, that is one part of the larger framework that is well considered. GRDDL does not rely on logical entaiment for its normative definition. It is defined operationally, but can also be described via declarative (formal) semantics. It defines a mapping (not a function in the true sense - the specification clearly identifies ambiguity at the level of the infoset) from an XML representation of an "information resource" to a typed RDF representation of the same "information resource". The output is required to have a well-defined mapping of its own into the RDF abstract syntax.

The less formal definition uses a dialect of Notation 3 that is a bit more expressive than Datalog Logic Programming (it uses function symbols - builtins - in some of the clauses ). The proof at the bottom of that page justifies the assertion that has a GRDDL result which is composed entirely of the following RDF statement:

<> is named "Are You Experienced?" .

Frankly, I would have gone with "Bold as Love", myself =)

Once you have a (mostly) well-defined function for rendering RDF from information resources, you enable the deployment of useful ( and re-usable ) interpretations for intelligent agents (more on these later). For example, the test suite, is a large semantic web of XML documents that GRDDL-aware agents can traverse, performing Quality Assurance tests (using EARL) of their conformance to the operational semantics of GRDDL.

However, it was very important to leave entailment out of the equation until it serves a justifiable purpose. For example, a well-designed RDF querying language does not require logical entailment (RDF, RDFS, OWL, or otherwise) for it to be useful in the general case. You can calculate a closure (or Herbrand base) and then dispatch structural graph queries. This was always true with Versa. You can glean (pun intended) quite a bit from only the structural nature of a Graph. A whole generation of graph theoretical literature demonstrates this.

In addition, once you have a well-defined set of semantics for identifying information resources with RDF assertions that are (logically) relevant to the closure, you have a clear seperation between manipulation of surface syntax and full-blown logical reasoning systems.

It should be considered a semantic web architectural style (if you will) to constrain the use of entailment to only where it has some demonstrated value to the problem space. Where it makes sense to use entailment, however, you will find the representations are well-engineered for the task.

Chimezie Ogbuji

via Copia

A Content Repository API for Rich, Semantic Web Applications?

[by Chimezie Ogbuji]

I've been working with roll-your-own content repositories long enough to know that open standards are long overdue.

The slides for my Semantic Technology Conference 2007 session are up: "Tools for the Next Generation of CMS: XML, RDF, & GRDDL" (Open Office) and (Power point)

This afternoon, I merged documentation of the 4Suite repository from old bits (and some new) into a Wiki that I hope to contribute to (every now and then).
I think there is plenty of mature, supporting material upon which a canon of best practices for XML/RDF CMSes can be written, with normative dependencies on:

  • XProc
  • Architecture of the World Wide Web, Volume One
  • URI RFCs
  • Rich Web Application Backplane
  • XML / XML Base / XML Infoset
  • RDDL
  • XHTML 1.0
  • SPARQL / Versa (RDF querying)
  • XPath 2.0 (JSR 283 restriction) with 1.0 'fallback'
  • HTTP 1.0/1.1, ACL-based HTTP Basic / Digest authentication, and a convention for Web-based XSLT invokation
  • Document/graph-level ACL granularity

The things that are missing:

  • RDF equivalent of DOM Level 3 (transactional, named graphs, connection management, triple patterns, ... ) with events.
  • A mature RIF (there is always SWRL, Notation 3, and DLP) as a framework for SW expert systems (and sentient resource management)
  • A RESTful service description to complement the current WSDL/SOAP one

For a RESTful service description, RDF Forms can be employed to describe transport semantics (to help with Agent autonomy), or a mapping to the Atom Publishing Protocol (and a thus a subset of GData) can be written.

In my session, I emphasized how closely JSR 283 overlaps with the 4Suite Repository API.

The delta between them mostly has to do with RDF, other additional XML processing specifications (XUpdate, XInclude, etc.), ACL-based HTTP authentication (basic, and sessions), HTTP/XSLT bindings, and other miscellaneous bells and whistles

Chimezie Ogbuji

via Copia

Planet Atom's Information Pipeline

The hosting of Planet Atom has moved (perhaps temporarily) over to Athena: Christopher Schmidt's excellent hosting service.
Metacognition is hosted there. In addition, the planetatom source was extended to support additional representational modes for the aggregated Atom feed: GRDDL, RDFa, Atom OWL RDF/XML (via content negotiation), and Exhibit.

The latter was the subject of my presentation at XTech 2007. As I mentioned during my session, you can go to to see the live faceted-browsing of the aggregated Atom feed. An excerpt from the Planet Atom front page describes the nature of the project:

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, John L. Clark, and Chimezie Ogbuji

I wrote previously (in my XML 2006 paper) on this methodology of splicing out multiple (disjoint) representations from a single XML source and the Exhibit mode is yet another facet: specifically for quick, cross-browser, filtering of the aggregated feed.

Planet Atom Pipeline

The Planet Atom pipleline as a whole is pretty interesting. First an XBEL bookmark document is used as the source for aggregation. RESTful caching minimizes load on the sources during aggregation. The aggregation groups the entries by calendar groups (months and days). The final aggregated feed is then sent through several XML pipelines to generate the JSON which drives the Exhibit view, an HTML version of the front page, an XHTML version of the front page (one of the prior two is served to the client depending on the kind of the agent which requested the front page), and an RDF/XML serialization of the feed expressed in Atom OWL.

Note in the diagram that a Microformat approach could have been used instead to embed the Atom OWL statements. RDFa was used instead as it was much easier to encode the statements in a common language and not contend with adding profiles for each Microformat used. Elias's XTech 2007 presentation touched a bit on this contrast between the two approaches. In either case, GRDDL is used to extract the RDF.

These representations are stored statically at the server and served appropriately via a simple CherryPy module As mentioned earlier, the XHTML front page now embeds the Atom OWL assertions about the feed (as well as assertions about the sources, their authors, and the Planetatom developers) in RDFa and includes hooks for a GRDDL-aware Agent to extract a faithful rendition of the feed in RDF/XML. The same XML pipeline which extracts the RDF/XML from the aggregated feed is identified as a GRDDL transform. So, the RDF can either be fetched via content negotiation or by explicit GRDDL processing.

Unfortunately, the RDFa is broken. RDFa can be extracted by an explicit parser (which is how Elias Torrez's Python-based RDFa parser, his recent work on operator, and Ben Adida's RDFa bookmarklets ) or via XSLT as part of a GRDDL mechanism. Due to a quirk in the way RDFa uses namespace declarations (which may or may not be a necessary evil ), the various vocabularies used in the resulting RDF/XML are not properly expanded from the CURIES to their full URI form. I mentioned this quirk to Steven Pemberton.

As it happens, the stylesheet which transforms the aggregated Atom feed into the XHTML host document defines the namespace declarations:


However, since there are not any elements which use QNames formed from these declarations, they are not included in the XSLT result! This trips up the RDF -> RDF/XML transformation (written by Fabien Gandon, a fellow GRDDL WG member) and results in RDF statements where the URIs are simply the CURIEs as originally expressed in RDFa. This seems to only be a problem for XSLT processors which explicitly strip out the unused namespace declarations. They have a right to do this as it has no effect on the underlying infoset. Something for the RDF-in-XHTML task group to consider, especially for scenarios such as this where the XHTML+RDFa is not hand-crafted but produced from an XML pipeline.

[Uche Ogbuji]

via Copia

Atom Feed Semantics

Not a lot of people outside the core Semantic Web community actually want to create RDF, but extracting it from what's already there can be useful for a wide variety of projects. (RSS and Atom are first and relatively easy steps that direction.)

Terminal dump

chimezie@Zion:~/devel/grddl-hg$ python --debug --output-format=n3 --zone=https:--ns=aowl= --ns=iana= --ns=some-blog=
binding foaf to
binding owl to
binding iana to
binding rdfs to
binding wot to
binding dc to
binding aowl to
binding rdf to
binding some-blog to
Attempting a comprehensive glean of
@@ignoring types: ('application/rdf+xml', 'application/xml', 'text/xml', 'application/xhtml+xml', 'text/html')
applying transformation
@@ignoring types: ('application/xml',)
Parsed 22 triples as Notation 3
Attempting a comprehensive glean of

Via atom2turtle_xslt-1.0.xslt and Atom OWL: The GRDDL result document:

@prefix aowl: <>.
@prefix iana: <>.
@prefix rdf: <>.
@prefix some-blog: <>.
[ a aowl:Feed;
     aowl:author [ a aowl:Person;
             aowl:name "John Doe"];
     aowl:entry [ a aowl:Entry;
             aowl:id "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"^^<>;
             aowl:link [ a aowl:Link;
                     aowl:rel iana:alternate;
                     aowl:to [ aowl:src some-blog:atom03]];
             aowl:title "Atom-Powered Robots Run Amok";
             aowl:updated "2003-12-13T18:30:02Z"^^<>];
     aowl:id "urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6"^^<>;
     aowl:link [ a aowl:Link;
             aowl:rel iana:alternate;
             aowl:to [ aowl:src <>]];
     aowl:title "Example Feed";
     aowl:updated "2003-12-13T18:30:02Z"^^<>].

Planet Atom's feed

@prefix : <> .
 @prefix rdf: <> .
 @prefix rdfs: <> .
 @prefix foaf: <> .
 @prefix iana: <> .
 @prefix xsd: <> .
[] a :Feed ;
:id ""^^xsd:anyURI;
:title "Planet Atom" ;
:updated "2006-12-10T06:57:54.166890Z"^^xsd:dateTime;
:generator [ a :Generator;
            :uri <>;
            :generatorVersion "";
            :name """atomixlib"""];
 :entry [  a :Entry;
           :title "The Darfur Wall" ;
           :author [ a :Person; :name "James Tauber"] ;
           :link [ a :Link;
                     :rel iana:alternate ;
                     :to [ :src <>;]          
:updated "2006-12-10T00:13:34Z"^^xsd:dateTime;
:published "2006-12-10T00:13:34Z"^^xsd:dateTime;
:id ""^^xsd:anyURI; ]

[Uche Ogbuji]

via Copia

Progress on two Reference Implementations for RETE and GRDDL

Whew! During the moments I was able to spare while at ISWC (I was only there on monday and tuesday, unfortunately), I finished up two 'reference' implementations I've been enthusiastically hacking on quite recently. By reference implementation, I mean an implementation that attempts to follow a specification verbatim as an exercise to get a better understanding of it.

I still need a write up on the engaging conversations I had at ISWC (it was really an impressive conference, even from just the 2 days worth I got to see) as well as the presentation I gave on using GRDDL and Web architecture to meet the requirements of Computer-based Patient Records.

FuXi and DLP

The first milestone was with FuXi, which I ended up rewriting completely based on some exchanges with Peter Lin and Charles Young.

This has probably been the most challenging piece of software I've ever written and I was quite niave in the beginning about my level of understanding of the nuances of RETE. Anyone interested in the formal intersection of Notation 3 / RDF syntax and the RETE algorithm(s) will find the exchanges in the comments of the above post very instructive - and Peter Lin's blog in general. Though he and I have our differences in the value of mapping RETE to RDF/N3 his insights into my efforts have been very helpful.

In the process, I discovered Robert Doorenbos PhD thesis "Production Matching for Large Learning Systems" which was incredibly valuable in giving me a comprehensive picture of how RETE (and RETE/UL) could be 'ported' to accomodate Notation 3 reasoning.

The primary motivation (which has led to many sleepless nights and late night hackery) is what I see as a lack of understanding within the community of semantic web developers of the limitations of Tableux-based reasoning and the common misconception that the major Tableaux-based reasoners (Fact++, Racer, Pellet, etc..) represent the ceiling of DL reasoning capability.

The reality is that logic programming has been around the block much longer than DL and has much more mature algorithms available (the primary one being RETE). I've invested quite a bit of effort in what I believe will (eventually) demonstrate very large-scale DL reasoning performance that will put Tableaux-based reasoning to shame - if only to make it clear that more investigation into the intersection of LP and DL is crucial for making the dream of the SW a verbatim reality.

Ofcouse, there is a limit to what aspects of DL can be expressed as LP rules (this subset is called Description Logic Programming). The 'original' DLP paper does well to explain this limitation, but I believe this subset represents the more commonly used portions of OWL and the portions of OWL 1.0 (and OWL 1.1 for that matter) left out by such an intersection will not be missed.
Ivan Herman pointed me to a paper by Horst, Herman which is quite comprehensive in outlining how this explicit intersection can be expressed axiomatically and the computational consequences of such an axiomatization. Ivan used this a guide for his RDFSClosure module.

Not enough (IMHO) has been done to explore this intersection because people are comfy with the confines of non-LP algorithms. The trail (currently somewhat cold) left by the Mindswap work on Pychinko needs to be picked up, followed and improved.

So, I rolled up my sleeves, dug deep and did my best to familiarize myself with the nuances of production system optimization. Most of the hard work has already been done, thanks to Robert Doorenbos subsetting (and extension) of the original Charles Forgy algorithm. FuXi, gets through a large majority the OWL tests using a ruleset that closely implements what Horst lays out in his paper and does so with impressive times - even with more optimizations pending.

The most recent changes include a command-line interface for launching it:

chimezie@Zion:~/devel/Fuxi$ python --out=n3 --ruleFacts
Time to build production rule (RDFLib): 0.0172629356384 seconds
Time to calculate closure on working memory: 224.906921387 m seconds

@prefix owl: <>.
@prefix rdf: <>.
@prefix test: <>.

 test:Animal test:in _:XRfZIlKy56,

 test:Cosi_fan_tutte a test:DaPonteOperaOfMozart;
    test:in _:XRfZIlKy47,

 test:Don_Giovanni a test:DaPonteOperaOfMozart;
    test:in _:XRfZIlKy47,

 .... snip ...

FuXi still doesn't support 'built-ins' (or custom comparisons), but luckily the thesis includes a section on how to implement non equality testing of rule constraints that should be (relatively) easy to add. The theseis also includes a section on how negated conditions can be implemented (which is probably the most glaring axiom missing from DLP). Finally, Robert's paper includes a clever mechanism for hashing the Alpha network that has yet to be implemented (and is simple enough to implement) that should contribute significant performance gains.

There are other pleasant surprises with the current codebase. The rule compiler can be used to identify inefficencies in rule patterns, the command-line program can be used to serialize the closure delta (i.e., only the triples inferred from the ruleset and facts), and (my favorite) a Notation 3 ruleset can be exported as a graphviz diagram in order to visualize the rule network. Having 'browsed' various rule-sets in this way, I must say it helps in understanding the nuances of optimization when you can see the discrimination network that the triples are propagated through.

I don't have a permanent home for FuXi yet but have applied for a sourceforge project (especially since SF now supports SVN, apparently). So, until then, FuXi can be downloaded from:

GRDDL Client for 4Suite and RDFLib

During the same period, I've also been working on a 'reference' implementation of GRDDL (per the recently released Working Draft) for 4Suite and RDFLib. It's a bit ironic in doing so, since the 4Suite repository framework has essentially been using GRDDL-like Content Management Systems mechanisms since its inception (sometime in 2001).
However, I thought doing so would be the perfect oppurtunity to:

  • Demonstrate how 4Suite can be used with RDFLib (as a prep to the pending deprecation of 4Suite RDF for RDFLib)
  • Build a framework to compose illustrative test cases for the Working Group (of which I'm a member)
  • As a way to completely familiarize myself with the various GRDDL mechanisms

I posted this to the Working Group mailing list and plan to continue working on it. In particular, the nice thing about the convergence of these two projects of mine is that I've been able to think a bit about how both GRDDL and Fuxi could be used to implement efficient, opportunistic programs that extract RDF via GRDDL and explicit links (rdf:seeAlso, owl:imports, rdfs:isDefinedBy) and perform incremental web closures by adding the triples discovered in this way one at a time to a live RETE network.

The RETE algorithm is tailored specifically to behave as a black box to changes to a production system and so crawling the web, extracting RDF (a triple at a time) and reasoning over it in this way (the ultimate semantic web scenario) becomes a very real scenario with sub-second response times to boot. At the very least, reasoning should cease to be much of a bottleneck compared to the actual dereferencing and parsing of RDF from distributed locations. Very exciting prospects. Watch this space..

Chimezie Ogbuji

via Copia

Parsing RDF from XSLT Prospectively

4Suite repository Document Definitions can now support both XML and text-based serialization of RDF. Document Definitions essentially facilitate database replication of XML to RDF (within a content management system that persists both). The mechanism is similar to transactional data replication in database management systems where modifications to a table triggers the replication. Previously, they were only expected to output to RDF/XML - which has well-known issues.

Now, the repository persistence driver attempts to parse the resulting RDF syntax based on the XSLT output method. This allows for a hueristic to prospectively attempt to accomodate non-XML syntax (such as Notation 3 - the only substantial RDF text-based syntax) as well as RDF/XML (and even TriX).

The main advantage for these syntax alternatives is a faster, more efficient parse time in addition to more human readable syntax (especially for data that was meant to be expressed in this way). This switching off the xsl:output method is analagous to switching off HTTP header content-type values for remote RDF graphs (where the parsing is also a bottleneck).

Ofcourse, 4Suite's aging RDF library doesn't properly perist N3 formulae (which are logic syntactic sugar specific to Notation 3) from the parser it uses.

Imagine using a Document Definition to, say, replicate SWRL's XML syntax into Notation 3's implication syntax for a logic programming database:

if x1 hasParent x2, x2 hasSibling x3, and x3 hasSex male, then x1 hasUncle x3


  <ruleml:_rlab ruleml:href="#example1"/>
    <swrlx:individualPropertyAtom  swrlx:property="hasParent"> 
    <swrlx:individualPropertyAtom  swrlx:property="hasBrother"> 
    <swrlx:individualPropertyAtom  swrlx:property="hasUncle"> 

Notation 3 Rule

{ ?x1 :hasParent ?x2; ?x2 :hasSibling ?x3; ?X3 :hasSex :Male } => { ?x1 :hasUncle ?x3 }.

Now imagine using GRDDL to publish a common set of rules as SWRL, with a profile to transform them to Notation 3 for scutters that understand.

Chimezie Ogbuji

via Copia

What does GRDDL have to do with Intelligent Agents?

GRDDL. What is it? Why the long name? It does something very specific that requires a long name to describe it. Etymology of biological names includes examples of the same phenomenon in a different discipline. I starting writing on this weblog mainly as a way to regularly excercise my literary expression, so (to that end) I'm going to try to explain GRDDL in as few words as I can while simultaneously embelishing.

It is a language (or dialect) translator. It Gleans (gathers or harvests) Resource Descriptions. Resource Descriptions can be thought to refer to the use of constructs in Knowledge Representation. These constructs are often used to make assertions about things in sentence form - from which additional knowledge can be infered. However, it is also the 'Resource Description' in RDF (no coincidence there). RDF is the target dialect. GRDDL acts as an intelligent agent (more on this later) that performs translations from specific (XML) vocabularies, or Dialects of Languages to abstract RDF syntax.

Various languages can be used but there is a natural emphasis on a language (XSLT) with a native ability to process XML.

GRDDL is an XML & RDF formalism in what I think is a hidden pearl of web architecture: a well-engineered environment for distributed processing by intelligent agents. It's primarily the well-engineered nature of web architecture that lends the neccessary autonomy that intelligent agents require. Though hidden, there is much relevance with contemporaries, predecessors, and distant cousins:

It earns its keep mostly with small, well-designed XML formats. As a host language for XSLT it sets out to be (perhaps) a bridge across the great blue and red divide of XML & RDF. To quote a common parlance: watch this space.


Chimezie Ogbuji]

via Copia

Optimizing XML to RDF mappings for Content Management Persistence

I recently re-factored the 4Suite repository's persistent layer for the purpose of making it more responsive to large sets of data. The 4Suite repository's persistence stack – which consists of a set of core APIs for the various first class resources - is the heart and sole of a framework that leverages XML and RDF in tandem as a platform for content management. Essentially, the changes minimized the amount of redundant RDF statements mirrored into the system graph (an RDF graph where provenance statements about resources in the virtual filesystem are persisted) from the metadata XML documents associated with every resource in the repository.

The ability to mirror RDF content from XML documents in a controlled manner is core to the repository and the way it manages it's virtual filesystem. This mapping is made possible by a mechanism called document definitions. Document definitions are mappings (persisted as documents in the 4Suite repository) of controlled XML vocabularies into corresponding RDF statements. Every resource has a small 'metadata' XML document associated with it that captures ACL data as well as system-level provenance typically associated with filesystems.

For example, the metadata document for the root container of the 4Suite instance running on my laptop is:

<?xml version="1.0" encoding="utf-8"?>
  type="" creation-date="2006-03-26T00:35:02Z">
    <ftss:Access ident="owner" type="execute" allowed="1"/>  
    <ftss:Access ident="world" type="execute" allowed="1"/> 
    <ftss:Access ident="super-users" type="execute" allowed="1"/>  
    <ftss:Access ident="owner" type="read" allowed="1"/>
    <ftss:Access ident="world" type="read" allowed="1"/>    
    <ftss:Access ident="super-users" type="read" allowed="1"/>  
    <ftss:Access ident="owner" type="write user model" allowed="1"/>
    <ftss:Access ident="super-users" type="write user model" allowed="1"/>  
    <ftss:Access ident="owner" type="change permissions" allowed="1"/>  
    <ftss:Access ident="super-users" type="change permissions" allowed="1"/>
    <ftss:Access ident="owner" type="write" allowed="1"/> 
    <ftss:Access ident="super-users" type="write" allowed="1"/> 
    <ftss:Access ident="owner" type="change owner" allowed="1"/> 
    <ftss:Access ident="super-users" type="change owner" allowed="1"/>
    <ftss:Access ident="owner" type="delete" allowed="1"/>
    <ftss:Access ident="super-users" type="delete" allowed="1"/>

Each ftss:Access element under ftss;Acl represents an entry in the ACL associated with the resource the metadata document is describing. All the ACL accesses enforced by the persistence layer are documented here.

Certain metadata are not reflected into RDF, either because they are queried more often than others and require prompt response or because they are never queried separately from the resource they describe. In either case, querying a small-sized XML document (associated with an already identified resource) is much more efficient than dispatching a query against an RDF graph in which statements about every resource in the repository are assserted.

ACLs are an example and are persisted only as XML content. The persistence layer interprets and performs ACL operations against XML content via XPath / Xupdate evaluations.

Prior to the change, all of the other properties embedded in the metadata document (listed below) were being reflected into RDF redundantly and inefficiently:

  • @type
  • @creation-date
  • @document-definition
  • ftss:LastModifiedDate
  • ftss:Imt
  • ftss:Size
  • ftss:Owner
  • ftss:TimeToLive

Not too long ago, I hacked (and wrote a piece on it) up an OWL ontology describing these system-level RDF statements.

Most of the inefficiency was due to the fact that a pre-parsed Domlette instance of the metadata document for each resource was already being cached by the persistence layer. However the corresponding APIs for these properties (getLastModifiedDate, for example) were being implemented as queries against the mirrored RDF content. Modifying these methods to evaluate pre-compiled XPaths against the cached DOM instances proved to be several orders of magnitudes more efficient, especially against a repository with a large number of resources in the virtual filesystem.

Of all the above 'properties', only @type (which was being mirrored as rdf:type statemements in RDF), @document-definition, and ftss:TimeToLive were being queried independently from the resources they are associated with. For example, the repository periodically monitors the system RDF graph for ftss:TimeToLive statements whose values are less than the current date time (which indicates their TTL has expired). Expired resources can not be determined by XPath evaluations against metadata XML documents, since XPath is scoped to a specific document by design. If the metadata documents were persisted in a native XML store then the same queries could be dispatched (as an XQuery) across all the metadata documents in order to identify those whose TTL had expired. But I digress...

The @document-defintion attribute associates the resource (an XML document in this case) with a user-defined mapping (expressed as an XSLT transform or a set of XPath to RDF statement templates) which couples it's content with corresponding RDF statements. This presents a interesting scenario where if a document definition changes (document definitions are themselves first-class resources in the repository), then all the documents which refer to it must have their RDF statements re-mapped using the new document definition.

Note, such coupling only works in a controlled, closed system and isn't possible where such mappings from XML to RDF are uncontrolled (ala GRDDL) and work in a distributed context.

At any rate, the @document-definition property was yet another example of system metadata that had to be mirrored into the system RDF graph since document definitions need to be identified independently from the resources that register them.

In the end, only the metadata properties that had to be queried in this fashion were mirrored into RDF. I found this refactoring very instructive in identifying some caveats to be aware of when modeling large scale data sets as XML and RDF interchangeably. This very small architectural modification yielded quite a significant performance boost for the 4Suite repository, which (as far as I can tell) is the only content-management system that leverages XML and RDF as interchangeable representation formats in such an integrated fashion.

[Uche Ogbuji]

via Copia

Extracting RDF from XML in 'Closed' vs 'Open Systems'

For some time, I had wanted to write a bit about 4Suite's Document Definitions - especially after first reading about the concept of Gleaning Resource Descriptions from Dialects of Languages (GRDDL). You see, the idea isn't so novel to me since I've been involved in 4Suite development for some time and familiar with the concept of a Document Definition. Unfortunately, 4Suite's Achilles heel is documentation (no pun intended), but I've managed to find a representative thread on the subject within the mailing list archives. In addition, I also included a decent definition (by Mike Brown) from his overview of the repository:

A DocumentDefinition is a resource that describes how to derive RDF statements from the XML -- deserialization guidelines, basically. Its content can either be XML or XSLT that follows certain guidelines. When the XmlDocument that is associated with this docdef is created, updated, or deleted, RDF statements will be updated automatically in the user model. This is really powerful, and is described in more detail here (free registration required). As an example, if the XML doc is XHTML, then you could write a docdef to generate a Dublin Core 'title' RDF statement from the /html/head/title element. Anytime the XML doc is updated, the RDF statements derived from it via the docdef will also be updated. These statements, being automatically managed, are stored in the "system" model, but there has been some discussion as to whether that is appropriate and how it might change in the future. Only one docdef can be associated with a document, but docdefs can import definitions from one another, if needed

The primary difference between GRDDL (as I understand the principle) and Document Definitions is that GRDDL is an attempt to provide a mechanism for extracting RDF from microformats (subsets of XHTML) 'in the wild.' The XML content transformed (via XSLT) is often embedded within presentation markup and perhaps constructed w/ little regard to validity (with respect to a governing schema). The value is in being able to harvest RDF content from sources designed with more human readability than machine readability in mind. The sheer number of such documents is a multiplicative factor to how much useful information can be extracted.

Document Definitions on the other hand are meant to work in a closed system where the XML vocabulary is self-contained and most often valid (with respect to a well known format) as well as well-formed (the requirement common to both scenarios). The different contexts are very significant and describe two completely divergent approaches to applying RDF to solve Knowledge Management problems.

There are some well known advantages to writing XML->RDF transforms for closed vocabularies / systems (portability, easing the RDF/XML serialization learning curve,etc..) and there are some that not as well known (IMHO). In particular, writing transforms for closed vocabularies essentially allows the XML vocabulary to behave as a communication medium between systems that 'speak XML' and an RDF datastore.

Consider Bill de hOra's issues with binding forms (HTML in his case) to RDF via the RDF/XML syntax. This is an irresolvable disaster and the culprit is the violent impedance mismatch between the XML and RDF data structures that manifests itself in the well documented horrors of RDF/XML as a persistent representation of an RDF graph.

Consider a more elegant architecture: Building an XForms UI on top of XML instances (associated with - but not necessarily validated by - a schema) and automatically transposed (by a transform written once) to a corresponding RDF graph. The strengths of both data formats are emphasized in this scenario and the impedance mismatch is completely resolved by pushing the onus from forms authoring to a well designed transform (written once only).

[Uche Ogbuji]

via Copia