Triclops gets a facelift, new query management capabilities, and new APIs

by Chimezie Ogbuji

I recently had a need to manage a set of queries against an OWL2 EL biomedical ontology: the Foundational Model of Anatomy. I have an open source SPARQL service implementation that I had some thoughts about extending with support for managing queries. It’s called Triclops and is part of a collection of RDF libraries and tools I have been accumulating. The name is a reference to an initial attempt to build an RDF querying and navigation interface as part of the 4Suite repository back in the day (circa 2002).

This later evolved into a very rudimentary web interface that sat in front of the Oracle 11g and MySQL/SPARQL patient dataset that Cyc’s SKSI interacted with. This was part of an interface tailored to the task of identifying patient cohorts, known as the Semantic Research Assistant (SRA). A user could dispatch handwritten SPARQL queries, browse clickable results, or return them as CSV files. This capability was only used by informaticians familiar with the structure of the RDF dataset, while most investigators used the SRA.

It also implemented a RESTful protocol for ticket-based querying that was used for stopping long-running SPARQL/MySQL queries. This is not currently documented. Around the time this was committed as an Apache-licensed Google Code library, layercake-python added core support for APIs that treat remote SPARQL services as local Graph objects, as well as general support for connecting to SPARQL services. This was based on Ivan Herman’s excellent SPARQL Endpoint interface to Python.

Triclops (as described in the wiki) can now be configured as a “Proxy SPARQL Endpoint”. It can be deployed as a lightweight query dispatch, management, and mediation kiosk for remote and local RDF datasets. The former capability (dispatching) was already in place; the latter (mediation) can be performed using FuXi’s recent capabilities in this regard.

Specifically, FuXi includes an rdflib Store that uses its sideways-information passing (sip) strategies and the in-memory SPARQL algebra implementation as a general-purpose framework for semweb SPARQL (OWL2-RL/RIF/N3) entailment regimes. Queries are mediated over the SPARQL protocol using global schemas captured as various kinds of semweb ontology artifacts (expressed in a simple Horn form) that describe and distinguish their predicates between those instantiated in a database (or factbase) and those derived via the semantic properties of these artifacts.

So the primary capability that remained was query management, and this recent itch was scratched over the holidays. I discovered that CodeMirror, a JavaScript library that can be used to create a relatively powerful editor interface for code, had excellent support for SPARQL. I integrated it into Triclops as an interface for managing SPARQL queries and their results. I have a running version of this at http://metacognition.info/sparql/queryMgr. Note, the service is liable to break at any point as Webfaction kills off processes that use up a lot of CPU, and I have yet to figure out how to restart the service automatically when it dies in that fashion.

The dataset this interface manages queries for is a semantic web of content comprising three of the primary ancient Chinese classical texts (the Analects, the Doctrine of the Mean, and the Tao Te Ching). I record the information in RDF because it is an intuitive knowledge representation to use in capturing provenance, exposition, and other editorial metadata. Below is a screen shot of the main page listing a handful of queries: their names, last modification dates, dates of last run, and the number of solutions in the most recent result.

Main SPARQL service page

Above the list is a syntax-highlighted text area for dispatching ad hoc SPARQL queries. This is where CodeMirror is integrated. If I click on the name of the query titled “Query for Analects and the Doctrine of the Mean english chapter text (Confucius)”, I go to a similar screen with another text area whose content corresponds to the text of the query (see the screen shot below).

Main SPARQL service page

From here queries can be updated (by submitting updated CodeMirror content) or cloned (using the name field for the new copy). Alternatively, the results of previous queries can be rendered. This sends back a result document with an XSLT processing instruction that causes the browser to fetch a stylesheet and render an XHTML view of the result document on the client side.
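
As a rough sketch of that mechanism (the function name and stylesheet URL below are made up, not Triclops internals), the server only has to prepend a standard xml-stylesheet processing instruction to the SPARQL XML result before returning it:

def add_stylesheet_pi(sparql_result_xml, xslt_url="/sparql/xslt/results-to-xhtml"):
    """Prepend an xml-stylesheet processing instruction so the browser fetches
    the server-generated XSLT and renders the result document as XHTML."""
    pi = '<?xml-stylesheet type="text/xsl" href="%s"?>\n' % xslt_url
    if sparql_result_xml.startswith("<?xml"):
        decl, sep, rest = sparql_result_xml.partition("?>")
        # Keep the XML declaration first, then the stylesheet PI
        return decl + sep + "\n" + pi + rest
    return pi + sparql_result_xml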

Finally, a query can be re-executed against a dataset, saving the results and causing the information in the first screen to show different values for the last execution run (date and number of solutions). Results can also be saved or viewed as CSV using a different stylesheet against the result document.

The last capability added is a rudimentary template system where any variable or text string of the form ‘$ …. $’ in the query is replaced with a provided string or URI. So, I can change the pick list value on the second row of the form controls to $searchExpression$ and type “water”. This produces a SPARQL query (visible with syntax highlighting via CodeMirror) that can be used as a template to dispatch queries against the dataset.
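
A minimal sketch of that substitution (the regular expression and function are illustrative, not the actual Triclops code):

import re

TEMPLATE_TOKEN = re.compile(r'\$([^$]+)\$')

def expand_template(query_text, bindings):
    """Replace $name$ tokens with a supplied string or URI; URIs are wrapped
    in angle brackets and strings in quotes so the result stays valid SPARQL."""
    def substitute(match):
        name = match.group(1).strip()
        value, is_uri = bindings[name]
        return "<%s>" % value if is_uri else '"%s"' % value
    return TEMPLATE_TOKEN.sub(substitute, query_text)

# e.g. expand_template('... FILTER REGEX(?text, $searchExpression$, "i") ...',
#                      {'searchExpression': ('water', False)})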

In addition, solutions for a particular variable can be used as links, providing a small framework for configurable navigation workflows. For example, I enter “[Ww]ater” in the field next to $searchExpression$, select classic from the pick list at the top of the Result navigation template area, pick “Assertions in a (named) RDF graph” from the next pick list, and enter the graphIRI variable in the subsequent text input.

Triggering this form submission will produce the result screen pictured below. As specified in the form, clicking any of the dbpedia links for the Doctrine of the Mean will initiate the invocation of the query titled “Assertions in a (named) RDF graph”, shown below (with the graphIRI variable pre-populated with the corresponding URI):

SELECT DISTINCT ?s ?p ?o where {
    GRAPH ?graphIRI {
      ?s ?p ?o
    }
}

Main SPARQL service page

The result of such an action is shown in the screen shot. Alternatively, a different subsequent query can be used: “Statements about a resource”. The relationship between the schema of a dataset and the factbase can be navigated in a similar way. Pick the query titled “Classes in dataset” and make the following modifications: select “Instances of a class and graph that the statements are asserted in” from the middle pick list of the Result navigation template section, and enter ?class in the text field to its right. Executing this query (via ‘Execute..’) results in a clickable result set composed of classes of resources; clicking any such link shows the instances of that class.

Main SPARQL service page

This latter form of navigation seems well suited for exploring datasets for which either there is no schema information in the service or it is not well known by the investigator writing the queries.

In developing this interface, at least two architectural principles were reused from my SemanticDB development days: the use of XSLT on the client side to build rich, offloaded (X)HTML applications and the use of the filesystem (rather than a relational database) for managing XML documents. The latter (use of a filesystem) is particularly relevant where querying across the documents is not a major requirement or even a requirement at all. The former is achieved by setting the processing instruction of a result document to refer to a dynamically generated XSLT document on the server.

The XSLT creates a tabular, row-distinguishing interface where links in certain columns trigger queries via a web API that takes various inputs, including: the variable in the current query whose solutions are ‘streamed’, a (subsequent) query specified by some function of the MD5 hash of its title, a variable in that query that is pre-populated with the corresponding solution, etc.:

../query=...&action=update&innerAction=execute,templateValue=...,&valueType=uri&variable=..
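
A rough sketch of how such a link might be assembled (the parameter names come from the URL pattern above; the helper and its values are hypothetical, and a real implementation would URL-encode the values):

import hashlib

def navigation_link(base_url, query_title, variable, template_value):
    """Build a link that executes a stored query, identified by the MD5 hash
    of its title, with one of its variables pre-populated from a solution."""
    params = [
        ("query", hashlib.md5(query_title.encode("utf-8")).hexdigest()),
        ("action", "update"),
        ("innerAction", "execute"),
        ("templateValue", template_value),
        ("valueType", "uri"),
        ("variable", variable),
    ]
    return base_url + "?" + "&".join("%s=%s" % (k, v) for k, v in params)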

Eventually, the API should probably be made more RESTful and target the query, possibly leveraging some caching mechanism in the process. Perhaps it can even work in concert with the SPARQL 1.1 Graph Store HTTP Protocol.

Using Amara's pushtree for heavyweight XML processing in GRDDL and SPARQL querying

I’ve been using Amara to address my high throughput needs for Extract Transform Load (ETL), querying, and processing of large amounts of RDF. In one particular part of the larger process, I needed to be able to stream very large XML documents in a particular dialect into RDF/XML. I sent an email to the akara google group describing the challenges and my thoughts behind wanting to use a streaming XML paradigm rather than XSLT.

I basically want to leverage Amara’s pushtree and its use of coroutines as a minimal-overhead pipeline for dispatching events triggered by elements in the source XML, where the source XML is a GRDDL source document and the pushtree coroutine is the transformation property. That task is still a work in progress; in the interest of expedience I went forward with XSLT, but I still need to try out some of what Uche suggested.

The other part where I have made much more progress is in streaming the results of SPARQL queries (against a SPARQL service) into a CSV file from the command line with minimal overhead (also using Amara, pushtree, and coroutines). A recent set of changes to layercake-python modified the sparqler command line to add an --endpoint option, which takes a SPARQL service URL. Other changes were made to the remote SPARQL service store to support this.

Also added, was a new sparqlcsv script:

$ sparqlcsv --help
Usage: sparqlcsv [options] [SPARQLXMLFilePath]
Options:
 -h, --help            show this help message and exit
 -q QUOTECHAR, --quoteChar=QUOTECHAR
                       The quote character to use
 -c, --count           Just count the results, do not serialize to CSV
 -d DELIMITER, --delimiter=DELIMITER
                       The delimiter to use
 -o FILEPATH, --output=FILEPATH
                       The path where to write the resulting CSV file

This script takes a SPARQL XML file either from the file indicated as the first argument or from STDIN if none is specified and writes out a CSV file to STDOUT or to a file. The general architectural idea is to build a bash pipeline from the SPARQL service to a CSV file (and eventually into a relational database for more sophisticated analysis) or to STDOUT for subsequent processing along the pipeline.

So, now I can run a query against Virtuoso and stream the CSV results into a file (with minimal processing overhead):

$ sparqler --owl=..OWL file.. --ns=..prefix..=..URL.. \
           --endpoint=..SPARQL service URL.. \
"SELECT ... { ... }" | sparqlcsv | .. subsequent processong ..

Where the namespaces in the OWL/RDF file (provided by the --owl option) and those given explicitly via the --ns option are added as namespace prefix definitions at the top of the SPARQL query that is dispatched to the remote SPARQL service located at the URL provided to the --endpoint option. Alternatively, the -o option can be used to specify a filename to which the CSV content is written.

The sparqlcsv script uses a pushtree coroutine to stream XML content into a CSV file in this way:

def produce_csv(doc, csvWriter, justCount):
    # doc: SPARQL XML result content (as a string)
    # csvWriter: a csv module writer object
    # justCount: only tally the solutions instead of writing CSV rows
    # Counter, coroutine, and U are helpers defined elsewhere in the script
    cnt = Counter()

    @coroutine
    def receive_nodes(cnt):
        while True:
            node = yield
            if justCount:
                cnt.counter += 1
            else:
                rt = []
                badChars = False
                for binding in node.binding:
                    try:
                        rt.append(U(binding).encode('ascii'))
                    except UnicodeEncodeError:
                        # Drop characters that cannot be encoded as ASCII
                        rt.append(U(binding).encode('ascii', 'ignore'))
                        badChars = True
                        print >> sys.stderr, "Skipping character", U(binding)
                if badChars:
                    cnt.skipCounter += 1
                csvWriter.writerow(rt)

    target = receive_nodes(cnt)
    # Dispatch each 'result' element of the SPARQL XML document to the coroutine
    pushtree(doc, u'result', target.send, entity_factory=entity_base)
    target.close()
    return cnt

Where doc is the SPARQL XML result document (as a string), csvWriter is a csv module writer object, and the last parameter indicates whether only the size of the solution sequence is tallied rather than the rows being written out as CSV.
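
For completeness, a sketch of how the surrounding script might drive this coroutine (produce_csv and Counter are the script’s own helpers from above; the wiring below is illustrative):

import csv
import sys

if __name__ == "__main__":
    # SPARQL XML comes from the first argument or from STDIN, as described above
    doc = open(sys.argv[1]).read() if len(sys.argv) > 1 else sys.stdin.read()
    writer = csv.writer(sys.stdout, delimiter=",", quotechar='"')
    counts = produce_csv(doc, writer, False)
    sys.stderr.write("Rows with skipped characters: %s\n" % counts.skipCounter)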

Symmetric Multi Processor Evaluation of an RDF Query

I've been doing a lot of thinking (and prototyping) of Symmetric Multi Processor evaluation of RDF queries against large RDF datasets. How could you Map/Reduce an evaluation of SPARQL / Versa (a la HbaseRDF), for instance? Entailment-free Graph matching is the way to go at large volumes. I like to eat my own dog food, so I'll start with my toolkit.

Let us say you had a large RDF dataset of historical characters modelled in FOAF, where the Graph names match the URIs assigned to the characters. How would you evaluate a query like the following (both serially and in a concurrent fashion)?

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>
SELECT ?mbox
WHERE {
    ?kingWen :name "King Wen";
             rel:parentOf ?son . 
    GRAPH ?son { [ :mbox ?mbox ] }.
}

Written in the SPARQL abstract syntax:

Join(BGP(?kingWen foaf:name "King Wen". ?kingWen rel:parentOf ?son),Graph(?son,BGP(_:a foaf:mbox ?mbox)))

If you are evaluating it naively, it would reduce (via algebra) to:

Join(eval(Dh, BGPa), eval(Dh, Graph(?son,BGPb)))

Where Dh denotes the RDF dataset of historical literary characters, BGPa denotes the BGP expression

?kingWen foaf:name "King Wen". ?kingWen rel:parentOf ?son

BGPb denotes

_:a :mbox ?mbox

The current definition of the Graph operator, as well as the one given by Pérez et al., seems (to me) amenable to parallel evaluation. Let us take a look at the operational semantics of evaluating the same query in Versa:

map(
   (*|-foaf:name->"King Wen")-rel:parentOf->*,
   'scope(".-foaf:mbox->*",.)',
)

Versa is a Graph-traversal-based RDF querying language. This has the advantage of a computational graph theory that we can rely on for analysis of a query's complexity. Parallel evaluation of (directed) Graph traversals appears to be an open problem for a deterministic Turing machine. The input to the map function would be the URIs of the sons of King Wen. The map function would be the evaluation of the expression:

scope(".-foaf:mbox->*",.)

This would seem to be the equivalent of the evaluation of the Graph(..son URI..,BGPb) SPARQL abstract expression. So far, so good. Parallel evaluation can be implemented in a manner that is transparent to the application. An analysis of the evaluation using some complexity theory would be interesting, to see whether RDF named graph traversal queries have a data complexity that scales.
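
To make the split concrete, here is a rough sketch of a serial outer BGP with the inner Graph(..) evaluations mapped over worker processes, using rdflib and multiprocessing (the dataset file and graph layout are the hypothetical historical-characters example; a real deployment would give each worker its own connection to a shared store):

from multiprocessing import Pool

from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
REL = Namespace("http://purl.org/vocab/relationship/")

# Aggregate of named graphs; each character's FOAF graph is named by their URI
dataset = ConjunctiveGraph()
dataset.parse("historical-characters.trig", format="trig")

def mboxes_in_graph(graph_name):
    """Evaluate Graph(?son, BGPb), i.e. [ :mbox ?mbox ], against one named graph."""
    g = dataset.get_context(URIRef(graph_name))
    return [str(mbox) for mbox in g.objects(None, FOAF.mbox)]

if __name__ == "__main__":
    # Serial evaluation of BGPa: King Wen and the URIs of his sons
    kingWen = next(dataset.subjects(FOAF.name, Literal("King Wen")))
    sons = [str(son) for son in dataset.objects(kingWen, REL.parentOf)]

    # The per-graph evaluations are independent, so map them over workers
    pool = Pool()
    for mboxes in pool.map(mboxes_in_graph, sons):
        print(mboxes)
    pool.close()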

Chimezie Ogbuji

via Copia

Musings of a Semantic / Rich Web Architect: What's Next?

I'm writing this on my flight back from XTech 2007, Paris, France. This gives me a decent block of time to express some thoughts and recent developments. This is the only significant time I've had in a while to do any writing.
My family

Between raising a large family, software development / evangelism, and blogging I can only afford to do two of these. So, blogging loses out consistently.

My paper (XML-powered Exhibit: A Case Study of JSON and XML Coexistence) is now online. I'll be writing a follow-up blog on how http://planetatom.net demonstrates some of what was discussed in that paper. I ran into some technical difficulties with projecting from Ubuntu, but the paper covers everything in detail. The slides are here.

My blog todo list has become ridiculously long. I've been heads-down on a handful of open source projects (mostly semantic web related) when I'm not focusing on work-related software development.
Luckily there has been a very healthy intersection of the open source projects I work on and what I do at work (and have been doing non-stop for about 4 years). In a few cases, I've spun these 'mini-projects' off under an umbrella project I've been working on called python-dlp. It is meant (in the end) to be a toolkit for semantic web hackers (such as myself) who want to get their hands dirty and have an aptitude for Python. There is more information on the main python-dlp page (linked above).

sparql-p evaluation algorithm

Some of the other things I've been working on I'd prefer to submit to appropriate peer-reviewed outlets, considering the amount of time I've put into them. First, I really would like to do a 'proper' write-up on the map/reduce approach for evaluating SPARQL Algebra expressions and the inner mechanics of Ivan Herman's sparql-p evaluation algorithm. The latter is one of those hidden gems I've become closely familiar with over time that I would very much like to examine in a peer-reviewed paper, especially if Ivan is interested in doing so in tandem =).

Since joining the W3C DAWG, I've had much more time to get even more familiar with the formal semantics of the Algebra and how to efficiently implement it on top of sparql-p to overcome the original limitation on the kinds of patterns it can resolve.

I was hoping (also) to release and talk a bit about a SPARQL server implementation I wrote in CherryPy / 4Suite / RDFLib for those who may find it useful as a quick and dirty way to contribute to the growing number of SPARQL endpoints out there. A few folks in irc:///freenode.net/redfoot (where the RDFLib developers hang out) have expressed interest, but I just haven't found the time to 'shrink-wrap' what I have so far.

On a different (non-sem-web) note, I spoke some with Mark Birbeck (at XTech 2007) about my interest in working on a 4Suite / FormsPlayer demonstration. I've spent the better part of 3 years working on FormsPlayer as a client-side platform for XML-driven applications served from a 4Suite repository and I've found the combination quite powerful. FormsPlayer (and XForms 1.1 specifically) is really the icing on the cake which takes an XML / RDF Content Management System like the 4Suite repository and turns it into a complete platform for deploying next generation rich web applications.

The combination is a perfect realization of the Rich Web Application Backplane (a recurring theme in my last two presentations / papers) and it is very much worth noting that some of the challenges / requirements I've been able to address with this methodology simply cannot be reproduced in any other approach: neither vanilla DHTML, .NET, J2EE, Ruby on Rails, Django, nor Jackrabbit. The same is probably the case with Silverlight and Apollo.

In particular, when it comes to declarative generation of user interfaces, I have yet to find a more complete approach than via XForms.

Mark Birbeck's presentation on Skimming is a good read (slides / paper is not up yet) for those not quite familiar with the architectural merits of this larger methodology. However, in his presentation eXist was used as the XML store and it struck me that you could do much more with 4Suite instead. In particular, as a CMS with native support for RDF as well as XML it opens up additional avenues. Consider extending Skimming by leveraging the SPARQL protocol as an additional mode of expressive communication beyond 'vanilla' RESTful operations on XML documents.

These are very exciting times as the value proposition of rich web (I much prefer this term over the much beleaguered Web 2.0+) and semantic web applications has fully transitioned from vacuous / academic musings to concretely demonstrable in my estimation. This value proposition is still not being communicated as well as it could, but having bundled demos can bridge this gap significantly in my opinion; much more so than just literature alone.

This is one of the reasons why I've been more passionate about doing much less writing / blogging and more hands-on hacking (if you will). The original thought (early on this year) was that I would have plenty to write about towards the middle of this year and time spent discussing the ongoing work would be premature. As it happens, things turned out exactly this way.

There is a lesson to be learned from how the Joost project progressed to where it is. The approach of talking about deployed / tested / running code has worked perfectly for them. I don't recall much public dialog about that particular effort until very recently, and now they have running code doing unprecedented things and the opportunity (I'm guessing) to switch gears to do more evangelism with a much more effective 'wow' factor.

Speaking of wow, I must say that of all the sessions at XTech 2007, the Joost session was the most impressive. The number of architectures they bridged, the list of demonstrable value propositions, the slick design, the incredibly agile and visionary use of the most appropriate technology in each case, etc., add up to an absolutely stunning achievement.

The fact that they did this all while remembering their roots (open standards, open source, open communities) leaves me with a deep sense of respect for all those involved in the project. I hope this becomes a much larger trend. Intellectual property paranoia and cloak / dagger competitive edge are things of the past in today's software problem-solving landscape. It is a ridiculously outdated mindset and I hope those who can effect real change in this regard (those higher up in their respective ORG charts than the enthusiastic hackers) are paying close attention. Oh boy. I'm about to launch into a rant, so I think I'll leave it at that.

The short of it is that I'm hoping (very soon) to switch gears from heads-down design / development / testing to much more targeted write-ups, evangelism, and such. The starting point (for me) will be Semantic Technology Conference in San Jose. If the above topics are of interest to you, I strongly suggest you attend my colleague's (Dr. Chris Pierce) session on SemanticDB (the flagship XML & RDF CMS we've been working on at the Clinic as a basis for Computerized Patient Records) as well as my session on how we need to pave a path to a new generation of XML / RDF CMSes and a few suggestions on how to go about paving this path. They are complementary sessions.

Jackrabbit architecture

JSR 170 is a start in the right direction, but the work we've been doing with the 4Suite repository for some time leaves me with the strong, intuitive impression that CMSes that have a natural (and standardized) synthesis with XML processing are only half the step towards eradicating the stronghold that monolithic technology stacks have over those (such as myself) with 'enterprise' requirements that can truly only be met with the newly emerging sets of architectural patterns: Semantic / Rich Web Applications. This stronghold can only be eradicated by addressing the absence of a coherent landscape with peer-reviewed standards. Dr. Macro has an incredibly visionary series of 'write-ups' on XML CMS that paints a comprehensive picture of some best practices in this regard.

However (as with JSR 170), there is no reason why there isn't a bridge or some form of synthesis with RDF processing within the confines of a CMS.

There is no good reason why I shouldn't be able to implement an application which is written against an abstract API for document and knowledge management irrespective of how this API is implemented (this is very much aligned with the larger goal of JSR 170). There is no reason why the 4Suite repository should be the only available infrastructure for supporting both XML and RDF processing in (standardized) synthesis.

I should be able to 'hot-swap' RDFLib with Jena or Redland, 4Suite XML with Saxon / Libxml / etc.., and the 4Suite repository with an implementation of a standard API for synchronized XML / RDF content management. The value of setting a foundation in this arena is applicable to virtually any domain in which a CMS is a necessary first component.

Until such a time, I will continue to start with 4Suite repository / RDFLib / formsPlayer as a platform for Semantic / Rich Web applications. However, I'm hoping (with my presentation at San Jose) to paint a picture of this vacuum with the intent of contributing towards enough of a critical mass to (perhaps) start putting together some standards towards this end.

Chimezie Ogbuji

via Copia

Compositional Evaluation of W3C SPARQL Algebra via Reduce/Map

[by Chimezie Ogbuji]

Committed to svn

<CIA-16> chimezie * r1132 rdflib/sparql/ (Algebra.py bison/Processor.py bison/SPARQLEvaluate.py): Full implementation of the W3C SPARQL Algebra. This should provide coverage for the full SPARQL grammar (including all combinations of GRAPH). Includes unit testing and has been run against the old DAWG testsuite.

Tested against older DAWG testsuite. Implemented using functional programming idioms: fold (reduce) / unfold (map)

Does that suggest parallelizable execution?

reduce(lambda left, right: ReduceToAlgebra(left, right), { .. triple patterns .. }) => expression

expression -> sparql-p -> solution mappings

GRAPH ?var / <.. URI ..> support as well.
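
A toy sketch of the shape of that fold (the classes and ReduceToAlgebra below are stand-ins, not the actual rdflib Algebra implementation):

from functools import reduce

class BGP(object):
    def __init__(self, *triple_patterns):
        self.patterns = list(triple_patterns)

class Graph(object):
    def __init__(self, name, pattern):
        self.name, self.pattern = name, pattern

class Join(object):
    def __init__(self, left, right):
        self.left, self.right = left, right

def ReduceToAlgebra(left, right):
    """Fold one more group graph pattern into the expression built so far."""
    return right if left is None else Join(left, right)

# { BGPa . GRAPH ?son { BGPb } } folds left-to-right into Join(BGPa, Graph(?son, BGPb));
# the resulting expression is then handed to sparql-p to produce solution mappings
patterns = [BGP(("?kingWen", "foaf:name", '"King Wen"'),
                ("?kingWen", "rel:parentOf", "?son")),
            Graph("?son", BGP(("_:a", "foaf:mbox", "?mbox")))]
expression = reduce(ReduceToAlgebra, patterns, None)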

The only things outstanding (besides the new modifiers and non-SELECT query forms), really, are:

  • a pluggable extension mechanism
  • support for an exploratory protocol
  • a way for Fuxi to implement entailment.
  • other nice-to-haves..

.. Looking forward to XTech 2007 and Semantic Technology Conference '07

Chimezie Ogbuji

via Copia

Patterns and Optimizations for RDF Queries over Named Graph Aggregates

In a previous post I used the term 'Conjunctive Query' to refer to a kind of RDF query pattern over an aggregation of named graphs. However, the term (apparently) has already-established roots in database querying and has a different meaning than what I intended. It's a pattern I have come across often and is for me a major requirement for an RDF query language, so I'll try to explain by example.

Consider two characters, King (Wen) and his heir / son (Wu) of the Zhou Dynasty. Let's say they each have a FOAF graph about themselves and the people they know within a larger database which holds the FOAF graphs of every historical character in literature.

The FOAF graphs for both Wen and Wu are (preceded by the name of each graph):

<urn:literature:characters:KingWen>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wen_of_Zhou> a :Person;
    :name "King Wen";
    :mbox <mailto:kingWen@historicalcharacter.com>;
    rel:parentOf [ a :Person; :mbox <mailto:kingWu@historicalcharacter.com> ].

<urn:literature:characters:KingWu>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wu_of_Zhou> a :Person;
    :name "King Wu";
    :mbox <mailto:kingWu@historicalcharacter.com>;
    rel:childOf [ a :Person; :mbox <mailto:kingWen@historicalcharacter.com> ].

In each case, Wikipedia URLs are used as identifiers for each historical character. There are better ways for using Wikipedia URLs within RDF, but we'll table that for another conversation.

Now let's say a third party read a few stories about "King Wen" and finds out he has a son; however, he/she doesn't know the son's name or the URL of either King Wen or his son. If this person wants to use the database to find out about King Wen's son by querying it with a reasonable response time, he/she has a few things going for him/her:

  1. foaf:mbox is an owl:InverseFunctionalProperty and so can be used for uniquely identifying people in the database.
  2. The database is organized such that all the out-going relationships (between foaf:Persons – foaf:knows, rel:childOf, rel:parentOf, etc..) of the same person are asserted in the FOAF graph associated with that person and nowhere else.
    So, the relationship between King Wen and his son, expressed with the term rel:parentOf, will only be asserted in
    urn:literature:characters:KingWen.

Yes, the idea of a character from an ancient civilization with an email address is a bit cheeky, but foaf:mbox is the only inverse functional property in FOAF to use with this example, so bear with me.

Now, both Versa and SPARQL support restricting queries with the explicit name of a graph, but there are no constructs for determining all the contexts of an RDF triple or:

The names of all the graphs in which a particular statement (or statements matching a specific pattern) are asserted.

This is necessary for a query plan that wishes to take advantage of [2]. Once we know the name of the graph in which all statements about King Wen are asserted, we can limit all subsequent queries about King Wen to that same graph without having to query across the entire database.

Similarly, once we know the email of King Wen's son we can locate the other graphs with assertions about this same email address (knowing they refer to the same person [1]) and query within them for the URL and name of King Wen's son. This is a significant optimization opportunity and key to this query pattern.

I can't speak for other RDF implementations, but RDFLib has a mechanism for this at the API level: a method called quads((subject,predicate,object)) which takes a triple pattern and returns tuples of size 4 corresponding to all the triples (across the database) that match the pattern, along with the graph each triple is asserted in:

for s, p, o, containingGraph in aConjunctiveGraph.quads((s, p, o)):
    # ... do something with containingGraph ...
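
For instance, a rough sketch of the two-step plan described above using rdflib (the aggregate dump file is hypothetical; the graph layout and data follow the King Wen / King Wu example):

from rdflib import ConjunctiveGraph, Literal, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
REL = Namespace("http://purl.org/vocab/relationship/")

db = ConjunctiveGraph()
db.parse("literature-characters.trig", format="trig")

# Step 1: one scan across the whole database to learn which named graph
# holds the statements about King Wen
kingWen, wenGraph = None, None
for s, p, o, ctx in db.quads((None, FOAF.name, Literal("King Wen"))):
    kingWen, wenGraph = s, ctx
    break

# Step 2: every follow-up question about King Wen is confined to that graph,
# e.g. the mailboxes of his children (via the rel:parentOf blank nodes)
son_mboxes = [mbox
              for son in wenGraph.objects(kingWen, REL.parentOf)
              for mbox in wenGraph.objects(son, FOAF.mbox)]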

It's likely that most other QuadStores have similar mechanisms and given the great value in optimizing queries across large aggregations of named RDF graphs, it's a strong indication that RDF query languages should provide the means to express such a mechanism.

Most of what is needed is already there (in both Versa and SPARQL). Consider a SPARQL extension function which returns a boolean indicating whether the given triple pattern is asserted in a graph with the given name:

rdfg:AssertedIn(?subj,?pred,?obj,?graphIdentifier)

We can then get the email of King Wen's son efficiently with:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?mbox
WHERE {
    GRAPH ?foafGraph {
      ?kingWen :name "King Wen";
               rel:parentOf [ a :Person; :mbox ?mbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWen,:name,"King Wen",?foafGraph) ).
}

Now, it is worth noting that this mechanism can be supported explicitly by asserting provenance statements associating the people the graphs are about with the graph identifiers themselves, such as:

<urn:literature:characters:KingWen> 
  :primaryTopic <http://en.wikipedia.org/wiki/King_Wen_of_Zhou>.

However, I think that the relationship between an RDF triple and the graph in which it is asserted, although currently outside the scope of the RDF model, should have its semantics outlined in the RDF abstract syntax instead of using terms in an RDF vocabulary. The demonstrated value in RDF query optimization makes for a strong argument:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?kingWu ?sonName
WHERE {
    GRAPH ?wenGraph {
      ?kingWen :name "King Wen";
               :mbox ?wenMbox;
               rel:parentOf [ a :Person; :mbox ?wuMbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWen,:name,"King Wen",?wenGraph) ).
    GRAPH ?wuGraph {
      ?kingWu :name ?sonName;
              :mbox ?wuMbox;
              rel:childOf [ a :Person; :mbox ?wenMbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWu,:name,?sonName,?wuGraph) ).
}

Generally, this pattern is any two-part RDF query across a database (a collection of multiple named graphs) where the first part of the query is scoped to the entire database and identifies terms that are local to a specific named graph, and the second part of the query is scoped to that named graph.

Chimezie Ogbuji

via Copia

Comprehensive RDF Query APIs for RDFLib

[by Chimezie Ogbuji]

RDFLib's support for SPARQL has come full circle and I wasn't planning on blogging on the developments until they had settled some – and they have. In particular, the last piece was finalizing a set of APIs for querying and result processing that fit well within the framework of RDFLib's various Graph APIs. The other issue was for the query APIs to accommodate eventual support for other querying languages (Versa, for instance) that are capable of picking up the slack where SPARQL is wanting (transitive closures, for instance – try composing a concise SPARQL query for calculating the transitive closure of a given node along the rdfs:subClassOf property and you'll immediately see what I mean).

Querying

Every Graph instance now has a query method through which RDF queries can be dispatched:

def query(self, strOrQuery, initBindings={}, initNs={}, DEBUG=False, processor="sparql"):
    """
    Executes a SPARQL query (eventually will support Versa queries with same method) against this Conjunctive Graph
    strOrQuery - Is either a string consisting of the SPARQL query or an instance of rdflib.sparql.bison.Query.Query
    initBindings - A mapping from variable name to an RDFLib term (used for initial bindings for SPARQL query)
    initNs - A mapping from a namespace prefix to an instance of rdflib.Namespace (used for SPARQL query)
    DEBUG - A boolean flag passed on to the SPARQL parser and evaluation engine
    processor - The kind of RDF query (must be 'sparql' until Versa is ported)
    """

The first argument is either a query string or a pre-compiled query object (compiled using the appropriate BisonGen mechanism for the target query language). Pre-compilation can be useful for avoiding redundant parsing overhead for queries that need to be evaluated repeatedly:

from rdflib.sparql.bison import Parse
queryObject = Parse(sparqlString)

The next argument (initBindings) is a dictionary that maps variables to their values. Though variables are common to both languages, SPARQL variables differ from Versa variables in that they are string terms in the form "?varName", whereas Versa variables are QNames (same as in XPath). For SPARQL queries the dictionary is expected to be a mapping from variables to RDFLib terms. This is passed on to the SPARQL processor as initial variable bindings.

initNs is yet another top-level parameter for the query processor: a namespace mapping from prefixes to namespace URIs.

The debug flag is pretty self explanatory. When True, it will cause additional print statements to appear for the parsing of the query (triggered by BisonGen) as well as the patterns and constraints passed on to the processor (for SPARQL queries).

Finally, the processor specifies which kind of processor to use to evaluate the query: 'versa' or 'sparql'. Currently (with emphasis on 'currently'), only SPARQL is supported.
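
For example, a hypothetical invocation against a local FOAF file, following the parameter conventions described above (the file name and resource URI are made up):

from rdflib import ConjunctiveGraph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = ConjunctiveGraph()
g.parse("characters.rdf")

sparql = """
SELECT ?name
WHERE { ?person foaf:knows ?friend; foaf:name ?name }
"""

# ?friend is pre-bound via initBindings; initNs supplies the foaf prefix
results = g.query(sparql,
                  initBindings={"?friend": URIRef("http://example.org/kingWu")},
                  initNs={"foaf": FOAF})
for row in results:
    print(row)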

Result formats

SPARQL has two result formats (JSON and XML). Thanks to Ivan Herman's recent contribution the SPARQL processor now supports both formats. The query method (above) returns instances of QueryResult, a common class for RDF query results which defines the following method:

def serialize(self, format='xml'):

The format argument determines which result format to use. For SPARQL queries, the allowable values are: 'graph' – for CONSTRUCT / DESCRIBE queries (in which case a resulting Graph object is returned), 'json', or 'xml'. The resulting object also behaves as an iterator over the bindings for manipulation in the host language (Python).

Versa has its own set of result formats as well. Primarily there is an XML result format (see: Versa by Example) as well as Python classes for the various internal datatypes: strings, resources, lists, sets, and numbers. So, the eventual idea is to use the same method signature for serializing Versa results as XML as you would for SPARQL queries.

SPARQL Limitations

The known SPARQL query forms that aren't supported are:

  • DESCRIBE/CONSTRUCT (very temporary)
  • Nested Group Graph Patterns
  • Graph Patterns can only be used (once) by themselves or with OPTIONAL patterns
  • UNION patterns can only be combined with OPTIONAL patterns
  • Outer FILTERs which refer to variables within an OPTIONAL

A lot of the above limitations can be addressed with formal equivalence axioms of SPARQL semantics, such as those mentioned in the recent paper on the complexity and semantics of SPARQL. Since there is very little guidance on the semantics of SPARQL, I was left with the option of implementing only those equivalences that seemed obvious (in order to support the patterns in the DAWG test suite):

1) { patternA { patternB } } => { patternA. patternB }
2) { basicGraphPatternA OPTIONAL { .. } basicGraphPatternB }
  =>
{ basicGraphPatternA+B OPTIONAL { .. }}

It's still a work in progress but has come quite a long way. CONSTRUCT and DESCRIBE are already supported by the underlying processor; I just need to find some time to hook them up to the query interfaces. Time is something I've been really short of lately.

Chimezie Ogbuji

via Copia

SPARQL BisonGen Parser Checked in to RDFLib

[by Chimezie Ogbuji]

This is basically an echo of my recent post to the rdflib mailing list (yes, we have one now).

I just checked in the most recent version of what had been an experimental, BisonGen SPARQL parser for RDFLib. It parses a SPARQL query into a set of Python objects representing the components of the grammar:

The parser itself is a Python/C extension (but the BisonGen grammar could be extended to incorporate Java callbacks instead), so the setup.py had to be modified in order to compile it into a Python module. The BisonGen files themselves are:

  • SPARQL.bgen (the main file that includes the others)
  • SPARQLTurtleSuperSet.bgen.frag (the second part of the grammar which focuses on the variant of Turtle that SPARQL uses)
  • SPARQLTokens.bgen.frag (Token definitions)
  • SPARQLLiteralLexerPatterns.bgen.frag (Keywords and 'literal' tokens)
  • SPARQLLexerPatterns.bgen.frag (lexer definition)
  • SPARQLLexerDefines.bgen.frag (the lexer patterns themselves)
  • SPARQLParser.c (the generated parser)

Theoretically, the second part of the grammar dedicated to the Turtle syntax could be broken out into separate Turtle/N3 parsers which could be built into RDFLib, removing the current dependency on n3p.

I also checked in a test harness that's meant to work with the DAWG test cases:

I'm currently stuck on this particular test case, but working through it. For the most part, the majority of the grammar is supported, except for mathematical expressions and certain case-insensitive variations on the SPARQL operators.

The test harness only checks parsing; it doesn't evaluate the parsed query against the corresponding set of test data, but it can easily be extended to do so. I'm not sure about the state of those test cases: some have been 'accepted' and some haven't. In addition, I came across a couple that were illegal according to the most recent SPARQL grammar (the bad tests are noted in the test harness). Currently the parser is stand-alone; it doesn't invoke sparql-p, for a few reasons:

  • I wanted to get it through parsing the queries in the test case
  • Our integrated version of sparql-p is outdated as there is a more recent version that Ivan has been working on with some improvements that should probably be considered for integration
  • Some of the more complex combinations of Graph Patterns don't seem solvable without re-working / extending the expansion tree solver. I have some ideas about how this could be done (to handle things like nested UNIONS and OPTIONALs) but wanted to get a working parser in first

Using the parser is simple:

from rdflib.sparql.bison import Parse
p = Parse(query,DEBUG)
print p

p is an instance of rdflib.sparql.bison.Query.Query

Most of the parsed objects implement a __repr__ function which prints a 'meaningful' representation recursively down the hierarchy to the lower level objects, so tracing how each __repr__ method is implemented is a good way to determine how to deconstruct the parsed SPARQL query object.

These methods could probably be re-written to echo the SPARQL query right back as a way to

  • Test round-tripping of SPARQL queries
  • Create SPARQL queries by instantiating the rdflib.sparql.bison.* objects and converting them to strings

It's still a work in progress, but I think it's far enough through the test cases that it can handle most of the more common syntax.

Working with BisonGen was a good experience for me as I hadn't done any real work with parser generators since my days at the University of Illinois (class of '99'). There are plenty of good references online for the Flex pattern format as well as Bison itself. I also got some good pointers from AndyS and EricP on #swig.

It also was an excellent way to get familiar with the SPARQL syntax from top to bottom, since every possible nuance of the grammar that may not be evident from the specification had to be addressed. It also generated some comments on inconsistencies in the specification grammar that I've since redirected to public-rdf-dawg-comments.

Chimezie Ogbuji

via Copia

Extension Functionality and Set Manipulation in RDF Query Languages

A recent bit by Andy Seaborne (on Property Functions in ARQ – Jena's query engine) got me thinking about general extension mechanisms for RDF querying languages.
In particular, he mentions two extensions that provide functionality for processing RDF lists and collections which (ironically) coincide with functions I had requested be considered for later generations of Versa.

The difference, in my case, was that the suggestions were for functions that cast RDF lists into Versa lists (or sets) – which are data structures native to Versa that can be processed with certain built-in functions.

Two other extensions I use quite often in Versa (and mentioned briefly in my XML.com article) are scope, and scoped-subquery. These have to do with identifying the context of a resource and limiting the evaluation of a query to within a named graph, respectively. Currently, the scoped function returns a list of the names of all graphs in which the resource is asserted as a member of any class (via rdf:type). I could imagine this being expanded to include the names of any graph in which statements about the resource are asserted. scoped-subquery doesn't present much value for a host language that can express queries as limited to a named context.

I also had some thoughts about an extension function mechanism that allowed an undefined function reference (for functions of arity 1 – i.e. functions that take only a single argument) to be interpreted as a request for all the objects of statements where the predicate is the function URI and the subject is the argument.
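
A toy illustration of that fallback (the function name is made up; graph is any rdflib-style graph object):

def apply_unary_extension(graph, function_uri, argument):
    """Fallback for an undefined arity-1 extension function: treat the call
    func(arg) as a request for all objects o of triples (arg, func, o)."""
    return list(graph.objects(argument, function_uri))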

I recently finished writing a SPARQL grammar for BisonGen and hope to conclude that effort (at some point) once I get over a few hurdles. I was pleasantly surprised to find that the grammar for function invocation is pretty much identical for both query languages, which suggests that there is room for some thought about a common mechanism (or just a common set of extension functionality – similar to the EXSLT effort) for RDF querying or general processing.

CWM has a rich and well-documented set of RDF extensions. The caveat is that the method signatures are restricted to dual input (subject and object) since the built-ins are expressed as RDF triples where the predicate is the name of the function and the subject and object are arguments to it. Nevertheless, it is a good source from which an ERDF specification could be drafted.

My short wish-list of extension functions in such an effort would include:

  • List comprehension (intersection, union, difference, membership, indexing, etc.)
  • Resolving the context for a given node: context(rdfNode) => URI (or BNode?)
  • an is-a(resource) function (equivalent to Versa's type function without the entailment hooks)
  • a class(resource) which is an inverse of is-a
  • Functions for transitive closures and/or traversals (at the very least)
  • A fallback mechanism for undefined functions that allowed them to be interpreted as unary 'predicate functions'

Of course, functions which return lists instead of single resources would be problematic for any host language that can't process lists, but it's just some food for thought.

Chimezie Ogbuji

via Copia

Wrapping rdflib's Graph around a 4RDF Model

Well, for some time I had pondered what it would take to provide SPARQL support in 4Suite RDF. I fell upon sparql-p earlier and noticed it was essentially a SPARQL query processor without a parser to drive it. It works over a deprecated rdflib interface: TripleStore. The newly suggested interface is Graph, which is as solid a suggestion for a generic RDF API as any. So, I wrote a 4Suite RDF model backend for rdflib that allows wrapping Graph around a live 4Suite RDF model. Finally, I used this backend to execute a sparql-p query over http://del.icio.us/rss/chimezie:

SELECT
  ?title
WHERE {
  ?item rdf:type rss:item;
        dc:subject ?subj;
        rss:title ?title.
        FILTER (REGEX(?subj,".*rdf")).
}

The corresponding python code:

#Setup FtRDF Model
Memory.InitializeModule()   
db = Memory.GetDb('rules', 'test')
db.begin()
model = Model.Model(db)

#Parse my del.icio.us rss feed
szr = Dom.Serializer()
domStr=urllib2.urlopen('http://del.icio.us/rss/chimezie').read()        
dom = Domlette.NonvalidatingReader.parseString(domStr,'http://del.icio.us/rss/chimezie')
szr.deserialize(model,dom,scope='http://del.icio.us/rss/chimezie')

#Setup rdflib.Graph with FtRDF Model as Backend, using FtRdf driver
g=Graph(FtRdf(model))

#Setup sparql-p query processor engine
select = ("?title")

#Setup terms (rdf_type is used in the GraphPattern below)
rdf_type = URIRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
copia = URIRef('http://del.icio.us/chimezie')
rssTitle = URIRef('http://purl.org/rss/1.0/title')
versaWiki = URIRef('http://en.wikipedia.org/wiki/Versa')
dc_subject = URIRef("http://purl.org/dc/elements/1.1/subject")

#Filter on objects of statements (dc:subject values) - keep only those containing the string 'rdf'
def rdfSubFilter(subj,pred,obj):
    return bool(obj.find('rdf')+1)

#Execute query
where = GraphPattern([("?item",rdf_type,URIRef('http://purl.org/rss/1.0/item')),
                       ("?item",dc_subject,"?subj",rdfSubFilter),
                       ("?item",rssTitle,"?title")])    
tStore = myTripleStore(FtRdf(model))
result = tStore.query(select,where)
pprint(result)

The result (which will change daily as my links shift through my del.icio.us channel queue):

[chimezie@Zion RDF-API]$ python FtRdfBackend.py
[u'rdflibUtils',
 u'Representing Specified Values in OWL: "value partitions" and "value sets"',
 u'Sparta',
 u'planner-rdf',
 u'RDF Template Language 1.0',
 u'SIOC Vocabulary Specification',
 u'SPARQL in RDFLib',
 u'MeetingRecords - ESW Wiki',
 u'Enumerated datatypes (OWL)',
 u'Defining N-ary Relations on the Semantic Web: Use With Individuals']

Chimezie Ogbuji

via Copia