Triclops gets a facelift, new query management capabilities, and new APIs

by Chimezie Ogbuji

I recently had a need to manage a set of queries against an OWL2 EL biomedical ontology: the Foundational Model of Anatomy. I have an open source SPARQL service implementation that I had some thoughts about extending with support for managing queries. It’s called Triclops and is part of a collection of RDF libraries and tools I have been accumulating. The name is a reference to an initial attempt to build an RDF querying and navigation interface as part of the 4Suite repository back in the day (circa 2002).

This later evolved to a very rudimentary web interface that sat in front of the Oracle 11g and MySQL/SPARQL patient dataset that Cyc’s SKSI interacted with. This was part of an interface tailored to the task of identifying patient cohorts, known as the Semantic Research Assistant (SRA). A user could dispatch handwritten SPARQL queries, browse clickable results, or return them as CSV files. This capability was only used by informaticians familiar with the structure of the RDF dataset and most investigators used the SRA.

It also implemented a RESTful protocol for ticket-based querying that was used for stopping long-running SPARQL/MySQL queries. This is not currently documented. Around the time this was committed as an Apache-licensed, Google code library, layercake-python added core support for APIs that treated remote SPARQL services as local Graph objects as well as general support for connecting SPARQL services. This was based on Ivan Herman’s excellent SPARQL Endpoint interface to Python.

Triclops (as described in the wiki) can now be configured as a “Proxy SPARQL Endpoint”. It can be deployed as a light-weight query dispatch, management, and mediation kiosk for remote and local RDF datasets. The former capability (dispatching) was already in place, the latter (mediation) can be performed using FuXi’s recent capabilities in this regard.

Specifically, FuXi includes an rdflib Store that uses its sideways-information passing (sip) strategies the in-memory SPARQL algebra implementation for use as a general-purpose framework for semweb SPARQL (OWL2-RL/RIF/N3) entailment regimes. Queries are mediated over the SPARQL protocol using global schemas captured as various kinds of semweb ontology artifacts (expressed in a simple Horn form) that describe and distinguish their predicates by those instantiated in a database (or factbase) and those derived via the semantic properties of these artifacts.

So the primary capability that remained was query management and so this recent itch was scratched over the holidays. I discovered that CodeMirror , a JavaScript library that can be used to create a relatively powerful editor interface for code, had excellent support for SPARQL. I integrated it into Triclops as an interface for managing SPARQL queries and their results. I have a running version of this at http://metacognition.info/sparql/queryMgr. Note, the service is liable to break at any point as Webfaction kills of processes that use up alot of CPU and I have yet to figure out how to configure it to restart the service when it dies in such a fashion.

The dataset this interface manages queries for is a semantic web of content comprising 3 of the primary, ancient Chinese, classical texts (the Analects, Doctrine of the Mean, and the Tao Te Ching). I record the information in RDF because it is an intuitive knowledge representation to use in capturing provenance, exposition, and other editorial meta data. Below is a screen shot of the main page listing a handful of queries, their name, last date of modification, date of last run, and number of solutions in the recent result.

Main SPARQL service page

Above the list is a syntax-highlighted text area for dispatching adhoc SPARQL queries. This is where CodeMirror is integrated. If I click on the name of the query titled “Query for Analects and the Doctrine of the Mean english chapter text (Confucius)”, I go to a similar screen with another text area whose content corresponds to the text of the query (see the screen shot below).

Main SPARQL service page

From here queries can be updated (by submitting updated CodeMirror content) or cloned (using the name field for the new copy). Alternatively, the results of previous queries can be rendered. This sends back a result document with an XSLT processing instruction that causes the browser to trigger a request for a stylesheet and render an XHTML document from content in the result document on the client side.

Finally, a query can be re-executed against a dataset, saving the results and causing the information in the first screen to show different values for the last execution run (date and number of solutions). Results can also be saved or viewed as CSV using a different stylesheet against the result document.

The last capability added is a rudimentary template system where any variable in the query or text string of the form ‘$ …. $’ is replaced with a provided string or a URI. So, I can change the pick list value on the second row of the form controls to $searchExpression$ and type “water”. This produces a SPARQL query (visible with syntax highlighting via CodeMirror) that can be used as a template to dispatch queries against the dataset.

In addition, solutions for a particular variable can be used for links, providing a small framework for configurable, navigation workflows. If I enter “[Ww]ater” in the field next to $searchExpression$, select classic from the pick list at the top of the Result navigation template area, pick “Assertions in a (named) RDF graph” from the next pick list, and enter the graphIRI variable in the subsequent text input.

Triggering this form submission will produce the result screen pictured below. As specified in the form, clicking any of the the dbpedia links for the Doctrine of the Mean will initiate the invokation of the query titled “Assertions in a (named) RDF graph”, and shown below (with the graphIRI variable pre-populated with the corresponding URI):

SELECT DISTINCT ?s ?p ?o where {
    GRAPH ?graphIRI {
      ?s ?p ?o
    }
}

Main SPARQL service page

The result of such an action is shown in the screen shot. Alternatively, a different subsequent query can be used: “Statements about a resource”. The relationship between the schema of a dataset and the factbase can be navigated in a similar way. Picking the query titled “Classes in dataset” and making the following modifications. Select “Instances of a class and graph that the statements are asserted in” from the middle pick list of the Result navigation template section. Enter ?class in the text field to the right of this. Selecting ‘Execute..’ and executing this query results in a clickable result set comprised of classes of resources and clicking any such link shows the instances of that class.

Main SPARQL service page

This latter form of navigation seems well suited for exploring datasets for which either there is no schema information in the service or it is not well known by the investigator writing the queries.

In developing this interface, at least 2 architectural principles were re-used from my SemanticDB development days: the use of XSLT on the client side to build rich, offloaded (X)HTML applications and the use of the filesystem for managing XML documents rather than a relational database. The latter (use of a filesystem) is particularly more relevant where querying across the documents is not a major requirement or even a requirement at all. The former is via setting the processing instruction of a result document to refer to a dynamically generated XSLT document on the server.

The XSLT creates a tabular, row-distinguishing, tabular interface where the links to certain columns trigger queries via a web API that takes various input, including: the variable in the current query whose solutions are ‘streamed’, a (subsequent) query specified by some function of the MD5 hash of its title, a variable in that query that is pre-populated with the corresponding solution, etc:

../query=...&action=update&innerAction=execute,templateValue=...,&valueType=uri&variable=..

Eventually, the API should probably be made more RESTful and target the query, possibly leveraging some caching mechanism in the process. Perhaps it can even work in concert with the SPARQL 1.1 Graph Store HTTP Protocol.

Python APIs for the Upper Portions of the SW Layer Cake

[by Chimezie Ogbuji]

So, I haven't written about some recent capabilities I've been developing in FuXi. Ultimately, I would like it to be part of a powerful open-source library for semantic web middle-ware development. The current incarnation of this is python-dlp.

The core inference engine ( a RETE-based production system ) has been stable for some time now. FuXi is comprised of the following major modules:

  • DLP (An implementation of the DLP transformation)
  • Horn (an API for RIF Basic Logic Dialect)
  • Rete (the inference engine)
  • Syntax (Idiomatic APIs for OWL - InfixOWL)

I've written about all but the 2nd item (more on this in another article) and the last in the above list. I've come full circle on APIs for RDF several times, but I now believe that a pure object RDF model (ORM) approach will never be as useful as a object RDF vocabulary model. I.e., an API built for a specific RDF vocablary. The Manchester OWL syntax has quickly become my syntax of choice for OWL and (at the time) I had been toying with an idea of turning it into an API.

Then one day, I had an exchange with Sandro Hawke on IRC about what a Python API for the OWL Abstract Syntax would look like. I had never taken a close look at the abstract syntax until then and immediately came away thinking something more light-weight and idiomatic would be preferable.

I came across a powerful infix operator recipe for Python:

Infix operators

I wrote an initial, standalone module called InfixOWL and put up a wiki which still serves as decent initial documentation for the syntax. I've since moved it into a proper module in FuXi, fixed some bugs and very recently added even more idiomatic APIs.

The module defines the following top-level Python classes:

  • Individual - Common class for 'type' descriptor
  • AnnotatibleTerm(Individual) - Common class for 'label' and 'comment' descriptors
  • Ontology(AnnotatibleTerm) - OWL ontology
  • Class(AnnotatibleTerm) - OWL Class
  • OWLRDFListProxy - Common class for class descriptions composed from boolean operators
  • EnumeratedClass(Class) - Class descriptions consisting of owl:oneOf
  • BooleanClass(Class,OWLRDFListProxy) - Common class for owl:intersectionOf / owl:unionOf descriptions
  • Restriction(Class) - OWL restriction
  • Property(AnnotatibleTerm) - OWL property

Example code speaks much louder than words, so below is a snippet of InfixOWL code which I used to compose (programmatically) the CPR ontology:

CPR   = Namespace('http://purl.org/cpr/0.75#')
INF   = Namespace('http://www.loa-cnr.it/ontologies/InformationObjects.owl#')
EDNS  = Namespace('http://www.loa-cnr.it/ontologies/ExtendedDnS.owl#')
DOLCE = Namespace('http://www.loa-cnr.it/ontologies/DOLCE-Lite.owl#')
OBI   = Namespace('http://obi.sourceforge.net/ontology/OBI.owl#')
SNAP  = Namespace('http://www.ifomis.org/bfo/1.0/snap#')
SPAN  = Namespace('http://www.ifomis.org/bfo/1.0/span#')
REL   = Namespace('http://www.geneontology.org/owl#')
GALEN = Namespace('http://www.co-ode.org/ontologies/galen#')
TIME  = Namespace('http://www.w3.org/2006/time#')
CYC   = Namespace('http://sw.cyc.com/2006/07/27/cyc/')
XML   = Namespace('http://www.w3.org/2001/04/infoset#')
g = Graph()    
g.namespace_manager = namespace_manager
Class.factoryGraph = g
Property.factoryGraph = g
Ontology.factoryGraph = g

cprOntology = Ontology("http://purl.org/cpr/owl")
cprOntology.imports = ["http://obo.sourceforge.net/relationship/relationship.owl",
                       DOLCE,
                       #EDNS,
                       URIRef("http://obi.svn.sourceforge.net/viewvc/*checkout*/obi/ontology/trunk/OBI.owl"),
                       "http://www.w3.org/2006/time#"]
representationOf = Property(CPR['representation-of'],
                            inverseOf=Property(CPR['represented-by']),
                            domain=[Class(CPR['representational-artifact'])],
                            comment=[Literal("...")])
... snip ...
person = Class(CPR.person,
               subClassOf=[Class(SNAP.Object)])
... snip ...
clinician = Class(CPR.clinician)
clinician.comment=Literal("A person who plays the clinician role (typically Nurse, Physician / Doctor,etc.)")
#This expressio is equivalent to cpr:clinician rdfs:subClassOf cpr:person
person+=clinician    
.. snip ..
patientRecord = Class(CPR['patient-record'])
patientRecord.comment=Literal("an electronic document (a representational artifact [REFTERM])  "+
                               "which captures clinically relevant data about a specific patient and "+
                               " is primarily comprised of one or more cpr:clinical-descriptions.")
patientRecord.seeAlso = URIRef("http://ontology.buffalo.edu/bfo/Terminology_for_Ontologies.pdf")
patientRecord.subClassOf = \
    [bytes,
     #Class(CYC.InformationBearingThing),
     CPR['representation-of'] |only| patient,
     REL.OBO_REL_has_proper_part |some| Class(CPR['clinical-description'])]
... snip ...
problemDisj=Class(CPR['pathological-state']) | organismalOccurrent | extraOrganismalOccurrent
problem = Class(CPR['medical-problem'],
                subClassOf=[problemDisj,
                            realizedBy|only|Class(CPR['screening-act']),                                
                            DOLCE['has-quality'] |some| owlTimeQuality])

After the OWL graph is composed, it can be serialized by simply invoking:

g.serialize(format='pretty-xml')

[Chimezie Ogbuji]

via Copia

FuXi: Becoming a Full-fledged Logical Reasoning System

[by Chimezie Ogbuji]

I've been doing alot of "Google reading" lately

Completing Logical Reasoning System Capabilities

With the completion (or near completion) of PML-generating capabilities for FuXi, it is becoming a fully functional logical reasoning system. In "(Artificial Intelligence: A Modern Approach)" Stuart Russel and Peter Norvig identify the following main categories for automated reasoning systems:

  1. Theorem provers
  2. Production systems
  3. Frame systems and semantic networks
  4. Description Logic systems

OWL and RDF are coverage for 3 and 4. The second category is functionally covered by the RETE-UL algorithm FuXi employs (a highly efficient modification of the original RETE algorithm). The currently developing RIF Basic Logic Dialect covers 2 - 4. Proof Markup Language covers 1. Now, FuXi can generate (and export visualization diagrams) Proof Markup Language (PML) structures. I still need to do more testing, and hope to be able to generate proofs for each of the OWL tests. Until then, below is a diagram of the proof tree generated from the "She's a Witch and I have Proof" test case:

# http://clarkparsia.com/weblog/2007/01/02/burn-the-witch/
# http://www.netfunny.com/rhf/jokes/90q4/burnher.html
@prefix : <http://www.w3.org/2000/10/swap/test/reason/witch#>.
@keywords is, of, a.
#[1]    BURNS(x) /\ WOMAN(x)            =>      WITCH(x)
{ ?x a BURNS. ?x a WOMAN } => { ?x a WITCH }.
#[2]    WOMAN(GIRL)
GIRL a WOMAN.
#[3]    \forall x, ISMADEOFWOOD(x)      =>      BURNS(x)
{ ?x a ISMADEOFWOOD. } => { ?x a BURNS. }.
#[4]    \forall x, FLOATS(x)            =>      ISMADEOFWOOD(x)
{ ?x a FLOATS } => { ?x a ISMADEOFWOOD }.
#[5]    FLOATS(DUCK)
DUCK a FLOATS.
#[6]    \forall x,y FLOATS(x) /\ SAMEWEIGHT(x,y) =>     FLOATS(y)
{ ?x a FLOATS. ?x SAMEWEIGHT ?y } => { ?y a FLOATS }.
# and, by experiment
# [7]   SAMEWEIGHT(DUCK,GIRL)

Shes a witch and I have proof trace

There is another test case of the "Dan's home region is Texas" test case on a python-dlp Wiki: DanHomeProof:

@prefix : <gmpbnode#>.
@keywords is, of, a.
dan home [ in Texas ].
{ ?WHO home ?WHERE.
  ?WHERE in ?REGION } => { ?WHO homeRegion ?REGION }.

Dans home region is Texas proof

I decided to use PML structures since there are a slew of Stanford tools which understand / visualize it and I can generate other formats from this common structure (including the CWM reason.n3 vocabulary). Personally, I prefer the proof visualization to the typically verbose step-based Q.E.D. proof.

Update: I found nice write-up on the CWM-based reason ontology and translations to PML

So, how does a forward-chaining production rule system generate proofs that are really meant for backward chaining algorithms? When the FuXi network is fed initial assertions, it is told what the 'goal' is. The goal is a single RDF statement which is being prooved. When the forward-chaining results in a inferred triple which matches the goal, it terminates the RETE algorithm. So, depending on the order of the rules and the order that the initial facts are fed it will be (for the general cases) less efficient than a backward chaining algorithm. However, I'm hoping the blinding speed of the fully hashed RETE-UL algorithm makes up the difference.

I've been spending quite a bit of time on FuXi mainly because I am interested in empirical evidence which supports a school of thought which claims that Description Logic based inference (Tableaux-based inference) will never scale as well the Logic Programming equivalent - at least for certain expressive fragments of Description Logic (I say expressive because even given the things you cannot express in this subset of OWL-DL there is much more in Horn Normal Form (and Datalog) that you cannot express even in the underlying DL for OWL 1.1). The genesis of this is a paper I read, which lays out the theory, but there was no practice to support the claims at the time (at least that I knew of). If you are interested in the details, the paper is "Description Logic Programs: Combining Logic Programs with Description Logic" and written by many people who are working in the Rule Interchange Format Working Group.

It is not light reading, but is complementary to some of Bijan's recent posts about DL-safe rules and SWRL.

A follow-up is a paper called "A Realistic Architecture for the Semantic Web" which builds on the DLP paper and makes claims that the current OWL (Description Logic-based) Semantic Web inference stack is problematic and should instead be stacked ontop of Logic Programming since Logic Programming algorithm has a much richer and pervasively deployed history (all modern relational databases, prolog, etc..)

The arguments seem sound to me, so I've essentially been building up FuXi to implement that vision (especially since it employes - arguably - the most efficient Production Rule inference algorithm). The final piece was a fully-functional implementation of the Description Horn Logic algorithm. Why is this important? The short of it is that the DLP paper outlines an algorithm which takes a (constrained) set of Description Logic expressions and converts them to 'pure' rules.

Normally, Logic Programming N3 implementations pass the OWL tests by using a generic ruleset which captures a subset of the OWL DL semantics. The most common one is owl-rules.n3. DLP flips the script by generating a rule-set specifically for the original DL, instead of feeding OWL expressions into the same network. This allows the RETE-UL algorithm to create an even more efficient network since it will be tailored to the specific bits of OWL.

For instance, where I used to run through the OWL tests in about 4 seconds, I can now pass them in about 1 secound using. Before I would setup a RETE network which consisted of the generic ruleset once and run the tests through it (resetting it each time). Now, for each test, I create a custom network, evaluate the OWL test case against it. Even with this extra overhead, it is still 4 times faster! The custom network is trivial in most cases.

Ultimately I would like to be able to use FuXi for generic "Semantic Web" agent machinery and perhaps even to try some that programming-by-proof thing that Dijkstra was talking about.

Chimezie Ogbuji

via Copia

A Content Repository API for Rich, Semantic Web Applications?

[by Chimezie Ogbuji]

I've been working with roll-your-own content repositories long enough to know that open standards are long overdue.

The slides for my Semantic Technology Conference 2007 session are up: "Tools for the Next Generation of CMS: XML, RDF, & GRDDL" (Open Office) and (Power point)

This afternoon, I merged documentation of the 4Suite repository from old bits (and some new) into a Wiki that I hope to contribute to (every now and then).
I think there is plenty of mature, supporting material upon which a canon of best practices for XML/RDF CMSes can be written, with normative dependencies on:

  • GRDDL
  • XProc
  • Architecture of the World Wide Web, Volume One
  • URI RFCs
  • Rich Web Application Backplane
  • XML / XML Base / XML Infoset
  • RDDL
  • XHTML 1.0
  • SPARQL / Versa (RDF querying)
  • XPath 2.0 (JSR 283 restriction) with 1.0 'fallback'
  • HTTP 1.0/1.1, ACL-based HTTP Basic / Digest authentication, and a convention for Web-based XSLT invokation
  • Document/graph-level ACL granularity

The things that are missing:

  • RDF equivalent of DOM Level 3 (transactional, named graphs, connection management, triple patterns, ... ) with events.
  • A mature RIF (there is always SWRL, Notation 3, and DLP) as a framework for SW expert systems (and sentient resource management)
  • A RESTful service description to complement the current WSDL/SOAP one

For a RESTful service description, RDF Forms can be employed to describe transport semantics (to help with Agent autonomy), or a mapping to the Atom Publishing Protocol (and a thus a subset of GData) can be written.

In my session, I emphasized how closely JSR 283 overlaps with the 4Suite Repository API.

The delta between them mostly has to do with RDF, other additional XML processing specifications (XUpdate, XInclude, etc.), ACL-based HTTP authentication (basic, and sessions), HTTP/XSLT bindings, and other miscellaneous bells and whistles

Chimezie Ogbuji

via Copia

Compositional Evaluation of W3C SPARQL Algebra via Reduce/Map

[by Chimezie Ogbuji]

Committed to svn

<CIA-16> chimezie * r1132 rdflib/sparql/ (Algebra.py bison/Processor.py bison/SPARQLEvaluate.py): Full implementation of the W3C SPARQL Algebra. This should provide coverage for the full SPARQL grammar (including all combinations of GRAPH). Includes unit testing and has been run against the old DAWG testsuite.

Tested against older DAWG testsuite. Implemented using functional programming idioms: fold (reduce) / unfold (map)

Does that suggest parallelizable execution?

reduce(lambda left,right: ReduceToAlgebra(left,right),{ .. triple patterns .. } => expression

expression -> sparql-p -> solution mappings

GRAPH ?var / <.. URI ..> support as well.

The only things outstanding (besides the new modifiers and non-SELECT query forms), really, are:

  • a pluggable extension mechanism
  • support for an exploratory protocol
  • a way for Fuxi to implement entailment.
  • other nice-to-haves..

.. Looking forward to XTech 2007 and Semantic Technology Conference '07

Chimezie Ogbuji

via Copia

Comprehensive RDF Query API's for RDFLib

[by Chimezie Ogbuji]

RDFLib's support for SPARQL has come full circle and I wasn't planning on blogging on the developments until they had settled some – and they have. In particular, the last piece was finalizing a set of APIs for querying and result processing that fit well within the framework of RDFLib's various Graph API's. The other issue was for the query APIs to accomodate eventual support for other querying languages (Versa for instance) that are capable of picking up the slack where SPARQL is wanting (transitive closures, for instance – try composing a concise SPARQL query for calculating the transitive closure of a given node along the rdfs:subClassOf property and you'll immediately see what I mean).

Querying

Every Graph instance now has a query method through which RDF queries can be dispatched:

def query(self, strOrQuery, initBindings={}, initNs={}, DEBUG=False, processor="sparql")
    """
    Executes a SPARQL query (eventually will support Versa queries with same method) against this Conjunctive Graph
    strOrQuery - Is either a string consisting of the SPARQL query or an instance of rdflib.sparql.bison.Query.Query
    initBindings - A mapping from variable name to an RDFLib term (used for initial bindings for SPARQL query)
    initNS - A mapping from a namespace prefix to an instance of rdflib.Namespace (used for SPARQL query)
    DEBUG - A boolean flag passed on to the SPARQL parser and evaluation engine
    processor - The kind of RDF query (must be 'sparql' until Versa is ported)
    """

The first argument is either a query string or a pre-compiled query object (compiled using the appropriate BisonGen mechanism for the target query language). Pre-compilation can be useful for avoiding redundant parsing overhead for queries that need to be evaluated repeatedly:

from rdflib.sparql.bison import Parse
queryObject = Parse(sparqlString)

The next argument (initBindings) is dictionary that maps variables to their values. Though variables are common to both languages, SPARQL variables differ from Versa queries in that they are string terms in the form “?varName”, wherease in Versa variables are QNames (same as in Xpath). For SPARQL queries the dictionary is expected to be a mapping from variables to RDFLib terms. This is passed on to the SPARQL processor as initial variable bindings.

initNs is yet another top-level parameter for the query processor: a namespace mapping from prefixes to namespace URIs.

The debug flag is pretty self explanatory. When True, it will cause additional print statements to appear for the parsing of the query (triggered by BisonGen) as well as the patterns and constraints passed on to the processor (for SPARQL queries).

Finally, the processor specifies which kind of processor to use to evaluate the query: 'versa' or 'sparql'. Currently (with emphasis on 'currently'), only SPARQL is supported.

Result formats

SPARQL has two result formats (JSON and XML). Thanks to Ivan Herman's recent contribution the SPARQL processor now supports both formats. The query method (above) returns instances of QueryResult, a common class for RDF query results which define the following method:

def serialize(self,format='xml')

The format argument determines which result format to use. For SPARQL queries, the allowable values are: 'graph' – for CONSTRUCT / DESCRIBE queries (in which case a resulting Graph object is returned), 'json',or 'xml'. The resulting object also behaves as an iterator over the bindings for manipulation in the host language (Python).

Versa has it's own set of result formats as well. Primarily there is an XML result format (see: Versa by Example) as well as Python classes for the various internal datatypes: strings,resources,lists,sets,and numbers. So, the eventual idea is for using the same method signature for serializing Versa queries as XML as you would for SPARQL queries.

SPARQL Limitations

The known SPARQL query forms that aren't supported are:

  • DESCRIBE/CONSTRUCT (very temporary)
  • Nested Group Graph Patterns
  • Graph Patterns can only be used (once) by themselves or with OPTIONAL patterns
  • UNION patterns can only be combined with OPTIONAL patterns
  • Outer FILTERs which refer to variables within an OPTIONAL

Alot of the above limitations can be addressed with formal equivalence axioms of SPARQL semantics, such as those mentioned in the recent paper on the complexity and semantics of SPARQL. Since there is very little guidance on the semantics of SPARQL I was left with the option of implementing only those equivalences that seemed obvious (in order to support the patterns in the DAWG test suite):

1) { patternA { patternB } } => { patternA. patternB }
2) { basicGraphPatternA OPTIONAL { .. } basicGraphPatternB }
  =>
{ basicGraphPatternA+B OPTIONAL { .. }}

It's still a work in progress but has come quite a long way. CONSTRUCT and DESCRIBE are already supported by the underlying processor, I just need to find some time to hook it up to the query interfaces. Time is something I've been real short of lately.

Chimezie Ogbuji

via Copia

SPARQL BisonGen Parser Checked in to RDFLib

[by Chimezie Ogbuji]

This is basically an echo of my recent post to the rdflib mailing list (yes, we have one now).

I just checked in the most recent version of what had been an experimental, BisonGen SPARQL parser for RDFLib. It parses a SPARQL query into a set of Python objects representing the components of the grammar:

The parser itself is a Python/C extension (but the BisonGen grammar could be extended to incorporate Java callbacks instead), so the setup.py had to be modified in order to compile it into a Python module. The BisonGen files themselves are:

  • SPARQL.bgen (the main file that includes the others)
  • SPARQLTurtleSuperSet.bgen.frag (the second part of the grammar which focuses on the variant of Turtle that SPARQL uses)
  • SPARQLTokens.bgen.frag (Token definitions)
  • SPARQLLiteralLexerPatterns.bgen.frag (Keywords and 'literal' tokens)
  • SPARQLLexerPatterns.bgen.frag (lexer definition)
  • SPARQLLexerDefines.bgen.frag (the lexer patterns themselves)
  • SPARQLParser.c (the generated parser)

Theoretically, the second part of the grammar dedicated to the Turtle syntax could be broken out into seperate Turtle/N3 parsers which could be built in to RDFLib, removing the current dependency on n3p

I also checked in a test harness that's meant to work with the DAWG test cases:

I'm currently stuck on this particular test case, but working through it. For the most part a majority of the grammar is supported except mathematical expressions and certain case-insensitive variations on the SPARQL operators.

The test harness only checks for parsing, it doesn't evaluate the parsed query against the corresponding set of test data, but can be easily be extended to do so. I'm not sure about the state of those test cases, some have been 'accepted' and some haven't. In addition, I came across a couple that were illegal according to the most recent SPARQL grammar (the bad tests are noted in the test harness). Currently the parser is stand-alone, it doesn't invoke sparql-p for a few reasons:

  • I wanted to get it through parsing the queries in the test case
  • Our integrated version of sparql-p is outdated as there is a more recent version that Ivan has been working on with some improvements that should probably be considered for integration
  • Some of the more complex combinations of Graph Patterns don't seem solvable without re-working / extending the expansion tree solver. I have some ideas about how this could be done (to handle things like nested UNIONS and OPTIONALs) but wanted to get a working parser in first

Using the parser is simple:

from rdflib.sparql.bison import Parse
p = Parse(query,DEBUG)
print p

p is an instance of rdflib.sparql.bison.Query.Query

Most of the parsed objects implement a __repr__ function which prints a 'meaningful' representation recursively down the hierarchy to the lower level objects, so tracing how each __repr__ method is implemented is a good way to determine how to deconstruct the parsed SPARQL query object.

These methods could probably be re-written to echo the SPARQL query right back as a way to

  • Test round-tripping of SPARQL queries
  • Create SPARQL queries by instanciating the rdflib.sparql.bison.* objects and converting them to strings

It's still a work in progress, but I think it's far enough through the test cases that it can handle most of the more common syntax.

Working with BisonGen was a good experience for me as I hadn't done any real work with parser generators since my days at the University of Illinois (class of '99'). There are plenty of good references online for the Flex pattern format as well as Bison itself. I also got some good pointers from AndyS and EricP on #swig.

It also was an excellent way to get familiar with the SPARQL syntax from top to bottom, since every possible nuance of the grammar that may not be evident from the specification had to be addressed. It also generated some comments on inconsistencies in the specification grammar that I I've since redirected to public-rdf-dawg-comments

Chimezie Ogbuji

via Copia

Moving FuXi onto the Development Track

[by Chimezie Ogbuji]

I was recently prompted to consider updating FuXi to use the more recent CVS versions of both Pychinko and rdflib. In particular, I've been itching to get Pychinko working with the new rdflib API – which (as I've mentioned) has had it's API updated significantly to support (amongst other things) support for Notation 3 persistence.

Currently, FuXi works with frozen versions of cwm, rdflib, and Pychiko.

I personally find it more effective to work with reasoning capabilities within the context of a querying language than as a third party software library. This was the original motivation for creating FuXi. Specifically, the process of adding inferred statements, dispatching a prospective query and returning the knowledge base to it's original state is a perfect compromise between classic backward / forward chaining.

It frees up both the query processor and persistence layer from the drudgery of logical inference – a daunting software requirement in its own right. Of course, the price paid in this case is the cumbersome software requirements.

It's well worth noting that such on-demand reasoning also provides a practical way to combat the scalability limitations of RDF persistence.

To these ends, I've updated FuXi to work with the current (CVS) versions of rdflib, 4Suite RDF, and pychinko. It's essentially a re-write and provides 3 major modules:

  • FuXi.py (the core component – a means to fire the pychinko interpreter with facts and rules from rdflib graphs)
  • AgentTools.py (provides utility functions for the parsing and scuttering of remote graphs)
  • VersaFuXiExtensions.py (defines Versa extension functions which provide scutter / reasoning capabilities)

Versa Functions:

reason(expr)

This function takes a Versa expression as a string and evaluates it after executing FuXi using any rules associated with the current graph (via a fuxi:ruleBase property). FuXi (and Pychinko, consequently) use the current graph (and any graphs associated by rdfs:isDefinedBy or rdfs:seeAlso) as the set of facts against which the rules are fired.

class(instances)

This function returns the class(es) – rdfs:Class or owl:Class – of the given list of resources. If the current graph has already been extended to include inferred statements (via the reason function, perhaps), it simply returns the objects of all rdf:type statements made against the resources. Otherwise, it registers, compiles, and fires a set of OWL/RDFS rules (a reasonable subset of owl-rules.n3 and rdfs-rules.n3 bundled with Euler) against the current graph (and any associated graphs) before matching classes to return.

type(klasses)

This essentially overrides the default 4Suite RDF implementation of this 'built-in' Versa function which attempts to apply RDFS entailment rules in brute force fashion. It behaves just like class with the exception that it returns instances of the given classes instead (essentially it performs the reverse operation).

scutter(url,expr,steps=5)

This function attempts to apply some best practices in the interpretation of a network of remote RDF graphs. In particular it uses content negotiation and Scutter principles to parse linked RDF graphs (expressed in either RDF/XML or Notation 3). The main use case for this function (and the primary motivation for writing it) is identity-reasoning within a remsotely-hosted set of RDF Graphs (FOAF smushing for example)

The FuXi software bundle includes a short ontology documenting the two RDF terms: one is used to manage the automated association of a rule base with a graph and the other identifies a graph that has been expanded by inference.

I have yet to write documentation, so this piece essentially attempts to serve that purpose, however included in the bundle are some unittest cases for each of the above functions. It works off a small set of initial facts.

Unfortunately, a majority of the aforementioned software requirement liability has to do with Pychinko's reliance on the SWAP code base. Initially, I began looking for a functional subset to bundle but later decided it was against the spirit of the combined body of work. So, until a better solution surfaces, the SWAP code can be checked out from CVS like so (taken from ):

$ cvs -d:pserver:anonymous@dev.w3.org:/sources/public login
password? anonymous
$ cvs -d:pserver:anonymous@dev.w3.org:/sources/public get 2000/10/swap

The latest 4Suite CVS snapshot can be downloaded from ftp://ftp.4suite.org/pub/cvs-snapshots/4Suite-CVS.tar.gz,
Pychinko can be retrieved from the Mindswap svn repository, and rdflib can also be retrieved from it's svn repository.

Chimezie Ogbuji

via Copia

Store-agnostic REGEX Matching and Thread-safe Transactional Support in rdflib

[by Chimezie Ogbuji]

rdflib now has (checked into svn trunk) support for REGEX matching of RDF terms and thread-safe transactional support. The transactional wrapper provides Atomicity, Isolation, but not Durability (a list of reversal RDF operations is stored on the live instance - so they won't survive a system failure). The store implementation is responsible for Consistency.

The REGEX wrapper provides a REGEXTerm which can be used in any of the RDF term 'slots' with:

It replaces any REGEX term with a wildcard (None) and performs the REGEX match after the query invokation is dispatched to the store implementation it is wrapping.

Both are meant to work with a live instance of an RDF Store, but behave as a proxy for the store (providing REGEX and/or transactional support).

For example:

from rdflib.Graph import ConjunctiveGraph, Graph
from rdflib.store.REGEXMatching import REGEXTerm, REGEXMatching
from rdflib.store.AuditableStorage import AuditableStorage
from rdflib.store import Store
from rdflib import plugin, URIRef, Literal, BNode, RDF

store = plugin.get('IOMemory',Store)()
regexStorage = REGEXMatching(store)
txRegex =  AuditableStorage(regexStorage)
g=Graph(txRegex,identifier=URIRef('http://del.icio.us/rss/chimezie'))
g.load("http://del.icio.us/rss/chimezie")
print len(g),[t for t in g.triples((REGEXTerm('.*zie$'),None,None))]
g.rollback()
print len(g),[t for t in g]

Results in:

492 [(u'http://del.icio.us/chimezie', u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', u'http://purl.org/rss/1.0/channel'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/link', u'http://del.icio.us/chimezie'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/items', u'QQxcRclE1'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/description', u''), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/title', u'del.icio.us/chimezie')] 0 []

[Chimezie Ogbuji]

via Copia

Addressing the RDF Scalability Bottleneck

[by Chimezie Ogbuji]

I've been building RDF persistence stores for some time (it's gone from something of a hobby to the primary responsibility in my current work capacity) and have come to the conclusion that RDF stores will almost always be succeptible to the physical limitations of database scalability.

I recall when I was at the Semantic Technology Conference this spring and asked one of the presenters there what he thought about this problem that all RDF databases face and the reason why most don't function effectively beyond 5-10 million triples. I liked his answer:

It's an engineering problem

Consider the amount of information an adult has stored (by whatever mechanism the human brain uses to persist information) in his or her noggin. We often take it for granted - as we do all other aspects of biology we know very little about - but it's worth considering when thinking about why scalability is a ubiquitous hurdle for RDF databases.

Some basic Graph theory is relevant to this point:

The size of a graph is the number of edges and the order of a graph is the number of nodes within the graph. RDF is a Resource Description Framework (where what we know about resources is key - not so much the resouces themselves) so it's not surprising that RDF graphs will almost always have a much larger size than order. It's also not suprising that most performance analysis made across RDF implementations (such as LargeTripleStores for instance) focus mostly on triple size.

I've been working on a SQL-based persistence schema for RDF content (for rdflib) that is a bit different from the standard approaches taken by most RDBMS implementations of RDF stores I'm familiar with (including those I've written). Each of the tables are prefixed with a SHA1-hashed digest of the identifier associated with the 'localized universe' (AKA,the boundary for a closed world assumptions). The schema is below:

CREATE TABLE %s_asserted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,    
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),    
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB

 CREATE TABLE %s_type_statements (
     member        text not NULL,
     klass         text not NULL,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     INDEX termComb_index (termComb),
     INDEX memberC_index (member(100),klass(100),context(50)),
     INDEX klassC_index (klass(100),context(50)),
     INDEX c_index (context(10))) TYPE=InnoDB"""

 CREATE TABLE %s_quoted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB

The first thing to note is that statements are partitioned into logical groupings:

  • Asserted non rdf:type statements: where all asserted RDF statements where the predicate isn't rdf:type are stored
  • Asserted rdf:type statements: where all asserted rdf:type statements are stored
  • Quoted statements: where all quoted / hypothetical statements are stored

Statement quoting is a Notation 3 concept and an extension of the RDF model for this purpose. The most significant partition is the rdf:type grouping. The idea is to have class membership modeled at the store level instead of at a level above it. RDF graphs are as different as the applications that use them but the primary motivating factor for making this seperation was the assumption that in most RDF graphs a majority of the statements (or a significant portion) would consist of rdf:type statements (statements of class membership).

Class membership can be considered an unstated RDF modelling best practice since it allows an author to say alot about a resource simply by associating it with a class that has it's semantics completely spelled out in a separate, supporting ontology.

The rdf:type table models class membership explicitely with two columns: klass and member. This results in a savings of 43 characters per rdf:type statement. The implementation takes note of the predicate submitted in triple-matching pattern and determines which tables to search.
Consider the following triple pattern:

http://metacognition.info ?predicate ?object

The persistence layer would know it needs to check against the table that persists non rdf:type statements as well as the class membership table. However, patterns that match against a specific predicate (other than rdf:type) or class membership queries only need to check within one partition (or table):

http://metacognition.info rdf:type ?klass

In general, I've noticed that being able to partition your SQL search space (searching within a named graph / context or searching within a single table) goes along way in query response.

The other thing worth noting is the termComb column, which is an integer value representing the 40 unique ways the following RDF terms could appear in a triple:

  • URI Ref
  • Blank Node
  • Formula
  • Literal
  • Variable

I'm certain there are many other possible optimizations that can be made in a SQL schema for RDF triple persistence (there isn't much precedent in this regard - and Oracle has only recently joined the foray) but partitioning rdf:type statements seperately is one such thought I've recently had.

[Uche Ogbuji]

via Copia