Python APIs for the Upper Portions of the SW Layer Cake

[by Chimezie Ogbuji]

So, I haven't written about some recent capabilities I've been developing in FuXi. Ultimately, I would like it to be part of a powerful open-source library for semantic web middle-ware development. The current incarnation of this is python-dlp.

The core inference engine ( a RETE-based production system ) has been stable for some time now. FuXi is comprised of the following major modules:

  • DLP (An implementation of the DLP transformation)
  • Horn (an API for RIF Basic Logic Dialect)
  • Rete (the inference engine)
  • Syntax (Idiomatic APIs for OWL - InfixOWL)

I've written about all but the 2nd item (more on this in another article) and the last in the above list. I've come full circle on APIs for RDF several times, but I now believe that a pure object RDF model (ORM) approach will never be as useful as a object RDF vocabulary model. I.e., an API built for a specific RDF vocablary. The Manchester OWL syntax has quickly become my syntax of choice for OWL and (at the time) I had been toying with an idea of turning it into an API.

Then one day, I had an exchange with Sandro Hawke on IRC about what a Python API for the OWL Abstract Syntax would look like. I had never taken a close look at the abstract syntax until then and immediately came away thinking something more light-weight and idiomatic would be preferable.

I came across a powerful infix operator recipe for Python:

Infix operators

I wrote an initial, standalone module called InfixOWL and put up a wiki which still serves as decent initial documentation for the syntax. I've since moved it into a proper module in FuXi, fixed some bugs and very recently added even more idiomatic APIs.

The module defines the following top-level Python classes:

  • Individual - Common class for 'type' descriptor
  • AnnotatibleTerm(Individual) - Common class for 'label' and 'comment' descriptors
  • Ontology(AnnotatibleTerm) - OWL ontology
  • Class(AnnotatibleTerm) - OWL Class
  • OWLRDFListProxy - Common class for class descriptions composed from boolean operators
  • EnumeratedClass(Class) - Class descriptions consisting of owl:oneOf
  • BooleanClass(Class,OWLRDFListProxy) - Common class for owl:intersectionOf / owl:unionOf descriptions
  • Restriction(Class) - OWL restriction
  • Property(AnnotatibleTerm) - OWL property

Example code speaks much louder than words, so below is a snippet of InfixOWL code which I used to compose (programmatically) the CPR ontology:

CPR   = Namespace('')
INF   = Namespace('')
EDNS  = Namespace('')
DOLCE = Namespace('')
OBI   = Namespace('')
SNAP  = Namespace('')
SPAN  = Namespace('')
REL   = Namespace('')
GALEN = Namespace('')
TIME  = Namespace('')
CYC   = Namespace('')
XML   = Namespace('')
g = Graph()    
g.namespace_manager = namespace_manager
Class.factoryGraph = g
Property.factoryGraph = g
Ontology.factoryGraph = g

cprOntology = Ontology("")
cprOntology.imports = ["",
representationOf = Property(CPR['representation-of'],
... snip ...
person = Class(CPR.person,
... snip ...
clinician = Class(CPR.clinician)
clinician.comment=Literal("A person who plays the clinician role (typically Nurse, Physician / Doctor,etc.)")
#This expressio is equivalent to cpr:clinician rdfs:subClassOf cpr:person
.. snip ..
patientRecord = Class(CPR['patient-record'])
patientRecord.comment=Literal("an electronic document (a representational artifact [REFTERM])  "+
                               "which captures clinically relevant data about a specific patient and "+
                               " is primarily comprised of one or more cpr:clinical-descriptions.")
patientRecord.seeAlso = URIRef("")
patientRecord.subClassOf = \
     CPR['representation-of'] |only| patient,
     REL.OBO_REL_has_proper_part |some| Class(CPR['clinical-description'])]
... snip ...
problemDisj=Class(CPR['pathological-state']) | organismalOccurrent | extraOrganismalOccurrent
problem = Class(CPR['medical-problem'],
                            DOLCE['has-quality'] |some| owlTimeQuality])

After the OWL graph is composed, it can be serialized by simply invoking:


[Chimezie Ogbuji]

via Copia

Comprehensive RDF Query API's for RDFLib

[by Chimezie Ogbuji]

RDFLib's support for SPARQL has come full circle and I wasn't planning on blogging on the developments until they had settled some – and they have. In particular, the last piece was finalizing a set of APIs for querying and result processing that fit well within the framework of RDFLib's various Graph API's. The other issue was for the query APIs to accomodate eventual support for other querying languages (Versa for instance) that are capable of picking up the slack where SPARQL is wanting (transitive closures, for instance – try composing a concise SPARQL query for calculating the transitive closure of a given node along the rdfs:subClassOf property and you'll immediately see what I mean).


Every Graph instance now has a query method through which RDF queries can be dispatched:

def query(self, strOrQuery, initBindings={}, initNs={}, DEBUG=False, processor="sparql")
    Executes a SPARQL query (eventually will support Versa queries with same method) against this Conjunctive Graph
    strOrQuery - Is either a string consisting of the SPARQL query or an instance of rdflib.sparql.bison.Query.Query
    initBindings - A mapping from variable name to an RDFLib term (used for initial bindings for SPARQL query)
    initNS - A mapping from a namespace prefix to an instance of rdflib.Namespace (used for SPARQL query)
    DEBUG - A boolean flag passed on to the SPARQL parser and evaluation engine
    processor - The kind of RDF query (must be 'sparql' until Versa is ported)

The first argument is either a query string or a pre-compiled query object (compiled using the appropriate BisonGen mechanism for the target query language). Pre-compilation can be useful for avoiding redundant parsing overhead for queries that need to be evaluated repeatedly:

from rdflib.sparql.bison import Parse
queryObject = Parse(sparqlString)

The next argument (initBindings) is dictionary that maps variables to their values. Though variables are common to both languages, SPARQL variables differ from Versa queries in that they are string terms in the form “?varName”, wherease in Versa variables are QNames (same as in Xpath). For SPARQL queries the dictionary is expected to be a mapping from variables to RDFLib terms. This is passed on to the SPARQL processor as initial variable bindings.

initNs is yet another top-level parameter for the query processor: a namespace mapping from prefixes to namespace URIs.

The debug flag is pretty self explanatory. When True, it will cause additional print statements to appear for the parsing of the query (triggered by BisonGen) as well as the patterns and constraints passed on to the processor (for SPARQL queries).

Finally, the processor specifies which kind of processor to use to evaluate the query: 'versa' or 'sparql'. Currently (with emphasis on 'currently'), only SPARQL is supported.

Result formats

SPARQL has two result formats (JSON and XML). Thanks to Ivan Herman's recent contribution the SPARQL processor now supports both formats. The query method (above) returns instances of QueryResult, a common class for RDF query results which define the following method:

def serialize(self,format='xml')

The format argument determines which result format to use. For SPARQL queries, the allowable values are: 'graph' – for CONSTRUCT / DESCRIBE queries (in which case a resulting Graph object is returned), 'json',or 'xml'. The resulting object also behaves as an iterator over the bindings for manipulation in the host language (Python).

Versa has it's own set of result formats as well. Primarily there is an XML result format (see: Versa by Example) as well as Python classes for the various internal datatypes: strings,resources,lists,sets,and numbers. So, the eventual idea is for using the same method signature for serializing Versa queries as XML as you would for SPARQL queries.

SPARQL Limitations

The known SPARQL query forms that aren't supported are:

  • DESCRIBE/CONSTRUCT (very temporary)
  • Nested Group Graph Patterns
  • Graph Patterns can only be used (once) by themselves or with OPTIONAL patterns
  • UNION patterns can only be combined with OPTIONAL patterns
  • Outer FILTERs which refer to variables within an OPTIONAL

Alot of the above limitations can be addressed with formal equivalence axioms of SPARQL semantics, such as those mentioned in the recent paper on the complexity and semantics of SPARQL. Since there is very little guidance on the semantics of SPARQL I was left with the option of implementing only those equivalences that seemed obvious (in order to support the patterns in the DAWG test suite):

1) { patternA { patternB } } => { patternA. patternB }
2) { basicGraphPatternA OPTIONAL { .. } basicGraphPatternB }
{ basicGraphPatternA+B OPTIONAL { .. }}

It's still a work in progress but has come quite a long way. CONSTRUCT and DESCRIBE are already supported by the underlying processor, I just need to find some time to hook it up to the query interfaces. Time is something I've been real short of lately.

Chimezie Ogbuji

via Copia

Extension Functionality and Set Manipulation in RDF Query Languages

A recent bit by Andy Seaborne (on Property Functions in ARQ – Jena's query engine) got me thinking about general extension mechanisms for RDF querying languages.
In particular, he mentions two extensions that provide functionality for processing RDF lists and collections which (ironically) coincide with functions I had requested be considered for later generations of Versa.

The difference, in my case, was that the suggestions were for functions that cast RDF lists into Versa lists (or sets) – which are data structures native to Versa that can be processed with certain built-in functions.

Two other extensions I use quite often in Versa (and mentioned briefly in my article) are scope, and scoped-subquery. These have to do with identifying the context of a resource and limiting the evaluation of a query to within a named graph, respectively. Currently, the scoped function returns a list of the names of all graphs in which the resource is asserted as a member of any class (via rdf:type). I could imagine this being expanded to include the names of any graph in which statements about the resource are asserted. scoped-subquery doesn't present much value for a host language that can express queries as limited to a named context.

I also had some thoughts about an extension function mechanism that allowed an undefined function reference (for functions of arity 1 – i.e. functions that take only a single argument) to be interpreted as a request for all the objects of statements where the predicate is the function URI and the subject is the argument

I recently finished writing a SPARQL grammar for BisonGen and hope to conclude that effort (at some point) once I get over a few hurdles. I was pleasantly surprised to find that the grammar for function invocation is pretty much identical for both query languages. Which suggests that there is room for some thought about a common mechanism (or just a common set of extension functionality – similar to the EXSLT effort) for RDF querying or general processing.

CWM has a rich, and well documented set of RDF extensions. The caveat is that the method signatures are restricted to dual input (subject and object) since the built-ins are expressed as RDF triples where the predicate is the name of the function and the subject and object are arguments to is. Nevertheless, it is a good source from which an ERDF specification could be drafted.

My short wish-list of extension functions in such an effort would include:

  • List comprehension (intersection, union, difference, membership, indexing, etc.)
  • Resolving the context for a given node: context(rdfNode) => URI (or BNode?)
  • an is-a(resource) function (equivalent to Versa's type function without the entailment hooks)
  • a class(resource) which is an inverse of is-a
  • Functions for transitive closures and/or traversals (at the very least)
  • A fallback mechanism for undefined functions that allowed them to be interpreted as unary 'predicate functions'

Of course, functions which return lists instead of single resources would be problematic for any host language that can't process lists, but it's just some food for thought.

Chimezie Ogbuji

via Copia

Closed World Assumptions, Conjunctive Querying, and Oracle 10g

I promised myself I would write at least one entry related to my experience at the 2006 Semantic Technology Conference here in San Jose, which has been an incredibly well attended and organized conference. I've found myself wanting to do less talking and more problem solving lately, but I came across an issue that has generated the appropriate amount of motivation.

For some time I've been (eagerly) monitoring Oracle's recent advances with their latest release (10g R2) which (amongst other things) introduced (in my estimation) what will turn out to be a major step in bridging the gap between the academic dream of the Semantic Web and the reality of the day-to-day problems that are relevant to technologies in that sphere of influence.

But first things first (as the saying goes). Basically, the Oracle 10g R2 RDF implementation supports the logical separation of RDF triples into named Models as well as the ability to query across explict sets of Models. However, the querying mechanism (implemented as an extension to SQL – SDO_RDF_MATCH) doesn't support the ability to query across the entire fact / knowledge base – i.e., the aggregation of all the named Models contained within.

I like to refer to this kind of a query as a Conjunctive Query. The term isn't mine, but it has stuck, and has made its way into the rdflib Store API. In fact, the rdflib API now has the concept of a Conjunctive Graph which behaves like a named graph with the exception that the query space is the set of all named graphs in the knowledge base.

Now, it would be an easy nitpick to suggest that since the formal RDF Model doesn't provide any guidance on the separation of RDF triples into addressable graphs, implementors can not be held at fault for deciding not to support such a separation. However, the large body of literature on Named Graphs as well as the support for querying within named sets of graphs in the more contemporary RDF querying languages does suggest that there is real value in separating raw triples this way and in being able to query across these logical separations transparently.

I think the value is twofold: Closed World Assumptions and query performance. Now, the notion of a boundary of known facts, will probably raise a red flag amongst semantic web purists and some may suggest that closed world assumptions cut against the grain of a vision of a massively distributed expert system. For the uninitiated, open world assumptions are where the absence of an assertion in your fact base does not necessarily suggest that the assertion (or statement) isn't true. That is, if the statement 'the sky is blue' is not in the knowledge base, you can not automatically assume that the sky is not blue.

This limitation makes sense where the emphasis is on the distribution of data (a key component of the semantic web vision), however it essentially abandons the value in applying the formal semantics of RDF (and knowledge representation, in general) to closed systems – systems where the data is complete to a certain extent and makes sense to live in a query silo.

The most practical example I can think of is the one I work with daily: medical research data that is often subjected to statistical analysis for deducing trends. You can't make suggestions derived from statistical trends in your data if you don't have some minimal confidence that the set of data you are working with is 'complete' enough to answer the questions you set out to ask.

Closed world assumptions also open the door to other heuristic optimizations that are directly relevant to query processors.

Finally, where RDF databases are built on top of SQL stores, being able to partition your query space into an additional indexable constraint (I say additional, because there are other partitioning techniques that impact scalability and response) makes a world of difference in a framework that has already been rigorously tuned to take maximal advantage of such rectangular partitioning. To a SQL store implementing an RDF model, the name (or names) of a graph is a low cardinality, indexable, constraint (there will always be less graphs than total triples) that can be the difference of several orders of magnitude in the overall query response time.

Named contexts lend themselves quite well to two-part queries where the first part identifies a set of named graphs (within a conjunctive graph or known universe) that match a certain criteria and then query only within those matching graphs. Once the query resolver has identified the named graphs, the second part of the query can be dispatched in a very targeted fashion. Any RDF knowledge base that takes advantage of the logical seperation that named graphs provide will inevitably find itself being asked such questions.

Now I've said all this not to berate the current incarnation of Oracle's RDF solution but to take the opportunity to underline the value in a perspective that is often shoved aside by the vigor of semantic web evangelism. To be fair, the inability to dispatch conjunctive queries is pretty much the only criticism of the Oracle 10g R2 RDF model. I've been aware of it for some time, but didn't want to speak to the point directly until it was 'public knowledge.'

The Oracle 10g R2 RDF implementation demonstrates amongst other things:

  • Namespace management
  • Interned identifiers
  • Reification
  • Collections / Containers
  • Forward-chained rule firing (with a default ruleset for RDFS entailment)
  • Off the chart volume capability (.5 - 5 second response time on 80 Million triples - impressive regardless of the circumstance of the benchmark)
  • Native query format (SPARQL-ish SDORDFMATCH function)

You can check it out the DBA manual for the infrastructure and the uniprot benchmarks for performance.

I've too long been frustrated by the inability of 'industry leaders' to put their money where their mouth is when it comes to adoption of 'unproved' technologies. Too often, the biggest impedance to progress from the academic realm to the 'enterprise' realm is politics. People who simply want to solve difficult problems with intelligent and appropriate technologies have their work cut out for them against the inevitable collisions with politics and technological camp warefare (you say microformats, I say architectural forms, you say SOA, I say REST). So for that reason, it makes me somewhat optimistic that a company that truly has every thing to lose in doing so decided to make such a remarkable first step. Their recent purchase of Berkeley DB XML (the most widely supported open-source Native XML datastore) is yet another example of a bold step towards ubiquitous semi-structured persistence. But please, top it off with support for conjunctive queries.

[Uche Ogbuji]

via Copia

Store-agnostic REGEX Matching and Thread-safe Transactional Support in rdflib

[by Chimezie Ogbuji]

rdflib now has (checked into svn trunk) support for REGEX matching of RDF terms and thread-safe transactional support. The transactional wrapper provides Atomicity, Isolation, but not Durability (a list of reversal RDF operations is stored on the live instance - so they won't survive a system failure). The store implementation is responsible for Consistency.

The REGEX wrapper provides a REGEXTerm which can be used in any of the RDF term 'slots' with:

It replaces any REGEX term with a wildcard (None) and performs the REGEX match after the query invokation is dispatched to the store implementation it is wrapping.

Both are meant to work with a live instance of an RDF Store, but behave as a proxy for the store (providing REGEX and/or transactional support).

For example:

from rdflib.Graph import ConjunctiveGraph, Graph
from import REGEXTerm, REGEXMatching
from import AuditableStorage
from import Store
from rdflib import plugin, URIRef, Literal, BNode, RDF

store = plugin.get('IOMemory',Store)()
regexStorage = REGEXMatching(store)
txRegex =  AuditableStorage(regexStorage)
print len(g),[t for t in g.triples((REGEXTerm('.*zie$'),None,None))]
print len(g),[t for t in g]

Results in:

492 [(u'', u'', u''), (u'', u'', u''), (u'', u'', u'QQxcRclE1'), (u'', u'', u''), (u'', u'', u'')] 0 []

[Chimezie Ogbuji]

via Copia