Closed World Assumptions, Conjunctive Querying, and Oracle 10g

I promised myself I would write at least one entry related to my experience at the 2006 Semantic Technology Conference here in San Jose, which has been incredibly well attended and well organized. I've found myself wanting to do less talking and more problem solving lately, but I came across an issue that has generated the appropriate amount of motivation.

For some time I've been eagerly monitoring Oracle's recent advances with their latest release (10g R2), which, amongst other things, introduced what I estimate will turn out to be a major step in bridging the gap between the academic dream of the Semantic Web and the reality of the day-to-day problems that are relevant to technologies in that sphere of influence.

But first things first (as the saying goes). Basically, the Oracle 10g R2 RDF implementation supports the logical separation of RDF triples into named Models as well as the ability to query across explicit sets of Models. However, the querying mechanism (implemented as an extension to SQL – SDO_RDF_MATCH) doesn't support querying across the entire fact / knowledge base – i.e., the aggregation of all the named Models contained within.

I like to refer to this kind of query as a Conjunctive Query. The term isn't mine, but it has stuck, and it has made its way into the rdflib Store API. In fact, the rdflib API now has the concept of a Conjunctive Graph, which behaves like a named graph except that its query space is the set of all named graphs in the knowledge base.
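
To make the distinction concrete, here is a minimal sketch using rdflib (the graph names and triples are made up purely for illustration): the same triple pattern asked of one named graph versus the conjunctive graph.

```python
from rdflib import ConjunctiveGraph, Namespace, URIRef

EX = Namespace("http://example.org/")   # made-up namespace for the sketch

cg = ConjunctiveGraph()

# Two logically separate named graphs living in the same store.
clinical = cg.get_context(URIRef("http://example.org/graphs/clinical"))
billing = cg.get_context(URIRef("http://example.org/graphs/billing"))

clinical.add((EX.patient1, EX.diagnosedWith, EX.hypertension))
billing.add((EX.patient1, EX.billedFor, EX.officeVisit))

# Against a single named graph, the pattern only sees that graph's triples ...
print(len(list(clinical.triples((EX.patient1, None, None)))))   # -> 1

# ... while the conjunctive graph's query space is the union of all of them.
print(len(list(cg.triples((EX.patient1, None, None)))))         # -> 2
```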

Now, it would be an easy nitpick to suggest that since the formal RDF Model doesn't provide any guidance on the separation of RDF triples into addressable graphs, implementors cannot be held at fault for deciding not to support such a separation. However, the large body of literature on Named Graphs, as well as the support for querying within named sets of graphs in the more contemporary RDF query languages, does suggest that there is real value in separating raw triples this way and in being able to query across these logical separations transparently.
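
As a side note on that query-language support: SPARQL exposes exactly this capability through its GRAPH keyword, which lets a query range over the named graphs themselves. Another small, hedged sketch with rdflib (again, the graph names and triples are invented):

```python
from rdflib import ConjunctiveGraph, Namespace

EX = Namespace("http://example.org/")
cg = ConjunctiveGraph()

# Two named graphs in the same store, each holding one triple.
cg.get_context(EX.graphA).add((EX.alice, EX.worksOn, EX.projectX))
cg.get_context(EX.graphB).add((EX.bob, EX.worksOn, EX.projectY))

# SPARQL's GRAPH keyword binds each graph's name alongside the triples
# matched inside it, so the query ranges over the named graphs themselves.
for graph_name, person in cg.query("""
        SELECT ?g ?person WHERE {
            GRAPH ?g { ?person <http://example.org/worksOn> ?project }
        }"""):
    print(graph_name, person)
```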

I think the value is twofold: Closed World Assumptions and query performance. Now, the notion of a boundary of known facts will probably raise a red flag amongst semantic web purists, and some may suggest that closed world assumptions cut against the grain of a vision of a massively distributed expert system. For the uninitiated, under the open world assumption the absence of an assertion from your fact base does not necessarily mean that the assertion (or statement) is false. That is, if the statement 'the sky is blue' is not in the knowledge base, you cannot automatically assume that the sky is not blue.

This limitation makes sense where the emphasis is on the distribution of data (a key component of the semantic web vision); however, it essentially abandons the value in applying the formal semantics of RDF (and knowledge representation in general) to closed systems – systems where the data is complete to a certain extent and where it makes sense for the data to live in a query silo.

The most practical example I can think of is the one I work with daily: medical research data that is often subjected to statistical analysis for deducing trends. You can't make suggestions derived from statistical trends in your data if you don't have some minimal confidence that the set of data you are working with is 'complete' enough to answer the questions you set out to ask.

Closed world assumptions also open the door to other heuristic optimizations that are directly relevant to query processors.

Finally, where RDF databases are built on top of SQL stores, being able to partition your query space with an additional indexable constraint (additional, because there are other partitioning techniques that impact scalability and response time) makes a world of difference in a framework that has already been rigorously tuned to take maximal advantage of such rectangular partitioning. To a SQL store implementing an RDF model, the name (or names) of a graph is a low-cardinality, indexable constraint (there will always be fewer graphs than total triples) that can make a difference of several orders of magnitude in overall query response time.
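
To make that concrete, here is a back-of-the-envelope sketch with sqlite3 – not the schema of any particular RDF store (Oracle's included), just an illustration of how a graph/model column plus an index changes the query plan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject   TEXT,
        predicate TEXT,
        object    TEXT,
        model     TEXT   -- the named graph / model each triple belongs to
    )
""")
# The graph/model name is a low-cardinality column, cheap to index.
conn.execute("CREATE INDEX triples_by_model ON triples (model)")

# Scoped to one model, the planner can use the index to discard most rows
# before the triple pattern itself is evaluated ...
scoped = ("SELECT subject, object FROM triples "
          "WHERE model = 'clinical' AND predicate = 'ex:diagnosedWith'")
# ... while a query over the whole knowledge base has only the pattern.
unscoped = ("SELECT subject, object FROM triples "
            "WHERE predicate = 'ex:diagnosedWith'")

for sql in (scoped, unscoped):
    print(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
```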

Named contexts lend themselves quite well to two-part queries, where the first part identifies a set of named graphs (within a conjunctive graph, or known universe) that match certain criteria, and the second part queries only within those matching graphs. Once the query resolver has identified the named graphs, the second part of the query can be dispatched in a very targeted fashion. Any RDF knowledge base that takes advantage of the logical separation that named graphs provide will inevitably find itself being asked such questions.
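
Here is a minimal sketch of that two-part pattern, again with rdflib and invented predicates (EX.sourcedFrom standing in for whatever provenance criterion identifies the interesting graphs):

```python
from rdflib import ConjunctiveGraph, Namespace, Literal

EX = Namespace("http://example.org/")
cg = ConjunctiveGraph()

# Toy setup: each named graph carries a provenance triple about itself.
for name, source in [(EX.graph1, "trial-42"), (EX.graph2, "trial-99")]:
    g = cg.get_context(name)
    g.add((name, EX.sourcedFrom, Literal(source)))
    g.add((EX.patient1, EX.diagnosedWith, EX.hypertension))

# Part one: identify the named graphs that contain the provenance triple.
matching = list(cg.contexts((None, EX.sourcedFrom, Literal("trial-42"))))

# Part two: dispatch the real query only into those graphs.
for graph in matching:
    for subj, _, obj in graph.triples((None, EX.diagnosedWith, None)):
        print(graph.identifier, subj, obj)
```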

Now I've said all this not to berate the current incarnation of Oracle's RDF solution but to take the opportunity to underline the value in a perspective that is often shoved aside by the vigor of semantic web evangelism. To be fair, the inability to dispatch conjunctive queries is pretty much the only criticism of the Oracle 10g R2 RDF model. I've been aware of it for some time, but didn't want to speak to the point directly until it was 'public knowledge.'

The Oracle 10g R2 RDF implementation demonstrates amongst other things:

  • Namespace management
  • Interned identifiers
  • Reification
  • Collections / Containers
  • Forward-chained rule firing (with a default ruleset for RDFS entailment)
  • Off-the-chart volume capability (0.5–5 second response time on 80 million triples – impressive regardless of the circumstances of the benchmark)
  • Native query format (the SPARQL-ish SDO_RDF_MATCH function)

You can check out the DBA manual for the infrastructure and the UniProt benchmarks for performance.

I've too long been frustrated by the inability of 'industry leaders' to put their money where their mouth is when it comes to adoption of 'unproved' technologies. Too often, the biggest impedance to progress from the academic realm to the 'enterprise' realm is politics. People who simply want to solve difficult problems with intelligent and appropriate technologies have their work cut out for them against the inevitable collisions with politics and technological camp warfare (you say microformats, I say architectural forms; you say SOA, I say REST). So for that reason, it makes me somewhat optimistic that a company that truly has everything to lose in doing so decided to take such a remarkable first step. Their recent purchase of Berkeley DB XML (the most widely supported open-source native XML datastore) is yet another example of a bold step towards ubiquitous semi-structured persistence. But please, top it off with support for conjunctive queries.

[Uche Ogbuji]

via Copia