Binary Predicates in FOPL and at Large Volumes

I've wanted to comeback to the issue of RDF scalability of a relational model for some time (a topic that has been on my mind for some time). Earlier, I mentioned a Description Logics (DL) representation technique that would dramatically reduce the amount of size needed for most RDF graphs. I only know of one other RDF store (besides rdflib) that does this. At large volumes, metrics of query response time are more succeptible to space efficiency than pure seek time. At some point along the numerical scale, there will be a point where the amount of time it takes to resolve a query is more directly affected by the size of the knowledge base than anything else. When you consider the URI lexical grammar, skolemization, seek times, BTrees, and Hash-tables even interning (by that I mean the general reliance on uniqueness in crafting identifiers) has little effect to the high-volume metrics of FOPL.

Perhaps something more could be said about the efficiency of DL? I've suggested the possiblity of semantic compression (or 'forward consumption' if you think of it as analagous to forward chaining) where what can be implied is never added or is removed by some intelligent process (perhaps periodically). For example, consider a knowledge base that only stored 'knows' relationships (foaf:knows, perhaps) between between people. It would be very redundant to state that two individual are 'People' (foaf:Person) if they know each other (66.6% space saving right there). Couldn't the formality of DL be used to both enhance expressiveness as well as efficiency? In the same way that invariant representations make our neocortex so much more efficient at logical prediction? If not DL, perhaps at least the formality of a local domain ontology and it's rules? I was able to apply the same principle (though not in any formal way you could automate) to improve the speed of a content management knowledge base.

[Uche Ogbuji]

via Copia

Closed World Assumptions, Conjunctive Querying, and Oracle 10g

I promised myself I would write at least one entry related to my experience at the 2006 Semantic Technology Conference here in San Jose, which has been an incredibly well attended and organized conference. I've found myself wanting to do less talking and more problem solving lately, but I came across an issue that has generated the appropriate amount of motivation.

For some time I've been (eagerly) monitoring Oracle's recent advances with their latest release (10g R2) which (amongst other things) introduced (in my estimation) what will turn out to be a major step in bridging the gap between the academic dream of the Semantic Web and the reality of the day-to-day problems that are relevant to technologies in that sphere of influence.

But first things first (as the saying goes). Basically, the Oracle 10g R2 RDF implementation supports the logical separation of RDF triples into named Models as well as the ability to query across explict sets of Models. However, the querying mechanism (implemented as an extension to SQL – SDO_RDF_MATCH) doesn't support the ability to query across the entire fact / knowledge base – i.e., the aggregation of all the named Models contained within.

I like to refer to this kind of a query as a Conjunctive Query. The term isn't mine, but it has stuck, and has made its way into the rdflib Store API. In fact, the rdflib API now has the concept of a Conjunctive Graph which behaves like a named graph with the exception that the query space is the set of all named graphs in the knowledge base.

Now, it would be an easy nitpick to suggest that since the formal RDF Model doesn't provide any guidance on the separation of RDF triples into addressable graphs, implementors can not be held at fault for deciding not to support such a separation. However, the large body of literature on Named Graphs as well as the support for querying within named sets of graphs in the more contemporary RDF querying languages does suggest that there is real value in separating raw triples this way and in being able to query across these logical separations transparently.

I think the value is twofold: Closed World Assumptions and query performance. Now, the notion of a boundary of known facts, will probably raise a red flag amongst semantic web purists and some may suggest that closed world assumptions cut against the grain of a vision of a massively distributed expert system. For the uninitiated, open world assumptions are where the absence of an assertion in your fact base does not necessarily suggest that the assertion (or statement) isn't true. That is, if the statement 'the sky is blue' is not in the knowledge base, you can not automatically assume that the sky is not blue.

This limitation makes sense where the emphasis is on the distribution of data (a key component of the semantic web vision), however it essentially abandons the value in applying the formal semantics of RDF (and knowledge representation, in general) to closed systems – systems where the data is complete to a certain extent and makes sense to live in a query silo.

The most practical example I can think of is the one I work with daily: medical research data that is often subjected to statistical analysis for deducing trends. You can't make suggestions derived from statistical trends in your data if you don't have some minimal confidence that the set of data you are working with is 'complete' enough to answer the questions you set out to ask.

Closed world assumptions also open the door to other heuristic optimizations that are directly relevant to query processors.

Finally, where RDF databases are built on top of SQL stores, being able to partition your query space into an additional indexable constraint (I say additional, because there are other partitioning techniques that impact scalability and response) makes a world of difference in a framework that has already been rigorously tuned to take maximal advantage of such rectangular partitioning. To a SQL store implementing an RDF model, the name (or names) of a graph is a low cardinality, indexable, constraint (there will always be less graphs than total triples) that can be the difference of several orders of magnitude in the overall query response time.

Named contexts lend themselves quite well to two-part queries where the first part identifies a set of named graphs (within a conjunctive graph or known universe) that match a certain criteria and then query only within those matching graphs. Once the query resolver has identified the named graphs, the second part of the query can be dispatched in a very targeted fashion. Any RDF knowledge base that takes advantage of the logical seperation that named graphs provide will inevitably find itself being asked such questions.

Now I've said all this not to berate the current incarnation of Oracle's RDF solution but to take the opportunity to underline the value in a perspective that is often shoved aside by the vigor of semantic web evangelism. To be fair, the inability to dispatch conjunctive queries is pretty much the only criticism of the Oracle 10g R2 RDF model. I've been aware of it for some time, but didn't want to speak to the point directly until it was 'public knowledge.'

The Oracle 10g R2 RDF implementation demonstrates amongst other things:

  • Namespace management
  • Interned identifiers
  • Reification
  • Collections / Containers
  • Forward-chained rule firing (with a default ruleset for RDFS entailment)
  • Off the chart volume capability (.5 - 5 second response time on 80 Million triples - impressive regardless of the circumstance of the benchmark)
  • Native query format (SPARQL-ish SDORDFMATCH function)

You can check it out the DBA manual for the infrastructure and the uniprot benchmarks for performance.

I've too long been frustrated by the inability of 'industry leaders' to put their money where their mouth is when it comes to adoption of 'unproved' technologies. Too often, the biggest impedance to progress from the academic realm to the 'enterprise' realm is politics. People who simply want to solve difficult problems with intelligent and appropriate technologies have their work cut out for them against the inevitable collisions with politics and technological camp warefare (you say microformats, I say architectural forms, you say SOA, I say REST). So for that reason, it makes me somewhat optimistic that a company that truly has every thing to lose in doing so decided to make such a remarkable first step. Their recent purchase of Berkeley DB XML (the most widely supported open-source Native XML datastore) is yet another example of a bold step towards ubiquitous semi-structured persistence. But please, top it off with support for conjunctive queries.

[Uche Ogbuji]

via Copia

Addressing the RDF Scalability Bottleneck

[by Chimezie Ogbuji]

I've been building RDF persistence stores for some time (it's gone from something of a hobby to the primary responsibility in my current work capacity) and have come to the conclusion that RDF stores will almost always be succeptible to the physical limitations of database scalability.

I recall when I was at the Semantic Technology Conference this spring and asked one of the presenters there what he thought about this problem that all RDF databases face and the reason why most don't function effectively beyond 5-10 million triples. I liked his answer:

It's an engineering problem

Consider the amount of information an adult has stored (by whatever mechanism the human brain uses to persist information) in his or her noggin. We often take it for granted - as we do all other aspects of biology we know very little about - but it's worth considering when thinking about why scalability is a ubiquitous hurdle for RDF databases.

Some basic Graph theory is relevant to this point:

The size of a graph is the number of edges and the order of a graph is the number of nodes within the graph. RDF is a Resource Description Framework (where what we know about resources is key - not so much the resouces themselves) so it's not surprising that RDF graphs will almost always have a much larger size than order. It's also not suprising that most performance analysis made across RDF implementations (such as LargeTripleStores for instance) focus mostly on triple size.

I've been working on a SQL-based persistence schema for RDF content (for rdflib) that is a bit different from the standard approaches taken by most RDBMS implementations of RDF stores I'm familiar with (including those I've written). Each of the tables are prefixed with a SHA1-hashed digest of the identifier associated with the 'localized universe' (AKA,the boundary for a closed world assumptions). The schema is below:

CREATE TABLE %s_asserted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,    
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),    
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB

 CREATE TABLE %s_type_statements (
     member        text not NULL,
     klass         text not NULL,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     INDEX termComb_index (termComb),
     INDEX memberC_index (member(100),klass(100),context(50)),
     INDEX klassC_index (klass(100),context(50)),
     INDEX c_index (context(10))) TYPE=InnoDB"""

 CREATE TABLE %s_quoted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB

The first thing to note is that statements are partitioned into logical groupings:

  • Asserted non rdf:type statements: where all asserted RDF statements where the predicate isn't rdf:type are stored
  • Asserted rdf:type statements: where all asserted rdf:type statements are stored
  • Quoted statements: where all quoted / hypothetical statements are stored

Statement quoting is a Notation 3 concept and an extension of the RDF model for this purpose. The most significant partition is the rdf:type grouping. The idea is to have class membership modeled at the store level instead of at a level above it. RDF graphs are as different as the applications that use them but the primary motivating factor for making this seperation was the assumption that in most RDF graphs a majority of the statements (or a significant portion) would consist of rdf:type statements (statements of class membership).

Class membership can be considered an unstated RDF modelling best practice since it allows an author to say alot about a resource simply by associating it with a class that has it's semantics completely spelled out in a separate, supporting ontology.

The rdf:type table models class membership explicitely with two columns: klass and member. This results in a savings of 43 characters per rdf:type statement. The implementation takes note of the predicate submitted in triple-matching pattern and determines which tables to search.
Consider the following triple pattern:

http://metacognition.info ?predicate ?object

The persistence layer would know it needs to check against the table that persists non rdf:type statements as well as the class membership table. However, patterns that match against a specific predicate (other than rdf:type) or class membership queries only need to check within one partition (or table):

http://metacognition.info rdf:type ?klass

In general, I've noticed that being able to partition your SQL search space (searching within a named graph / context or searching within a single table) goes along way in query response.

The other thing worth noting is the termComb column, which is an integer value representing the 40 unique ways the following RDF terms could appear in a triple:

  • URI Ref
  • Blank Node
  • Formula
  • Literal
  • Variable

I'm certain there are many other possible optimizations that can be made in a SQL schema for RDF triple persistence (there isn't much precedent in this regard - and Oracle has only recently joined the foray) but partitioning rdf:type statements seperately is one such thought I've recently had.

[Uche Ogbuji]

via Copia