Binary Predicates in FOPL and at Large Volumes

I've wanted to come back to the issue of RDF scalability under a relational model for some time. Earlier, I mentioned a Description Logics (DL) representation technique that would dramatically reduce the amount of space needed for most RDF graphs. I only know of one other RDF store (besides rdflib) that does this. At large volumes, query response time is more susceptible to space efficiency than to pure seek time: at some point along the numerical scale, the time it takes to resolve a query is affected more directly by the size of the knowledge base than by anything else. When you consider the URI lexical grammar, skolemization, seek times, BTrees, and hash tables, even interning (by which I mean the general reliance on uniqueness in crafting identifiers) has little effect on the high-volume metrics of FOPL.

Perhaps something more could be said about the efficiency of DL? I've suggested the possibility of semantic compression (or 'forward consumption', if you think of it as analogous to forward chaining), where whatever can be inferred is never added, or is removed by some intelligent process (perhaps periodically). For example, consider a knowledge base that only stored 'knows' relationships (foaf:knows, perhaps) between people. It would be very redundant to state that two individuals are 'People' (foaf:Person) if they know each other (a 66.6% space saving right there). Couldn't the formality of DL be used to enhance both expressiveness and efficiency, in the same way that invariant representations make our neocortex so much more efficient at logical prediction? If not DL, perhaps at least the formality of a local domain ontology and its rules? I was able to apply the same principle (though not in any formal way you could automate) to improve the speed of a content management knowledge base.
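To make the foaf:knows example concrete, here is a minimal pure-Python sketch of the compression idea, with triples as plain tuples and a single hardcoded rule: (s, foaf:knows, o) implies that both s and o are of rdf:type foaf:Person. The ex:* names are illustrative; a real store would work with interned URIs and an arbitrary rule set.

```python
TYPE, PERSON = "rdf:type", "foaf:Person"

def implied(triples):
    """Triples the foaf:knows rule can re-derive on its own."""
    out = set()
    for s, p, o in triples:
        if p == "foaf:knows":
            out.add((s, TYPE, PERSON))
            out.add((o, TYPE, PERSON))
    return out

def compress(triples):
    """Semantic compression: drop whatever the rule can reconstruct."""
    return set(triples) - implied(triples)

def expand(triples):
    """Re-derive the dropped triples on retrieval."""
    return set(triples) | implied(triples)

facts = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", TYPE, PERSON),
    ("ex:bob", TYPE, PERSON),
}
stored = compress(facts)        # only the knows triple survives: 3 -> 1
assert expand(stored) == facts  # lossless round trip
```

The three original triples compress to one, which is where the 66.6% figure above comes from; the round trip is lossless only because the rule is part of the (local) ontology rather than the data.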

[Uche Ogbuji]

via Copia
2 responses
Only naive guesses, but some things seem likely to me. I would imagine that doing the inference at query time is going to be less efficient than a simple lookup, at least for all but really easy inferences. But where you do have the ontology up front, a lot of the common inferences could perhaps be translated into fairly straightforward lookups. Take the foaf:knows example: if you have a query ?x rdf:type foaf:Person, you could /almost/ hardcode the A and B of A foaf:knows B.
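A rough sketch of that translation, under the same hardcoded rule and illustrative ex:* names as before: the type query never touches an inference engine, it just becomes a second lookup pattern over the compressed store.

```python
def persons(triples):
    """Answer '?x rdf:type foaf:Person' by rewriting it as lookups."""
    found = set()
    for s, p, o in triples:
        if p == "rdf:type" and o == "foaf:Person":
            found.add(s)             # explicitly asserted
        elif p == "foaf:knows":
            found.update((s, o))     # the 'almost hardcoded' A and B
    return found

store = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:carol", "rdf:type", "foaf:Person"),
}
print(sorted(persons(store)))  # ['ex:alice', 'ex:bob', 'ex:carol']
```

With an index on predicate, both branches are plain index scans, which is the point of doing the translation up front.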

Where you don't have such prior knowledge, I suspect that there's potentially more speed to be gained from the totally verbose representation, but laying the indexing on thick.

Given the open world model, it seems reasonable to think that delete/update will be rare compared to insert and select - there may be optimisations possible as a result. Dunno, guesses ;-)
From Graham Klyne:

The trouble with DLs is that, in general, they're lousy for doing inference about instances (as opposed to about classes), and I think the compression idea really needs instance inferences.

Exactly, which is why the 'local' ontology semantics are so valuable. Rete-network reasoning against a small set of facts (the ontological assertions) is so much quicker than pure variable resolution against the fact base. Unfortunately, I've only ever used a forward-chaining reasoner, and the best backward-chaining reasoner I know of (Euler) I haven't ever used (though its home page says it has Python implementations, which is interesting).

On insertion into a database, try backward chaining using these rules and see if the facts are already present. If so, ignore the new one. Then on retrieval (at the receiving end) apply the forward chaining, using some kind of rete network approach. This isn't lightweight to do, but the thesis here is that the volume-transfer is potentially more expensive than the inference, right?

It might even actually be lightweight if the forward chaining is done against the 'local' graph (the current query context; both SPARQL and Versa have such a notion).
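Graham's insert/retrieve protocol can be sketched in the same toy setting, again with the single hypothetical foaf:knows rule standing in for a real rule set. A backward check filters inserts; a forward pass closes the graph on the way out (with one non-recursive rule, a single pass reaches the fixpoint).

```python
def derivable(fact, store):
    """Backward check: can the foaf:knows rule already prove this fact?"""
    s, p, o = fact
    if (p, o) == ("rdf:type", "foaf:Person"):
        return any(q == "foaf:knows" and s in (a, b)
                   for a, q, b in store)
    return False

def insert(fact, store):
    """Ignore facts that are present or derivable (the insertion step)."""
    if fact not in store and not derivable(fact, store):
        store.add(fact)

def retrieve(store):
    """Forward-chain on retrieval, at the receiving end."""
    closed = set(store)
    for s, p, o in store:
        if p == "foaf:knows":
            closed.add((s, "rdf:type", "foaf:Person"))
            closed.add((o, "rdf:type", "foaf:Person"))
    return closed

kb = set()
insert(("ex:alice", "foaf:knows", "ex:bob"), kb)
insert(("ex:alice", "rdf:type", "foaf:Person"), kb)  # derivable: ignored
assert len(kb) == 1
assert ("ex:bob", "rdf:type", "foaf:Person") in retrieve(kb)
```

Only the knows triple ever crosses the wire, which is exactly the bet being made: the volume transfer costs more than the inference.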