Progress on two Reference Implementations for RETE and GRDDL

Whew! During the moments I was able to spare while at ISWC (I was only there on Monday and Tuesday, unfortunately), I finished up two 'reference' implementations I've been enthusiastically hacking on recently. By reference implementation, I mean an implementation that attempts to follow a specification verbatim, as an exercise to get a better understanding of it.

I still need to write up the engaging conversations I had at ISWC (it was a really impressive conference, even from just the two days' worth I got to see) as well as the presentation I gave on using GRDDL and Web architecture to meet the requirements of Computer-based Patient Records.

FuXi and DLP

The first milestone was with FuXi, which I ended up rewriting completely based on some exchanges with Peter Lin and Charles Young.

This has probably been the most challenging piece of software I've ever written, and I was quite naive in the beginning about my level of understanding of the nuances of RETE. Anyone interested in the formal intersection of Notation 3 / RDF syntax and the RETE algorithm(s) will find the exchanges in the comments of the above post very instructive - and Peter Lin's blog in general. Though he and I have our differences on the value of mapping RETE to RDF/N3, his insights into my efforts have been very helpful.

In the process, I discovered Robert Doorenbos' PhD thesis, "Production Matching for Large Learning Systems", which was incredibly valuable in giving me a comprehensive picture of how RETE (and RETE/UL) could be 'ported' to accommodate Notation 3 reasoning.

The primary motivation (which has led to many sleepless nights and late-night hackery) is what I see as a lack of understanding, within the community of semantic web developers, of the limitations of Tableaux-based reasoning and the common misconception that the major Tableaux-based reasoners (FaCT++, Racer, Pellet, etc.) represent the ceiling of DL reasoning capability.

The reality is that logic programming has been around the block much longer than DL and has much more mature algorithms available (the primary one being RETE). I've invested quite a bit of effort in what I believe will (eventually) demonstrate very large-scale DL reasoning performance that will put Tableaux-based reasoning to shame - if only to make it clear that more investigation into the intersection of LP and DL is crucial for making the dream of the SW a reality.

Of course, there is a limit to which aspects of DL can be expressed as LP rules (this subset is called Description Logic Programming). The 'original' DLP paper does well to explain this limitation, but I believe this subset represents the more commonly used portions of OWL, and that the portions of OWL 1.0 (and OWL 1.1, for that matter) left out by such an intersection will not be missed.
Ivan Herman pointed me to a paper by Herman ter Horst which is quite comprehensive in outlining how this explicit intersection can be expressed axiomatically, and the computational consequences of such an axiomatization. Ivan used this as a guide for his RDFSClosure module.

Not enough (IMHO) has been done to explore this intersection because people are comfy with the confines of non-LP algorithms. The trail (currently somewhat cold) left by the Mindswap work on Pychinko needs to be picked up, followed and improved.

So, I rolled up my sleeves, dug deep and did my best to familiarize myself with the nuances of production system optimization. Most of the hard work has already been done, thanks to Robert Doorenbos' subsetting (and extension) of Charles Forgy's original algorithm. FuXi gets through a large majority of the OWL tests using a ruleset that closely implements what ter Horst lays out in his paper, and does so with impressive times - even with more optimizations pending.

The most recent changes include a command-line interface for launching it:

chimezie@Zion:~/devel/Fuxi$ python Fuxi.py --out=n3 --ruleFacts
--ns=owl=http://www.w3.org/2002/07/owl#
--ns=test=http://metacognition.info/FuXi/DL-SHIOF-test.n3#
--rules=test/DL-SHIOF-test.n3
Time to build production rule (RDFLib): 0.0172629356384 seconds
Time to calculate closure on working memory: 224.906921387 m seconds

@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix test: <http://metacognition.info/FuXi/DL-SHIOF-test.n3#>.

 test:Animal test:in _:XRfZIlKy56,
        _:XRfZIlKy57.

 test:Cosi_fan_tutte a test:DaPonteOperaOfMozart;
    test:in _:XRfZIlKy47,
        _:XRfZIlKy48,
        [].

 test:Don_Giovanni a test:DaPonteOperaOfMozart;
    test:in _:XRfZIlKy47,
        _:XRfZIlKy48.

 .... snip ...

FuXi still doesn't support 'built-ins' (or custom comparisons), but luckily the thesis includes a section on how to implement non-equality testing of rule constraints that should be (relatively) easy to add. The thesis also includes a section on how negated conditions can be implemented (probably the most glaring capability missing from DLP). Finally, Robert's thesis includes a clever mechanism for hashing the alpha network that has yet to be implemented (but is simple enough to add) and should contribute significant performance gains.
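
The alpha-network hashing idea is simple enough to sketch. Below is a minimal, hypothetical illustration (not FuXi's actual code; the class and names are mine) of indexing alpha memories by the constant terms of their patterns, so an incoming triple can be dispatched with dictionary lookups rather than a linear scan of every pattern:

# Hypothetical sketch of hashing the alpha network (after Doorenbos);
# not FuXi's implementation. Variables are strings beginning with '?'.
def pattern_key(pattern):
    """Key a triple pattern by its constant terms; variables are wildcarded as None."""
    return tuple(None if isinstance(term, str) and term.startswith('?') else term
                 for term in pattern)

class HashedAlphaNetwork:
    def __init__(self):
        # (subject-or-None, predicate-or-None, object-or-None) -> alpha memories
        self.index = {}

    def add_pattern(self, pattern, alpha_memory):
        self.index.setdefault(pattern_key(pattern), []).append(alpha_memory)

    def activate(self, triple):
        """Dispatch a triple to every alpha memory whose constant terms match,
        by probing the 8 constant/wildcard combinations."""
        s, p, o = triple
        for key in ((s, p, o), (s, p, None), (s, None, o), (None, p, o),
                    (s, None, None), (None, p, None), (None, None, o),
                    (None, None, None)):
            for memory in self.index.get(key, []):
                memory.append(triple)  # alpha memories here are plain lists

The point is simply that a triple only ever touches the alpha memories that could possibly match it, which is where the performance gain comes from.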

There are other pleasant surprises in the current codebase. The rule compiler can be used to identify inefficiencies in rule patterns, the command-line program can be used to serialize the closure delta (i.e., only the triples inferred from the ruleset and facts), and (my favorite) a Notation 3 ruleset can be exported as a graphviz diagram in order to visualize the rule network. Having 'browsed' various rulesets in this way, I must say it helps in understanding the nuances of optimization when you can see the discrimination network that the triples are propagated through.

I don't have a permanent home for FuXi yet but have applied for a sourceforge project (especially since SF now supports SVN, apparently). So, until then, FuXi can be downloaded from:

GRDDL Client for 4Suite and RDFLib

During the same period, I've also been working on a 'reference' implementation of GRDDL (per the recently released Working Draft) for 4Suite and RDFLib. It's a bit ironic to be doing so, since the 4Suite repository framework has essentially been using GRDDL-like content management mechanisms since its inception (sometime in 2001).
However, I thought doing so would be the perfect opportunity to:

  • Demonstrate how 4Suite can be used with RDFLib (in preparation for the pending deprecation of 4Suite RDF in favor of RDFLib)
  • Build a framework to compose illustrative test cases for the Working Group (of which I'm a member)
  • Completely familiarize myself with the various GRDDL mechanisms

I posted this to the Working Group mailing list and plan to continue working on it. In particular, the nice thing about the convergence of these two projects of mine is that I've been able to think a bit about how both GRDDL and Fuxi could be used to implement efficient, opportunistic programs that extract RDF via GRDDL and explicit links (rdf:seeAlso, owl:imports, rdfs:isDefinedBy) and perform incremental web closures by adding the triples discovered in this way one at a time to a live RETE network.

The RETE algorithm is specifically designed to respond incrementally to changes in a production system's working memory, so crawling the web, extracting RDF (a triple at a time), and reasoning over it in this way (the ultimate semantic web scenario) becomes very real, with sub-second response times to boot. At the very least, reasoning should cease to be much of a bottleneck compared to the actual dereferencing and parsing of RDF from distributed locations. Very exciting prospects. Watch this space.
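
To make that scenario concrete, here is a rough sketch of the crawl-and-reason loop in Python. Everything here is hypothetical: extract_triples stands in for a GRDDL/RDF extraction step, and the network object is assumed to expose an add_fact method for feeding one triple at a time into a live RETE network (neither is an actual FuXi or GRDDL-client API).

from collections import deque

def extract_triples(url):
    """Placeholder: dereference url, apply any GRDDL transforms or RDF parsing,
    and return (triples, linked_urls), where linked_urls come from
    rdfs:seeAlso / owl:imports / rdfs:isDefinedBy style links."""
    return [], []

def crawl_and_reason(seed_url, network, max_docs=50):
    """Breadth-first crawl that feeds each extracted triple into a live
    RETE network as it is discovered (an incremental web closure)."""
    seen, queue = {seed_url}, deque([seed_url])
    while queue and len(seen) <= max_docs:
        url = queue.popleft()
        triples, links = extract_triples(url)
        for triple in triples:
            network.add_fact(triple)  # hypothetical: one token at a time
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)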

Chimezie Ogbuji

via Copia

A Relational Model for FOL Persistence

A short while ago I was rather engaged in investigating the most efficient way to persist RDF on Relational Database Management Systems. One of the outcomes of this effort that I have yet to write about is a relational model for Notation 3 abstract syntax and a fully functioning implementation - which is now part of RDFLib's MySQL drivers.

It's written with Soft4Science's SciWriter and seems to render natively in Firefox alone (I haven't tried any other browser).

Originally, I kept coming at it from a pure Computer Science approach (programming and data structures) but eventually had to roll up my sleeves and get down to the formal logic level (i.e., the Deconstructionist, Computer Engineer approach).

Partitioning the KR Space

The first method, and the one with the most impact, was separating Assertional Box statements (statements of class membership) from the rest of the Knowledge Base. When I say Knowledge Base, I mean a 'named' aggregation of all the named graphs in an RDF database. Partitioning the table space shortens indices and reduces the average number of rows that need to be scanned, even in the worst-case scenario for a SQL optimizer. The nature of RDF data (at the syntactic level) is a major factor: RDF is a Description Logics-oriented representation and thus relies heavily on statements of class membership.

The relational model is all about representing everything as specific relations, and the 'instantiation' relationship is a perfect candidate for a database table.

Eventually, it made sense to create additional table partitions for:

  • RDF statements between resources (where the object is not an RDF Literal).
  • RDF's equivalent to EAV statements (where the object is a value, i.e., an RDF Literal).

Matching triple patterns against these partitions can be expressed using a decision tree which accommodates every combination of RDF terms. For example, the triple pattern:

?entity foaf:name "Ikenna"

would only require a scan through the indices for the EAV-type RDF statements (or the whole table if necessary - but that decision is up to the underlying SQL optimizer).
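
A rough sketch of that dispatch logic, with illustrative partition names (the actual decision tree in the RDFLib MySQL drivers is more involved):

from rdflib import RDF, Literal

ABOX, LITERAL_PROPERTIES, RELATIONS = 'associativeBox', 'literalProperties', 'relations'

def candidate_partitions(subject, predicate, obj):
    """Decide which table partitions a triple pattern must be matched against;
    None stands for an unbound term."""
    if predicate == RDF.type:
        return [ABOX]                    # statements of class membership
    if isinstance(obj, Literal):
        return [LITERAL_PROPERTIES]      # EAV-type statements
    if obj is not None:
        return [RELATIONS]               # statements between resources
    # both predicate and object unbound: every partition must be consulted
    return ([ABOX, LITERAL_PROPERTIES, RELATIONS] if predicate is None
            else [LITERAL_PROPERTIES, RELATIONS])

# e.g. candidate_partitions(None, foaf_name, Literal("Ikenna")), where foaf_name
# is the foaf:name URIRef, returns only the EAV partition.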

Using Term Type Enumerations

The second method involves the use of the enumeration of all the term types as an additional column whose indices are also available for a SQL query optimizer. That is:

ANY_TERM = ['U','B','F','V','L']

The terms can be partitioned into the exact allowable set for certain kinds of RDF terms:

ANY_TERM = ['U','B','F','V','L']
CONTEXT_TERMS   = ['U','B','F']
IDENTIFIER_TERMS   = ['U','B']
GROUND_IDENTIFIERS = ['U']
NON_LITERALS = ['U','B','F','V']
CLASS_TERMS = ['U','B','V']
PREDICATE_NAMES = ['U','V']

NAMED_BINARY_RELATION_PREDICATES = GROUND_IDENTIFIERS
NAMED_BINARY_RELATION_OBJECTS    = ['U','B','L']

NAMED_LITERAL_PREDICATES = GROUND_IDENTIFIERS
NAMED_LITERAL_OBJECTS    = ['L']

ASSOCIATIVE_BOX_CLASSES    = GROUND_IDENTIFIERS

For example, the object term of an EAV-type RDF statement doesn't need an associated column for the kind of term it is (the relation is explicitly defined as those RDF statements where the object is a Literal - 'L').
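
To illustrate the role of these enumerations, here is a small sketch (illustrative only, not the actual RDFLib MySQL driver code) of how they might be turned into column definitions, with the column omitted entirely wherever the partition already fixes the term type:

# Sketch: derive a MySQL term-type column from the enumerations above.
def term_type_column(slot_name, allowed_types):
    """Return an ENUM column clause for a term slot, or None when the
    partition fixes the term type (e.g. literal objects are always 'L')."""
    if len(allowed_types) == 1:
        return None  # no column needed; the relation itself encodes the type
    values = ','.join("'%s'" % t for t in allowed_types)
    return "%s_term ENUM(%s) NOT NULL" % (slot_name, values)

# term_type_column('object', ['L'])           -> None
# term_type_column('context', ['U','B','F'])  -> "context_term ENUM('U','B','F') NOT NULL"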

Efficient Skolemization with Hashing

Finally, thanks to Benjamin Nowack's related efforts with ARC (a PHP-based implementation of an RDF / SPARQL storage system), Mark Nottingham's suggestion, and an earlier paper by Stephen Harris and Nicholas Gibbins (3store: Efficient Bulk RDF Storage), a final method of using a half-hash (half of an MD5 hash) of the RDF identifiers in the 'statement' tables was employed. The statement tables each use an unsigned MySQL BIGINT to encode the half-hash in base 10, used as a foreign key into two separate tables:

  • A table for identifiers (with a column that enumerated the kind of identifier it was)
  • A table for literal values

The key to both tables was the unsigned 64-bit integer (the first 16 hexadecimal digits of the MD5 digest) representing the half-hash.

This of course introduces a possibility of collision (due to the reduced hash size), but hashing the identifier along with its term type further dilutes the lexical space and reduces the collision risk. This latter part is still a theory I haven't formally proven (or disproven) but hope to. At the maximum volume tested (around 20 million RDF assertions) I can resolve a single triple pattern in 8 seconds on an SGI machine with no collisions - the implementation includes a collision detection mechanism (disabled by default).
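
For the curious, the half-hash itself is trivial to compute. A minimal sketch of the idea (my own rendering, not the driver's exact code), hashing the term type together with the lexical value and keeping the first 64 bits of the MD5 digest as an unsigned integer that fits a MySQL BIGINT:

import hashlib

def half_md5_key(term_type, lexical_value):
    """Return an unsigned 64-bit key: the first 16 hex digits of
    md5(term_type + lexical_value), interpreted as a base-16 integer."""
    digest = hashlib.md5((term_type + lexical_value).encode('utf-8')).hexdigest()
    return int(digest[:16], 16)  # 0 .. 2**64-1, fits an unsigned BIGINT

# e.g. half_md5_key('U', 'http://xmlns.com/foaf/0.1/name')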

The implementation includes all the magic needed to generate SQL statements to create, query, and manage indices for the tables in the relational model. It does this from a Python model that encapsulates the relational model and methods to carry out the various SQL-level actions needed by the underlying DBMS.

For me, it has satisfied my need for an efficient, open-source RDBMS-backed store upon which large volumes of RDF can be persisted within named graphs, with the ability to persist Notation 3 formulae separately (consistent with Notation 3 semantics).

I called the Python module FOPLRelationModel because, although it is specifically a relational model for Notation 3 syntax, it covers much of what is required for the syntactic representation of First Order Logic in general.

Chimezie Ogbuji

via Copia

Patterns and Optimizations for RDF Queries over Named Graph Aggregates

In a previous post I used the term 'Conjunctive Query' to refer to a kind of RDF query pattern over an aggregation of named graphs. However, the term (apparently) has already-established roots in database querying and a different meaning than what I intended. It's a pattern I have come across often and is, for me, a major requirement for an RDF query language, so I'll try to explain by example.

Consider two characters, King (Wen) and his heir / son (Wu) of the Zhou Dynasty. Let's say they each have a FOAF graph about themselves and the people they know within a larger database which holds the FOAF graphs of every historical character in literature.

The FOAF graphs for both Wen and Wu are below (each preceded by the name of its graph):

<urn:literature:characters:KingWen>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wen_of_Zhou> a :Person;
    :name "King Wen";
    :mbox <mailto:kingWen@historicalcharacter.com>;
    rel:parentOf [ a :Person; :mbox <mailto:kingWu@historicalcharacter.com> ].

<urn:literature:characters:KingWu>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wu_of_Zhou> a :Person;
    :name "King Wu";
    :mbox <mailto:kingWu@historicalcharacter.com>;
    rel:childOf [ a :Person; :mbox <mailto:kingWen@historicalcharacter.com> ].

In each case, Wikipedia URLs are used as identifiers for each historical character. There are better ways for using Wikipedia URLs within RDF, but we'll table that for another conversation.

Now let's say a third party reads a few stories about "King Wen" and finds out he has a son; however, he/she doesn't know the son's name or the URL of either King Wen or his son. If this person wants to use the database to find out about King Wen's son, with a reasonable query response time, he/she has a few things going for him/her:

  1. foaf:mbox is an owl:InverseFunctionalProperty and so can be used for uniquely identifying people in the database.
  2. The database is organized such that all the out-going relationships (between foaf:Persons – foaf:knows, rel:childOf, rel:parentOf, etc..) of the same person are asserted in the FOAF graph associated with that person and nowhere else.
    So, the relationship between King Wen and his son, expressed with the term rel:parentOf, will only be asserted in
    urn:literature:characters:KingWen.

Yes, the idea of a character from an ancient civilization with an email address is a bit cheeky, but foaf:mbox is the only inverse functional property in FOAF to use with this example, so bear with me.

Now, both Versa and SPARQL support restricting queries with the explicit name of a graph, but there are no constructs for determining all the contexts of an RDF triple or:

The names of all the graphs in which a particular statement (or statements matching a specific pattern) are asserted.

This is necessary for a query plan that wishes to take advantage of [2]. Once we know the name of the graph in which all statements about King Wen are asserted, we can limit all subsequent queries about King Wen to that same graph without having to query across the entire database.

Similarly, once we know the email of King Wen's son we can locate the other graphs with assertions about this same email address (knowing they refer to the same person [1]) and query within them for the URL and name of King Wen's son. This is a significant optimization opportunity and key to this query pattern.

I can't speak for other RDF implementations, but RDFLib has a mechanism for this at the API level: a method called quads((subject, predicate, object)) which takes a triple pattern and returns 4-tuples corresponding to all the triples (across the database) that match the pattern, along with the graph each matching triple is asserted in:

for s, p, o, containingGraph in aConjunctiveGraph.quads((s, p, o)):
  ... do something with containingGraph ...
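
As a concrete (if simplified) rdflib sketch of the optimization plan described earlier - find the graph holding King Wen's outgoing relationships, stay inside it, and then jump to the son's graph via the inverse-functional foaf:mbox - assuming db is the populated ConjunctiveGraph (API details may vary across rdflib versions):

from rdflib import ConjunctiveGraph, Literal, Namespace

FOAF = Namespace('http://xmlns.com/foaf/0.1/')
REL = Namespace('http://purl.org/vocab/relationship/')

db = ConjunctiveGraph()  # assume this is the populated store of FOAF graphs

# 1. Locate the graph containing King Wen's outgoing statements.
for wen, _, _, wen_graph in db.quads((None, FOAF.name, Literal("King Wen"))):
    # 2. All further questions about King Wen stay inside wen_graph.
    for _, _, child in wen_graph.triples((wen, REL.parentOf, None)):
        for mbox in wen_graph.objects(child, FOAF.mbox):
            # 3. Use the (inverse functional) mbox to find the son's graph.
            for son, _, _, wu_graph in db.quads((None, FOAF.mbox, mbox)):
                if wu_graph.identifier != wen_graph.identifier:
                    print(son, wu_graph.identifier)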

It's likely that most other QuadStores have similar mechanisms and given the great value in optimizing queries across large aggregations of named RDF graphs, it's a strong indication that RDF query languages should provide the means to express such a mechanism.

Most of what is needed is already there (in both Versa and SPARQL). Consider a SPARQL extension function which returns a boolean indicating whether the given triple pattern is asserted in a graph with the given name:

rdfg:AssertedIn(?subj,?pred,?obj,?graphIdentifier)

We can then get the email of King Wen's son efficiently with:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?mbox
WHERE {
    GRAPH ?foafGraph {
      ?kingWen :name "King Wen";
                       rel:parentOf [ a :Person; :mbox ?mbox ].
    }  
     FILTER (rdfg:AssertedIn(?kingWen, :name, "King Wen", ?foafGraph)).
}

Now, it is worth noting that this mechanism can be supported explicitly by asserting provenance statements associating the people the graphs are about with the graph identifiers themselves, such as:

<urn:literature:characters:KingWen> 
  :primaryTopic <http://en.wikipedia.org/wiki/King_Wen_of_Zhou>.

However, I think that the relationship between an RDF triple and the graph in which it is asserted, although currently outside the scope of the RDF model, should have its semantics outlined in the RDF abstract syntax instead of being expressed with terms in an RDF vocabulary. The demonstrated value for RDF query optimization makes for a strong argument:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?kingWu ?sonName
WHERE {
    GRAPH ?wenGraph {
      ?kingWen :name "King Wen";
                       :mbox ?wenMbox;
                       rel:parentOf [ a :Person; :mbox ?wuMbox ].
    }  
    FILTER (rdfg:AssertedIn(?kingWen, :name, "King Wen", ?wenGraph)).
    GRAPH ?wuGraph {
      ?kingWu :name ?sonName;
                     :mbox ?wuMbox;
                     rel:childOf [ a :Person; :mbox ?wenMbox  ].
    }  
     FILTER (rdfg:AssertedIn(?kingWu, :name, ?sonName, ?wuGraph)).
}

Generally, this pattern is any two-part RDF query across a database (a collection of multiple named graphs) where the first part is scoped over the entire database and identifies terms that are local to a specific named graph, and the second part is scoped to that named graph.

Chimezie Ogbuji

via Copia

Mapping the Rete algorithm to FOL and then to RDF/N3

Well, I was hoping to hold off on writing about the ongoing work I've been doing with FuXi until I could get a decent test suite working, but I've been engaged in several threads (some older than others) that left me wanting to elaborate a bit.

There is already a well established precedent with Python/N3/RDF reasoners (Euler, CWM, and Pychinko). FuXi used to rely on Pychinko, but I decided to write a Rete implementation for N3/RDF from scratch - trying to leverage the host language idioms (hashing, mappings, containers, etc..) as much as possible for areas where it could make a difference in rule evaluation and compilation.

What I have so far is more Rete-based than a pure Rete implementation, but the difference comes mostly from the impedance between the representation components in the original algorithm (which are very influenced by FOL and Knowledge Representation in general) and those in the semantic web technology stack.

Often with RDF/OWL there is more talk than necessary, so I'll get right to the meat of the semantic mapping I've been using. This assumes some familiarity with the original algorithm. Some references:

Tokens

The working memory of the network is fed by an N3 graph. Tokens represent the propagation of RDF triples (with no variables or formula identifiers) from the source graph through the rule network. They can represent token additions (where triples are added to the graph - in which case the live network can be associated with a live RDF graph) and token removals (where triples are removed from the source graph). When a token passes an alpha node's intra-element test (see below), it is passed on with a substitution / mapping of the variables in the pattern to the corresponding terms in the triple. This variable substitution is used to check for consistent variable bindings across beta nodes.
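
A token, then, carries little more than the triple, a flag for addition versus removal, and the variable bindings accumulated so far. A minimal illustrative sketch (not FuXi's actual class):

class Token:
    """A triple propagating through the rule network, flagged as an
    addition or a removal, carrying the variable bindings made so far."""
    def __init__(self, triple, is_addition=True, bindings=None):
        self.triple = triple              # (subject, predicate, object)
        self.is_addition = is_addition    # False for a token removal
        self.bindings = bindings or {}    # e.g. {'?member': <some term>}

    def extend(self, new_bindings):
        """Return a new token whose bindings also include new_bindings."""
        merged = dict(self.bindings)
        merged.update(new_bindings)
        return Token(self.triple, self.is_addition, merged)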

ObjectType Nodes and Working Memory

ObjectType nodes can be considered equivalent to the test for class membership (instance checking, in Description Logics), and are therefore equivalent to the alpha node RDF pattern:

?member rdf:type ?klass.

'Classic' Alpha node patterns (the one below is taken directly from the original paper) map to multiple RDF-triple alpha node patterns:

(Expression ^Op X ^Arg2 Y)

would be equivalent to the following triple patterns (the multiplicative factor comes from RDF assertions being limited to binary predicates):

  • ?Obj rdf:type Expression
  • ?Obj Op ?X
  • ?Obj Arg2 ?Y

Alpha Nodes

Alpha nodes correspond to patterns in rules, which can be:

  1. Triple patterns in N3 rules
  2. N-ary functions.

Alpha node intra-element tests have a 'default' mechanism for matching triple patterns, or they exhibit the behavior associated with a registered set of N-ary functions - the core set coincides with those used by CWM/Euler/Pychinko (often called N3 built-ins). FuXi will support an extension mechanism for registering additional N-ary N3 functions by name, associating each with a Python function that implements the constraint, in a similar fashion to SPARQL extension functions. N-ary functions are automatically propagated through the network so they can participate in beta node activation (inter-element testing in Rete parlance) alongside regular triple patterns, using the accumulated bindings to determine the arguments for the functions.

The default mechanism is for equality of non-variable terms (URIs).
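
In code, the default intra-element test amounts to comparing the ground terms of the pattern with the triple and binding the variables; a simplified sketch (FuXi's actual code differs in detail):

def is_variable(term):
    return isinstance(term, str) and term.startswith('?')

def intra_element_test(pattern, triple):
    """Default alpha node test: ground terms must match exactly, variables
    bind to the corresponding triple terms. Returns a bindings dict, or
    None when the triple fails the test."""
    bindings = {}
    for pattern_term, triple_term in zip(pattern, triple):
        if is_variable(pattern_term):
            if bindings.get(pattern_term, triple_term) != triple_term:
                return None   # same variable bound to two different terms
            bindings[pattern_term] = triple_term
        elif pattern_term != triple_term:
            return None       # ground term mismatch
    return bindings

# e.g. intra_element_test(('?member', 'rdf:type', '?klass'),
#                         ('test:Lion', 'rdf:type', 'test:Animal'))
# -> {'?member': 'test:Lion', '?klass': 'test:Animal'}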

Beta Nodes

Beta nodes are pretty much verbatim Rete, in that they check for consistent variable substitution between their left and right memories. This can be considered similar to the unification routine common to both forward- and backward-chaining algorithms which make use of the Generalized Modus Ponens rule. The difference is that the sentences aren't being made to look the same; rather, the existing variable substitutions are checked for consistency. Perhaps there is some merit in this similarity that would make using a Rete network to facilitate backward chaining and proof generation an interesting possibility, but that has yet to be seen.
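
The inter-element test itself is just a check that two sets of bindings agree on the variables they share, as in this simplified sketch:

def consistent_bindings(left, right):
    """Beta node join test: two binding dicts are compatible when every
    variable they share is bound to the same term. Returns the merged
    bindings, or None on conflict."""
    for variable in set(left) & set(right):
        if left[variable] != right[variable]:
            return None
    merged = dict(left)
    merged.update(right)
    return merged

# consistent_bindings({'?x': 'ex:Wen'}, {'?x': 'ex:Wen', '?y': 'ex:Wu'})
# -> {'?x': 'ex:Wen', '?y': 'ex:Wu'}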

Terminal Nodes

These correspond to the end of the LHS of an N3 rule and are associated with the RHS; when 'activated' they 'fire' the rule, apply the propagated variable substitutions, and add the newly inferred triples to the network and to the source graph.

Testing and Visualizing RDF/N3 Rete Networks

I've been able to adequately test the compilation process (the first of two parts in the original algorithm) using a visual aid. I've been developing a library for generating Boost Graph Library DiGraphs from RDFLib graphs, called Kaleidos. The value lies in generating GraphViz diagrams, as well as in access to a whole slew of graph heuristics / algorithms that could be immensely useful for RDF graph analysis and N3 rule network analysis:

  • Breadth First Search
  • Depth First Search
  • Uniform Cost Search
  • Dijkstra's Shortest Paths
  • Bellman-Ford Shortest Paths
  • Johnson's All-Pairs Shortest Paths
  • Kruskal's Minimum Spanning Tree
  • Prim's Minimum Spanning Tree
  • Connected Components
  • Strongly Connected Components
  • Dynamic Connected Components (using Disjoint Sets)
  • Topological Sort
  • Transpose
  • Reverse Cuthill Mckee Ordering
  • Smallest Last Vertex Ordering
  • Sequential Vertex Coloring

Using Kaleidos, I'm able to generate visual diagrams of Rete networks compiled from RDF/OWL/N3 rule sets.

However, the heavy cost of using BGL is the compilation of BGL and BGL-Python, which is involved if done from source.

Chimezie Ogbuji

via Copia

Practical Temporal Reasoning with Notation 3

Recently, Dan Brickley expressed an interest in the extent to which Bioinformatic research efforts are leveraging RDF for temporal reasoning (and patient healthcare record integration - in general). The thread on the value of modeling temporal relations explicitly versus relying on them being built into core RDF semantics left me feeling like a concrete example was in order.

We have a large (3500+ assertions) OWL Full ontology describing all the data we collect about Cardiothoracic procedures (the primary purpose of our database as currently constituted – in a relational model). There are several high-level classes we use to model concepts that, though core to our model, can be thought of as general enough for a common upper ontology for patient data.

One of the classes is ptrec:TemporalData (from here on out, I'll be using the ptrec prefix to describe vocabulary terms in our ontology) which is the ancestor of all classes that are expressed on an axis of time. We achieve a level of precision in modeling data on a temporal axis that enhances the kind of statistical analysis we perform on a daily basis.

In particular we use three variables:

  • ptrec:startDT (xs:dateTime)
  • ptrec:stopDT (xs:dateTime)
  • ptrec:instantDT (xs:dateTime)

The first two are used to describe an explicit (and 'proper') interval for an event in a patient record. This is often the case where the event in question only had a date associated with it. The latter variable is used when the event is instantaneous and the associated date / time is known.

The biggest challenge isn't simply the importance of time in asking questions of our data, but of temporal factors that are keyed off specific, moving points of reference. For example, consider a case study on the effects of administering a medication within X days of a specific procedure. The qualifying procedure is key to the observations we wish to make and behaves as a temporal anchor. Another case study interested in the effects of administering the same medication, but with respect to a different procedure, should be expected to rely on the same temporal logic - but keyed off a different point in time. However, by being explicit about how we place temporal data on a time axis (as instants or intervals) we can outline a logic for general temporal reasoning that can be used by either case study.

Linking into an OWL time ontology, we can set up some simple Notation 3 rules for inferring interval relationships to aid such questions:

#Inferring before and after temporal relationships (between instants and intervals alike)
{?a a ptrec:TemporalData;
    ptrec:instantDT ?timeA. 
 ?b a ptrec:TemporalData;
    ptrec:instantDT ?timeB. ?timeA str:greaterThan ?timeB} 

         => {?a time:intAfter ?b.?b time:intBefore ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB. ?startTimeA str:greaterThan ?stopTimeB} 

         => {?a time:intAfter ?b.?b time:intBefore ?a}

#Inferring during and contains temporal relationships (between proper intervals)
#Since there is no str:greaterThanOrEqual CWM function, the various permutations
#are spelled out explicitly
{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:lessThan ?startTimeB. ?stopTimeA str:greaterThan ?stopTimeB} 

         => {?a time:intContains ?b.?b time:intDuring ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:equalIgnoringCase ?startTimeB. ?stopTimeA str:greaterThan ?stopTimeB} 

     => {?a time:intContains ?b.?b time:intDuring ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:lessThan ?startTimeB. ?stopTimeA str:equalIgnoringCase ?stopTimeB} 

     => {?a time:intContains ?b.?b time:intDuring ?a}

Notice that xs:dateTime values are ordered the same way lexically (as Unicode strings) as they are temporally. This allows us to rely on str:lessThan and str:greaterThan for determining interval intersection and overlap.
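
A two-line Python illustration of why this works: zero-padded ISO 8601 xs:dateTime strings (in a common timezone) sort lexically in exactly the same order as the instants they denote, so plain string comparison is sufficient.

admitted = "2005-03-01T09:30:00"
operated = "2005-03-15T07:00:00"

# Lexical (string) comparison agrees with temporal order for zero-padded
# ISO 8601 dateTimes that share a timezone.
assert admitted < operated
assert sorted([operated, admitted]) == [admitted, operated]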

Terms such as 'preoperative' (which refer to events that occurred before a specific procedure / operation) and 'postoperative' (events that occurred after a specific procedure / operation), which are core to general medical research nomenclature, can be tied directly into this logic:

{?a a ptrec:TemporalData.  ?b a ptrec:Operation. ?a time:intBefore ?b}
   => {?a ptrec:preOperativeWRT ?b}

{?a a ptrec:TemporalData.  ?b a ptrec:Operation. ?a time:intAfter ?b}
   => {?a ptrec:postOperativeWRT ?b}

Here we introduce two terms (ptrec:preOperativeWRT and ptrec:postOperativeWRT) which relate temporal data with an operation in the same patient record. Using interval relationships as a foundation you can link in domain-specific, temporal vocabulary into your temporal reasoning model, and rely on a reasoner to setup a framework for temporal reasoning.

Imagine the value in using a backward-chaining prover (such as Euler) to logically demonstrate exactly why a specific medication (associated with the date when it was administered) is considered to be preoperative with respect to a qualifying procedure. This would complement the statistical analysis of a case study quite nicely with formal logical proof.

Now, it's worth noting that such a framework (as it currently stands) doesn't allow precision of interval relationships beyond simple intersection and overlap. For instance, in most cases you would be interested primarily in medication administered within a specific length of time. This doesn't really impact the above framework, since it is no more than a functional requirement to be able to perform calendar math. Imagine if the built-in properties of CWM were expanded to include functions for performing date math: for instance, a time:addDT function that adds an xs:duration to an xs:dateTime (used in the rule below).

With such a function we can expand our logical framework to include more explicit temporal relationships.
For example, if we only wanted medications administered 30 days prior to an operation to be considered 'preoperative':

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:Operation;
    ptrec:startDT ?opStartTime;
    ptrec:stopDT ?opStopTime.  
 ?a time:intBefore ?b.
 (?opStartTime "-P30D") time:addDT ?preOpMin. ?stopTimeA str:lessThan ?preOpMin}
    => {?a ptrec:preOperativeWRT ?b}

It's worth noting that such an addition (to facilitate calendar math) would be quite useful as a general extension for RDF processors.

For the most part, I think a majority of the requirements needed for temporal reasoning (in any domain) can be accommodated by explicit modeling, because FOPL (the foundation upon which RDF is built) was designed to be expressive enough to represent all human concepts.

Chimezie Ogbuji

via Copia

RDF IRC Agent - Emeka

I've recently been working on an IRC bot (based on Sean Palmer's phenny) called Emeka which is meant as a tool for demonstrating Versa and other related RDF facilities. Currently, it supports two functions:

  • .query <arbitrary URI> " .. Versa Query .. "
  • .query-deep <arbitrary URI> steps " .. Versa Query .. "

The first causes Emeka to parse the RDF document (associated with the given URI) as RDF/XML and then as N3 (if the first attempt fails). He then evaluates the submitted Versa query against the graph and responds with the result. The second function does the same, except that it recursively loads RDF graphs (following rdfs:seeAlso statements) N times, where N is the second argument. This is useful for extracting FOAF communities from a starting document (which was my original motivation for it).

By default Emeka has the following namespace prefixes bound:

daml,rdf,rdfs,owl,xsd,log (n3's log), dc,rss,foaf

Emeka is a work in progress and is currently idling in #swhack and #4suite (as we speak), with #foaf and #swig to come eventually. Some ideas for other services available from this bot:

  • augmenting its default namespace mapping
  • stored queries (for example: query for retrieving the latest rss:item in a feed)
  • Rule invocation (through FuXi's prospective-query function)
  • Interactive question and example demonstration of Versa function(s)
  • More sophisticated interaction with Del.icio.us RSS feeds (for web page cataloging)

Other suggestions are welcome.

See the #swhack logs for an example.

Chimezie Ogbuji

via Copia

FuXi: FOAFed and DOAPed

I just upgraded and repackaged FuXi (v0.7): added some extra prints in the Versa extension function, added a 'test' directory to the source tree with an appropriate example of how FuXi could be used to make a prospective query against OWL rules and a source graph, created a DOAP instance for FuXi, a FOAF instance for myself, created a permanent home for FuXi, and added FuXi to the SemanticWebDOAPBulletinBoard WiKi. This was primarily motivated by Libby's post on Semantic Web Applications and Demos, which I thought would be a perfect forum for FuXi. I probably need to move it into a CVS repository when I can find the time.

Below is the output of running the test case:

loaded model with owl/rdfs minimal rules and initial test facts
executing Versa query:
prospective-query('urn:uuid:MinimalRules',
                               'urn:uuid:InitialFacts',
                               'distribute(list(test:Lion,test:Don_Giovanni),\'. - rdf:type -> *\')',
                               'urn:uuid:InitialFacts')
extracted 35 rules from urn:uuid:MinimalRules
extracted 16 facts from source model (scoped by urn:uuid:InitialFacts) into interpreter. Executing...
inferred 15 statements in 0.0526609420776 seconds
time to add inferred statements (to scope: urn:uuid:InitialFacts): 0.000159025192261
compiled prospective query, executing (over scope: urn:uuid:InitialFacts)
time to execute prospective query:  0.0024938583374
time to remove inferred statements:  0.0132219791412
[[[u'http://copia.ogbuji.net/files/FuXi/test/Animal',
   u'http://copia.ogbuji.net/files/FuXi/test/LivingBeing']],
 [[u'http://copia.ogbuji.net/files/FuXi/test/DaPonteOperaOfMozart']]]

This particular test extracts inferred statements from an initial graph using a functional subset of the original owl-rules.n3 and executes the Versa query (which essentially asks: what classes do the resources test:Lion and test:Don_Giovanni belong to?) to demonstrate OWL reasoning (class membership extension by owl:oneOf and owl:unionOf, specifically).

see previous posts on FuXi/N3/4Suite:

Chimezie Ogbuji

via Copia

FuXi v0.6 - Rearchitected / Repackaged

I've been experimenting with the use of FuXi as an alternative in situations where I had been manipulating application-specific RDF content using Versa within a host language (XSLT). In some cases I've been able to reduce a very complex set of XSLT logic to one or two queries on RDF models extended via a handful of very concise rules (expressed as N3). I'm hoping to build some use cases to demonstrate this later.

The result is that I've rearchitected FuXi to work as a black box directly with a 4RDF Model instance (it is now query agnostic, so it can be plugged in as an extension library to any other or future RDF query language bound to a 4RDF model). Prior to this version, it extracted formula statements via Versa queries instead of directly through the Model interfaces.

Right now I primarily use it through a single Versa function, prospective-query. Below is an excerpt from the README.txt describing its parameters:

prospective-query

prospective-query( ruleGraph, targetGraph, expr, qScope=None)

Using FuXi, it takes all the facts from the current query context (which may or may not be scoped) and the rules from the <ruleGraph> scope, and invokes/executes the Rete reasoner. It adds the inferred statements to the <targetGraph> scope. Then it performs the query <expr> within <qScope> (or the entire model if None), removing the inferred statements upon exit.


FuXi is now a proper Python package (with a setup.py) and I've moved it (permanently, I hope) to: http://copia.ogbuji.net/files/FuXi

I was a little unclear on Pychinko's specific dependencies with rdflib and cwm in my previous post, but Yarden Katz cleared up the confusion in his comments (thanks).

The installation and use of FuXi should be significantly easier than before with the recent inclusion of the N3 deserializer/parser into 4Suite.

Chimezie Ogbuji

via Copia

FuXi - Versa / N3 / Rete Expert System

Pychinko is a Python implementation of the classic Rete algorithm, which provides the inferencing capabilities needed by an expert system. Part of Pychinko works on top of cwm / afon out of the box. However, its Interpreter only relies on rdflib to formally represent the terms of an RDF statement.

FuXi only relies on Pychinko itself, the N3 deserializer for persisting N3 rules, and rdflib's Literal and URIRef to formally represent the corresponding terms of a Pychinko Fact. FuXi consists of 3 components (in addition to a 4RDF model for Versa queries):

I. FtRdfReteReasoner

Uses Pychinko and N3RuleExtractor to reason over a scoped 4RDF model.

II. N3RuleExtractor

Extracts Pychinko rules from a scoped model with statements deserialized from an N3 rule document

III. 4RDF N3 Deserializer

see: N3 Deserializer

The rule extractor reverses the reification of statements contained in formulae/contexts as performed by the N3 processor. It uses three Versa queries to do this.

Using the namespace mappings:

Extract antecedent statements of logical implications

distribute(
  all() |- log:implies -> *,
  '.',
  '. - n3r:statement -> *'
)

Extract implied / consequent statements of logical implications

distribute(
  all() - log:implies -> *,
  '.',
  '. - n3r:statement -> *'
)

Extract the terms of an N3 reified statement

distribute(
  <statement>,
  '. - n3r:subject -> *',
  '. - n3r:predicate -> *',
  '. - n3r:object -> *'
)
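
In plain Python terms, the effect of those three queries is roughly the following un-reification step (a sketch, not the N3RuleExtractor's actual code; get_objects stands in for the Versa traversal '. - n3r:xxx -> *'):

def unreify(statement_resources, get_objects):
    """Rebuild (subject, predicate, object) tuples from N3-reified
    statement resources."""
    triples = []
    for stmt in statement_resources:
        s = get_objects(stmt, 'n3r:subject')[0]
        p = get_objects(stmt, 'n3r:predicate')[0]
        o = get_objects(stmt, 'n3r:object')[0]
        triples.append((s, p, o))
    return triples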

The FuXi class provides methods for performing scoped Versa queries on a model extended via inference or on just the inferred statements:

For example, take the following fact document deserialized into a model:

@prefix : <http://foo/bar#> .
:chimezie :is :snoring .

Now consider the following rule:

@prefix ex: <http://foo/bar#> .
{?x ex:is ex:snoring} => {?x a ex:SleepingPerson} .

Below is a snapshot of FuXi performing the Versa query "type(ex:SleepingPerson)" on a model extended by inference using the above rule:

Who was FuXi? Author of the predecessor to the King Wen Sequence

Chimezie Ogbuji

via Copia

N3 Deserialization in 4RDF (and other possibilities)

Motivated by the idea that 4RDF (and 4Suite) could benefit greatly from being able to parse Notation 3 documents (see bottom), I attempted to write an N3 Deserializer for 4RDF that makes use of Sean B. Palmer's n3processor.

Ft.Rdf.Serializers.N3

The module simply needs to be added to 4RDF's Ft/Rdf/Serializers directory. I hesitate to check it in, since 4Suite is now in a feature-frozen beta release cycle. It implements a sink which captures generated triples and adds them to a 4RDF model:

class FtRDFSink:
  """
  An n3proc sink that captures statements produced from
   processing an N3 document
  """
  def __init__(self, scope,model):
     self.stmtTuples = []
     self.scope = scope
     self.model = model
     self.bnodes = {}
     self.resources = {}

  def start(self, root):
     self.root = root

  def statement(self, s, p, o, f):
     #First check if the subject is a bnode (via n3proc convention)
     #If so, use 4RDF's bnode convention instead
     #Use self.bnodes as a map from n3proc bnode uris -> 4RDF bnode urns
     if s[:2] == '_:':
        if s in self.bnodes:
           s = self.bnodes[s]
        else:
           newBNode = self.model.generateBnode()
           self.bnodes[s] = newBNode
           s = newBNode

     #Make the same check for the statement's object
     if o[:2] == '_:':
        if o in self.bnodes:
           o = self.bnodes[o]
        else:
           newBNode = self.model.generateBnode()
           self.bnodes[o] = newBNode
           o = newBNode

     #Mark the statement's subject as a resource (used later for objectType)
     self.resources[s] = None

     if f == self.root:
        #Regular, in scope statement
        stmt=(s,p,o,self.scope)
        self.stmtTuples.append(stmt)
     else:
        #Special case
        #This is where the features of N3 beyond standard RDF can be harvested
        #In particular, a statement with a different scope / context than
        #that of the containing N3 document is a 'hypothetical statement'
        #Such statement(s) are mostly used to specify implication via log:implies
        #Such implication rules can be persisted (by flattening the formulae)
        #and later interpreted by a backward-chaining inference process
        #triggered from Versa or from within the 4RDF Model retrieval interfaces
        #Formulae are assigned a bnode uri by n3proc which needs to be mapped
        #to a 4RDF bnode urn
        if f in self.bnodes:
           f = self.bnodes[f]
        else:
           newBNode = self.model.generateBnode()
           self.bnodes[f] = newBNode
           f = newBNode

        self.resources[f] = None

        self.flatten(s, p, o, f)

  def flatten(self, s, p, o, f):
     """
     Adds a 'Reified' hypothetical statement (associated with the formula f)
     """
     fs = self.model.generateUri()
     self.stmtTuples.append((f,
                             N3R.statement,
                             fs,
                             self.scope))
     self.stmtTuples.append((fs,
                             N3R.subject,
                             s,
                             self.scope))
     self.stmtTuples.append((fs,
                             N3R.predicate,
                             p,
                             self.scope))
     self.stmtTuples.append((fs,
                             N3R.object,
                             o,
                             self.scope))

In addition, I made a patch to the 4RDF command that adds 'n3' as an input format. See my previous blog post for an example of using this command to generate diagrams of 4RDF graphs.

For example, this diagram is of rdfs-rules, rendered via the 4rdf command line (patched to be able to deserialize n3 documents).

Advantages

First, deserializing N3 will almost always be faster than deserializing RDF/XML (especially for larger graphs), since it's a text parse vs. an XML parse. So, if 4Suite repository XSLT Document Definitions are augmented to be able to deserialize into the model via N3, repository operations on documents with such Document Definitions will be significantly faster.

Finally, by allowing the deserialization of SWAP constructs such as log:implies, formula reification, and existential and universal variables, reasoners capable of interpreting N3 rule semantics (such as Sean's pyrple Graph class - see a demonstration of its inference capabilities) can perform inference externally (without having to build it into 4RDF or Versa) on a 4RDF store containing RDF deserialized from N3 documents with appropriate rules.

One thing to note about this implementation is that the default baseUri of N3 documents is http://nowhere when the specified scope is a URN (since the n3processor is unable to handle URNs). Otherwise, the given scope is used as the baseUri.

[Chimezie Ogbuji]

via Copia