SPARQL BisonGen Parser Checked in to RDFLib

[by Chimezie Ogbuji]

This is basically an echo of my recent post to the rdflib mailing list (yes, we have one now).

I just checked in the most recent version of what had been an experimental BisonGen SPARQL parser for RDFLib. It parses a SPARQL query into a set of Python objects representing the components of the grammar.

The parser itself is a Python/C extension (though the BisonGen grammar could be extended to incorporate Java callbacks instead), so the setup.py had to be modified in order to compile it into a Python module (a minimal sketch of such a change follows the file list). The BisonGen files themselves are:

  • SPARQL.bgen (the main file that includes the others)
  • SPARQLTurtleSuperSet.bgen.frag (the second part of the grammar which focuses on the variant of Turtle that SPARQL uses)
  • SPARQLTokens.bgen.frag (Token definitions)
  • SPARQLLiteralLexerPatterns.bgen.frag (Keywords and 'literal' tokens)
  • SPARQLLexerPatterns.bgen.frag (lexer definition)
  • SPARQLLexerDefines.bgen.frag (the lexer patterns themselves)
  • SPARQLParser.c (the generated parser)
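
Incidentally, the setup.py change amounts to registering the generated C file as an extension module. Here is a minimal sketch of that sort of addition (the module name is illustrative, not necessarily the one actually used):

from distutils.core import setup, Extension

setup(
    name="SPARQLParser",
    # Illustrative: compile the generated parser into a C extension module
    ext_modules=[Extension("sparql_parser_c", sources=["SPARQLParser.c"])],
)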

Theoretically, the second part of the grammar, dedicated to the Turtle syntax, could be broken out into separate Turtle/N3 parsers which could be built into RDFLib, removing the current dependency on n3p.

I also checked in a test harness that's meant to work with the DAWG test cases.

I'm currently stuck on this particular test case, but working through it. The majority of the grammar is supported, except for mathematical expressions and certain case-insensitive variations on the SPARQL operators.

The test harness only checks parsing; it doesn't evaluate the parsed query against the corresponding set of test data, but it could easily be extended to do so. I'm not sure about the state of those test cases: some have been 'accepted' and some haven't. In addition, I came across a couple that were illegal according to the most recent SPARQL grammar (the bad tests are noted in the test harness). Currently the parser is stand-alone; it doesn't invoke sparql-p, for a few reasons:

  • I wanted to get it through parsing the queries in the test case
  • Our integrated version of sparql-p is outdated; Ivan has been working on a more recent version with some improvements that should probably be considered for integration
  • Some of the more complex combinations of Graph Patterns don't seem solvable without re-working / extending the expansion tree solver. I have some ideas about how this could be done (to handle things like nested UNIONs and OPTIONALs) but wanted to get a working parser in first

Using the parser is simple:

from rdflib.sparql.bison import Parse
p = Parse(query, DEBUG)
print p

p is an instance of rdflib.sparql.bison.Query.Query.

Most of the parsed objects implement a __repr__ method which prints a 'meaningful' representation recursively, down the hierarchy to the lower-level objects, so tracing how each __repr__ method is implemented is a good way to learn how to deconstruct the parsed SPARQL query object.

These methods could probably be re-written to echo the SPARQL query right back, as a way to:

  • Test round-tripping of SPARQL queries (see the sketch below)
  • Create SPARQL queries by instantiating the rdflib.sparql.bison.* objects and converting them to strings
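
For instance, a round-tripping test might look something like this minimal sketch, which assumes (hypothetically) that the __repr__ methods were reworked to emit syntactically valid SPARQL, and that the DEBUG flag can simply be passed as 0:

from rdflib.sparql.bison import Parse

QUERY = "SELECT ?name WHERE { ?person <http://xmlns.com/foaf/0.1/name> ?name }"

p = Parse(QUERY, 0)            # an rdflib.sparql.bison.Query.Query instance
echoed = repr(p)               # hypothetically: the query echoed back as SPARQL
reparsed = Parse(echoed, 0)    # re-parse the echoed text

# Round-tripping holds if the echoed form survives a second parse
assert repr(reparsed) == echoed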

It's still a work in progress, but I think it's far enough through the test cases that it can handle most of the more common syntax.

Working with BisonGen was a good experience for me, as I hadn't done any real work with parser generators since my days at the University of Illinois (class of '99). There are plenty of good references online for the Flex pattern format as well as for Bison itself. I also got some good pointers from AndyS and EricP on #swig.

It also was an excellent way to get familiar with the SPARQL syntax from top to bottom, since every possible nuance of the grammar that may not be evident from the specification had to be addressed. The effort also generated some comments on inconsistencies in the specification grammar, which I've since redirected to public-rdf-dawg-comments.

Chimezie Ogbuji

via Copia

My Definition of Semi-structured Data

Dissatisfied with the definitions of Semi-structured data I've found, I decided to roll for dolo:

Semi-structured data are data that are primarily not suitable for representation with the relational model. Often such data are better expressed by less restrictive forms of propositional logic, which might be as simple (and open-ended) as a hierarchical model (XML, for instance) or as complex as 'proper' First-order Logic.

Chimezie Ogbuji

via Copia

Cleveland Clinic Job Posting for Data Warehouse Specialist

Cleveland Clinic Foundation has recently posted a job position for a mid-level database developer with experience in semi-structured data binding, transformation, modeling, and querying. I happen to know that experience with the following is a big plus:

  • Python
  • XML data binding
  • XML / RDF querying (SPARQL, XPath, XQuery, etc.)
  • XML / RDF modeling
  • Programming database connections
  • *NIX System administration
  • Web application frameworks

You can follow the above link to post an application.

Chimezie Ogbuji

via Copia

Practical Temporal Reasoning with Notation 3

Recently, Dan Brickley expressed an interest in the extent to which bioinformatics research efforts are leveraging RDF for temporal reasoning (and patient healthcare record integration in general). The thread on the value of modeling temporal relations explicitly, versus relying on them being built into core RDF semantics, left me feeling that a concrete example was in order.

We have a large (3500+ assertions) OWL Full ontology describing all the data we collect about Cardiothoracic procedures (the primary purpose of our database as currently constituted – in a relational model). There are several high-level classes we use to model concepts that, though core to our model, can be thought of as general enough for a common upper ontology for patient data.

One of these classes is ptrec:TemporalData (from here on out, I'll be using the ptrec prefix for vocabulary terms in our ontology), which is the ancestor of all classes that are expressed on an axis of time. We achieve a level of precision in modeling data on a temporal axis that enhances the kind of statistical analysis we perform on a daily basis.

In particular, we use three properties:

  • ptrec:startDT (xs:dateTime)
  • ptrec:stopDT (xs:dateTime)
  • ptrec:instantDT (xs:dateTime)

The first two are used to describe an explicit (and 'proper') interval for an event in a patient record; this is often the case where the event in question only has a date (and no specific time) associated with it. The third is used when the event is instantaneous and the associated date / time is known.
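
To make this concrete, here is a minimal sketch of all three properties in Notation 3 (the instance names are hypothetical, and ptrec:Operation is assumed to be a subclass of ptrec:TemporalData):

#A medication administration for which only the date is known,
#modeled as a proper interval spanning that day
:med1 a ptrec:TemporalData;
    ptrec:startDT "2005-06-01T00:00:00";
    ptrec:stopDT "2005-06-01T23:59:59".

#An instantaneous observation whose exact date / time is known
:bp1 a ptrec:TemporalData;
    ptrec:instantDT "2005-06-15T08:05:00".

#An operation with an explicitly recorded interval
:op1 a ptrec:Operation, ptrec:TemporalData;
    ptrec:startDT "2005-06-15T07:00:00";
    ptrec:stopDT "2005-06-15T11:45:00".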

The biggest challenge isn't simply the importance of time in asking questions of our data, but the importance of temporal factors that are keyed off specific, moving points of reference. For example, consider a case study on the effects of administering a medication within X days of a specific procedure. The qualifying procedure is key to the observations we wish to make and behaves as a temporal anchor. Another case study interested in the effects of administering the same medication, but with respect to a different procedure, should be expected to rely on the same temporal logic – keyed off a different point in time. However, by being explicit about how we place temporal data on a time axis (as instants or intervals), we can outline a logic for general temporal reasoning that can be used by either case study.

Linking into an OWL time ontology, we can set up some simple Notation 3 rules for inferring interval relationships to aid such questions:

#Inferring before and after temporal relationships (between instants and intervals alike)
{?a a ptrec:TemporalData;
    ptrec:instantDT ?timeA. 
 ?b a ptrec:TemporalData;
    ptrec:instantDT ?timeB. ?timeA str:greaterThan ?timeB} 

         => {?a time:intAfter ?b.?b time:intBefore ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB. ?startTimeA str:greaterThan ?stopTimeB} 

         => {?a time:intAfter ?b.?b time:intBefore ?a}

#Inferring during and contains temporal relationships (between proper intervals).
#Since there is no str:greaterThanOrEqual CWM built-in, the various permutations
#are spelled out explicitly
{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:lessThan ?startTimeB. ?stopTimeA str:greaterThan ?stopTimeB} 

         => {?a time:intContains ?b.?b time:intDuring ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:equalIgnoringCase ?startTimeB. ?stopTimeA str:greaterThan ?stopTimeB} 

     => {?a time:intContains ?b.?b time:intDuring ?a}

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:TemporalData;
    ptrec:startDT ?startTimeB;
    ptrec:stopDT ?stopTimeB.
 ?startTimeA str:lessThan ?startTimeB. ?stopTimeA str:equalIgnoringCase ?stopTimeB} 

     => {?a time:intContains ?b.?b time:intDuring ?a}

Notice the value in xs:dateTime literals being ordered temporally and lexically (as Unicode strings) at the same time, provided they share a consistent format and time zone. This allows us to rely on str:lessThan and str:greaterThan for determining interval precedence and containment.

Terms such as 'preoperative' (which refer to events that occurred before a specific procedure / operation) and 'postoperative' (events that occurred after a specific procedure / operation), which are core to general medical research nomenclature, can be tied directly into this logic:

{?a a ptrec:TemporalData.  ?b a ptrec:Operation. ?a time:intBefore ?b}
   => {?a ptrec:preOperativeWRT ?b}

{?a a ptrec:TemporalData.  ?b a ptrec:Operation. ?a time:intAfter ?b}
   => {?a ptrec:postOperativeWRT ?b}

Here we introduce two terms (ptrec:preOperativeWRT and ptrec:postOperativeWRT) which relate temporal data to an operation in the same patient record. Using interval relationships as a foundation, you can link domain-specific temporal vocabulary into your model and rely on a reasoner to set up a framework for temporal reasoning.
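
Run against the sample data sketched earlier (for example, with cwm sample.n3 rules.n3 --think), the rules above would license conclusions along these lines:

:op1 time:intAfter :med1. :med1 time:intBefore :op1.
:med1 ptrec:preOperativeWRT :op1.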

Imagine the value in using a backward-chaining prover (such as Euler) to logically demonstrate exactly why a specific medication (associated with the date when it was administered) is considered to be preoperative with respect to a qualifying procedure. This would complement the statistical analysis of a case study quite nicely with formal logical proof.

Now, it's worth noting that such a framework (as it currently stands) doesn't allow precision of interval relationships beyond simple intersection and overlap. For instance, in most cases you would be interested primarily in medication administered within a specific length of time. This doesn't really undermine the above framework, since it amounts to no more than a functional requirement to be able to perform calendar math. Imagine if the built-in properties of CWM were expanded to include functions for performing date math: for instance, a (hypothetical) time:addDT function that adds an xs:duration offset to an xs:dateTime, as used in the rule below.

With such a function we could expand our logical framework to include more precise temporal relationships.
For example, if we only wanted medications administered within 30 days prior to an operation to be considered 'preoperative':

{?a a ptrec:TemporalData;
    ptrec:startDT ?startTimeA;
    ptrec:stopDT ?stopTimeA.  
 ?b a ptrec:Operation;
    ptrec:startDT ?opStartTime;
    ptrec:stopDT ?opStopTime.  
 ?a time:intBefore ?b.
 #?preOpMin is the earliest point of the 30-day preoperative window
 (?opStartTime "-P30D") time:addDT ?preOpMin. ?stopTimeA str:greaterThan ?preOpMin}
    => {?a ptrec:preOperativeWRT ?b}

It's worth noting that such an addition (to facilitate calendar math) would be quite useful as a general extension for RDF processors.

I think the majority of the requirements for temporal reasoning (in any domain) can be accommodated by explicit modeling, because FOPL (the foundation upon which RDF is built) was designed to be expressive enough to represent all human concepts.

Chimezie Ogbuji

via Copia

"Thinking XML: Good advice for creating XML"

An earlier article (published in January) that I forgot to announce:

"Thinking XML: Good advice for creating XML"

Subtitle: Principles of XML design from the community at large
Synopsis: The use of XML has become widespread, but much of it is not well formed. When it is well formed, it's often of poor design, which makes processing and maintenance very difficult. And much of the infrastructure for serving XML can compound these problems. In response, there has been some public discussion of XML best practices, such as Henri Sivonen's document, "HOWTO Avoid Being Called a Bozo When Producing XML." Uche Ogbuji frequently discusses XML best practices on IBM developerWorks, and in this column, he gives you his opinion about the main points discussed in such articles. [Also discusses "Monastic XML," by Simon St. Laurent.]

[Uche Ogbuji]

via Copia

"Thinking XML: Review of RFC 3470: Guidelines for the use of XML"

"Thinking XML: Review of RFC 3470: Guidelines for the use of XML"

Thinking XML author Uche Ogbuji continues with the theme of XML best practices. In the previous installment "Good advice for creating XML," you looked at XML design recommendations from experts. In this article, you'll find recommendations from the Internet Engineering Task Force (IETF), an organization whose technical papers drive most Internet protocols. The IETF's XML recommendations are gathered together in RFC 3470: "Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols."

[Uche Ogbuji]

via Copia

"Tip: Remove sensitive content from your XML samples with XSLT"

"Tip: Remove sensitive content from your XML samples with XSLT"

Do you need to share samples of your XML code, but can't disclose the data? For example, you might need to post a sample of your XML code with a question to get some advice with a problem. In this tip, Uche Ogbuji shows how to use XSLT to remove sensitive content and retain the basic XML structure.

I limited this article to erasing rather than obfuscating sensitive content, which can be done with XSLT 1.0 alone. With EXSLT (or even XSLT 2.0) you can do some degree of obfuscation, allowing you to possibly preserve elements of character data that are important to the problem under discussion. Honestly, though, I prefer to solve this problem with even more flexible tools. As a bonus, the following is a bit of 4Suite/SAX code that uses a SAX filter to obfuscate character data by adding a random shift to the ordinal of each character in the Unicode alphanumeric class. This way, if exotic characters were part of the problem you're demonstrating, they'd be left alone. It's easy to use the code as a template, and usually all you have to change is the obfuscate function or the obfuscate_filter class in order to fine-tune the workings.

import re
import random
from xml.sax import make_parser, saxutils
from Ft.Xml import CreateInputSource, Sax

RANDOM_AMP = 15
ALPHANUM_PAT = re.compile('\w', re.UNICODE)

def obfuscate(old):
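    # Shift each alphanumeric character's code point by a random offset
    # in the range [-RANDOM_AMP, RANDOM_AMP]; other characters pass through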
    def mutate(c):
        return unichr(ord(c.group())+random.randint(-RANDOM_AMP,RANDOM_AMP))
    return ALPHANUM_PAT.subn(mutate, old)[0]

class obfuscate_filter(saxutils.XMLFilterBase):
    def characters(self, content):
        saxutils.XMLFilterBase.characters(self, obfuscate(content))
        return

if __name__ == "__main__":
    XML = "http://cvs.4suite.org/viewcvs/*checkout*/Amara/demo/labels1.xml"
    parser = make_parser(['Ft.Xml.Sax'])
    filtered_parser = obfuscate_filter(parser)
    handler = Sax.SaxPrinter()
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))

This code uses recent fixes and capabilities I checked into 4Suite CVS last week. I think all the needed details to understand the code are in the SAX section of the updated 4Suite docs, which John Clark has already posted.

[Uche Ogbuji]

via Copia

Just a friendly game of mailbox baseball

Baseball was never for Blacks
It used to be a pastime for Whites
Now it has mad Puerto Ricans
But that's not the point of this song.
The point of this song, and I make it mad simple when I be flipping this script
Is that the industry is all over the mound pitching but nobody's making any hits.
Hmmmmmm.

—Natural Resources—Negro League Baseball

Yeah. Mailbox baseball. That game of legend (I've never seen it played) where a pack of stereotypical American teenage lombards drives down the boulevard whacking at mailboxes with Louisville sluggers from the car windows until they hit one with unclaimed mail, enjoying the resulting shower of fluttering letters. I did something like that to myself today, quite unwittingly. (I'm amazed there's room in my cheek for the tongue after a day like today).

I've been completely retooling my e-mail habits, in part to use uche@ogbuji.net more for professional correspondence that does not have immediate bearing on my day job. In the process I managed to completely kill my fourthought.com address today, and as I hear it, it's all been bouncing to hell. Sorry, folks. Unless you work for a company that writes checks to Fourthought, Inc., I'll probably be asking you politely to start using my ogbuji.net address from now on anyway, so you might as well start now (that address wasn't affected by the outage). Meanwhile, I've straightened out the config problems, and as soon as DNS has a chance to propagate, my fourthought.com address should be working again.

BTW, those who have ever sent me e-mail at my ogbuji.net account and learned how infrequently I get around to checking it will notice a marked improvement in my attention (I should temper that promise by mentioning that I haven't been able to keep up with even my fourthought.com address in years. It's no fun being afflicted by chronic Scrolliosis).

Back-to-off-topic note w.r.t. my silly intro: If you're an underground hip-hop head and haven't heard the song whose lyrics I used, you'd best get up out there and find that single. It's a classic from 1997, and it was the first time I heard the M.C. then known as "What What?" and now recognized as the great Jean Grae (repping South Africa—she runs through your hood with her middle finger up). It's a great romp of a song that refuses to take itself too seriously, and has the tickly lounge piano in the background loop to match. It's da Jawn!

[Uche Ogbuji]

via Copia

Extension Functionality and Set Manipulation in RDF Query Languages

A recent bit by Andy Seaborne (on Property Functions in ARQ – Jena's query engine) got me thinking about general extension mechanisms for RDF querying languages.
In particular, he mentions two extensions that provide functionality for processing RDF lists and collections which (ironically) coincide with functions I had requested be considered for later generations of Versa.

The difference, in my case, was that the suggestions were for functions that cast RDF lists into Versa lists (or sets) – which are data structures native to Versa that can be processed with certain built-in functions.

Two other extensions I use quite often in Versa (and mentioned briefly in my XML.com article) are scope, and scoped-subquery. These have to do with identifying the context of a resource and limiting the evaluation of a query to within a named graph, respectively. Currently, the scoped function returns a list of the names of all graphs in which the resource is asserted as a member of any class (via rdf:type). I could imagine this being expanded to include the names of any graph in which statements about the resource are asserted. scoped-subquery doesn't present much value for a host language that can express queries as limited to a named context.

I also had some thoughts about an extension function mechanism that would allow an undefined function reference (for functions of arity 1 – i.e., functions that take only a single argument) to be interpreted as a request for all the objects of statements where the predicate is the function URI and the subject is the argument.
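
Here's a minimal sketch of how such a fallback might bottom out against an rdflib graph (the function name and extension registry are hypothetical):

def evaluate_function(graph, function_uri, arg, registry):
    # Invoke a registered extension function directly, if one exists
    if function_uri in registry:
        return registry[function_uri](arg)
    # Otherwise, treat the unknown URI as a 'predicate function': return
    # all objects of statements whose predicate is the function URI and
    # whose subject is the argument
    return list(graph.objects(subject=arg, predicate=function_uri))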

I recently finished writing a SPARQL grammar for BisonGen and hope to conclude that effort (at some point) once I get over a few hurdles. I was pleasantly surprised to find that the grammar for function invocation is pretty much identical for both query languages, which suggests there is room for some thought about a common mechanism (or just a common set of extension functionality – similar to the EXSLT effort) for RDF querying or general processing.

CWM has a rich and well-documented set of RDF extensions. The caveat is that the method signatures are restricted to dual input (subject and object), since the built-ins are expressed as RDF triples where the predicate is the name of the function and the subject and object are its arguments. Nevertheless, it is a good source from which an ERDF specification could be drafted.
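
For instance, CWM's math:greaterThan built-in takes its two arguments as the subject and object of a triple (the patient vocabulary below is made up for illustration):

{ :patient1 :age ?a. ?a math:greaterThan "65" } => { :patient1 a :Elderly }.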

My short wish-list of extension functions in such an effort would include:

  • List comprehension (intersection, union, difference, membership, indexing, etc.)
  • Resolving the context for a given node: context(rdfNode) => URI (or BNode?)
  • an is-a(resource) function (equivalent to Versa's type function without the entailment hooks)
  • a class(resource) which is an inverse of is-a
  • Functions for transitive closures and/or traversals (at the very least)
  • A fallback mechanism for undefined functions that allows them to be interpreted as unary 'predicate functions' (as sketched above)

Of course, functions which return lists instead of single resources would be problematic for any host language that can't process lists, but it's just some food for thought.

Chimezie Ogbuji

via Copia

Gooner luv (or "Aaaaaaaarsenal")

For a while, whenever my three-year-old, Jide, came across football on the TV, he'd jump up and down yelling "Aaaaaaarsenal. Ar-SE-nal!" regardless of who was actually playing. He was just imitating his Dad. I've surprised even myself at how quickly and surely I've become an Arsenal maniac. This is so even though the first EPL game I ever watched live was Tottenham vs. Coventry City at White Hart Lane. It was the best deal I could get for a London game. I liked Arsenal even then, and knew that Spurs were derby rivals, but I had no idea how intense the rivalry was until I heard the anti-Arsenal songs (never mind that Moustapha Hadji and Coventry City was the problem Spurs fans should have been addressing) and saw the very aggressive tee-shirts (I'll forever be disturbed by the one of the Spurs mascot raping the Arsenal mascot). I decided to affect Tottenham color for the day, just for my own safety. I thus have a Spurs scarf to go with my many bits of Arsenal gear and I can always entertain my Brit friends by wearing the two teams' colors together (my colleague Kenny Akpan, a Spurs fan, told me: "You do know you could be beaten up twice dressing like that in London, mate"). Yeah, Kenny. I know.

I did later visit Highbury to watch The Arsenal versus Leeds in a losing effort for the Gunners (undone by a magnificent Ian Harte free kick). This time I didn't feel queer siding with the sea of fans around me. Confirmation of my deep red blood. It's been The Arsenal without the slightest stint since then.

The gunner-side leaning started with Ian Wright and gang, back when it was quite hard to catch EPL in the U.S. I had to economize the efforts to find matches. I didn't really have a club affiliation (I enjoyed playing since I was a kid, but never got into watching until college age) but naturally I always plumped for the big names-- Manchester United or Arsenal games. I also looked for Liverpool games because my closest cousins are all Liverpool supporters (their family tradition since my Uncle took an interest in the career of John Barnes), but Liverpool haven't been all that fun to watch for a while. Interestingly enough I also came to enjoy watching Chelsea because of Nigerian Celestine Babayaro, Zola's brilliance, and the constant drama of Tore André Flo getting subbed in with five minutes to go and still finding a way to score. The Chelski era, especially under Mourinho, has put a firm end to that.

But Arsenal always took the cake. In the past decade of avid watching, one thing I've come to expect from The Arsenal is that they're always easy on the eyes, even in a losing effort. Indeed it seems to me that most of the time when they lose it's just because they insist on the perfect goal to the detriment of the actual score line. I think probably the only time I've seen them played off the park was in the 2005 FA Cup final versus Man U (I hated watching that game), and they still somehow managed to win. Chelsea also gave them quite a run-around earlier this year, I admit. Both cases were Arsenal at their nadir, in the process of their current transformation to renewed brilliance.

I just love the new guard. If it's possible to be more skillful than Dennis Bergkamp, van Persie is definitely staking a claim. Adebayor is like Kanu, but seems more reliable on goal (I guess that is to say he's like Kanu playing for country rather than club). Fabregas isn't as tough as Vieira, but makes up for it by complete mastery of the middle of the pitch, passing as if he has a circle of eyes full round his head. Touré can pretty much shut down anyone in the world, and Eboué is showing signs of the same ability. Senderos is ponderous, but at least careful. The only one I haven't warmed up to is Reyes, whom I find to be skillful enough, but very rarely a true inspiration (it doesn't help that he is wont to dive, and I hate divers). As for the old guard, Henry is beyond superlatives, Pirès is still the second deadliest dart in any sheaf, and Lehmann has shaken off his erratic form of a year ago.

Dwelling on Henry for a bit, I hope for everyone's sake that he doesn't leave. It's not just that he's good for Arsenal, but also that Arsenal is good for him. Any observer can see that Henry thrives in an atmosphere of finesse. He flourishes when the Arsenal midfield is in the throes of their exquisite one and two-touch passing rallies, or on their lightning quick counter-attack. When the opposing team succeeds in dropping lead into Arsenal midfielders' shorts, Henry tends to completely disappear from the game. He's not a Drogba or van Nistelrooy, who are happy to order their back line to completely bypass a beaten midfield and switch to route one football. This is where the false accusation comes from that Henry is not a big game player. In fact, Henry thrives in big games, as long as his midfield is thriving, as well.

Anyway, I just can't see Henry getting that steady diet of joga bonito anywhere else. OK. OK. Barcelona (another favorite club of mine), and to be fair, that's the club he's been most often linked with. But even though Barcelona's midfield could support Henry almost as well as Arsenal's, he would never dominate the Barcelona attack as thoroughly as he does Arsenal's. Not while Ronaldinho and Eto'o are there, and perhaps not even while Messi is there. At Arsenal, Henry is the inevitable cap on a run of midfield brilliance. At Barcelona he'd have to fight tooth and nail for his share of the finishing glory. And let's be honest. Fighting tooth and nail is not something that suits Henry's disposition. Personally, I think that if he leaves Arsenal, he will never really be the same. Of course neither would Arsenal. Here's to Henry in red, er, claret strip forever.

It's definitely a swish time to be an Arsenal fan. It's been fun watching them humble fellow European giants Real Madrid and Juventus, surpassing all previous Champions League progress. Gunners for the Champions League cup. This year and next. (There's a slim chance--requiring a complete Liverpool meltdown--for a Champions League tie between Gunners and Spurs next year. Now wouldn't that be tasty?) Oh, and beat Man U tomorrow.

[Uche Ogbuji]

via Copia