SPARQL BisonGen Parser Checked in to RDFLib

[by Chimezie Ogbuji]

This is basically an echo of my recent post to the rdflib mailing list (yes, we have one now).

I just checked in the most recent version of what had been an experimental BisonGen SPARQL parser for RDFLib. It parses a SPARQL query into a set of Python objects representing the components of the grammar.

The parser itself is a Python/C extension (though the BisonGen grammar could be extended to incorporate Java callbacks instead), so the build had to be modified in order to compile it into a Python module. The BisonGen files themselves are:

  • SPARQL.bgen (the main file that includes the others)
  • SPARQLTurtleSuperSet.bgen.frag (the second part of the grammar which focuses on the variant of Turtle that SPARQL uses)
  • SPARQLTokens.bgen.frag (Token definitions)
  • SPARQLLiteralLexerPatterns.bgen.frag (Keywords and 'literal' tokens)
  • SPARQLLexerPatterns.bgen.frag (lexer definition)
  • SPARQLLexerDefines.bgen.frag (the lexer patterns themselves)
  • SPARQLParser.c (the generated parser)

Theoretically, the second part of the grammar, which is dedicated to the Turtle syntax, could be broken out into separate Turtle/N3 parsers that could be built into RDFLib, removing the current dependency on n3p.

I also checked in a test harness that's meant to work with the DAWG test cases.

I'm currently stuck on one particular test case, but I'm working through it. Most of the grammar is supported, except mathematical expressions and certain case-insensitive variations on the SPARQL operators.
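To illustrate what a parse-only harness does, here is a minimal sketch. It is not the harness I checked in: the function name run_parse_only, the "*.rq" file pattern, and the skip list are assumptions made for the example, and the parser is passed in as a plain callable so the sketch does not depend on any particular parsing API.

```python
import glob
import os


def run_parse_only(query_dir, parse, pattern="*.rq", skip=()):
    """Try to parse every query file in query_dir.

    `parse` is any callable that takes the query text and raises an
    exception on a syntax error (hypothetical interface). `skip` lists
    file names to leave out, e.g. tests known to be illegal.
    Returns (passed, failed), where failed holds (name, error) pairs.
    """
    passed, failed = [], []
    for path in sorted(glob.glob(os.path.join(query_dir, pattern))):
        name = os.path.basename(path)
        if name in skip:
            continue
        with open(path) as handle:
            text = handle.read()
        try:
            parse(text)
        except Exception as exc:  # any parser error counts as a failure
            failed.append((name, str(exc)))
        else:
            passed.append(name)
    return passed, failed
```

Only the parse step is exercised here. Evaluating each parsed query against its test data would be a second stage added after the call to parse.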

The test harness only checks parsing: it doesn't evaluate the parsed query against the corresponding set of test data, but it can easily be extended to do so. I'm not sure about the state of those test cases; some have been 'accepted' and some haven't. In addition, I came across a couple that were illegal according to the most recent SPARQL grammar (the bad tests are noted in the test harness). Currently the parser is stand-alone; it doesn't invoke sparql-p, for a few reasons:

  • I wanted to get it through parsing the queries in the test cases first
  • Our integrated version of sparql-p is outdated as there is a more recent version that Ivan has been working on with some improvements that should probably be considered for integration
  • Some of the more complex combinations of Graph Patterns don't seem solvable without re-working / extending the expansion tree solver. I have some ideas about how this could be done (to handle things like nested UNIONs and OPTIONALs) but wanted to get a working parser in first

Using the parser is simple:

from rdflib.sparql.bison import Parse
# query is the SPARQL query string; DEBUG is the parser's debug flag
p = Parse(query, DEBUG)
print(p)

p is an instance of rdflib.sparql.bison.Query.Query

Most of the parsed objects implement a __repr__ method that prints a 'meaningful' representation recursively down the hierarchy to the lower-level objects, so tracing how each __repr__ method is implemented is a good way to determine how to deconstruct the parsed SPARQL query object.
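As a rough illustration of the recursive __repr__ idea, here is a small sketch. The classes Triple, GraphPattern, and SelectQuery are hypothetical stand-ins invented for this example; they are not the actual classes in rdflib.sparql.bison.

```python
class Triple(object):
    """Hypothetical leaf object: one triple pattern."""

    def __init__(self, subject, predicate, obj):
        self.subject = subject
        self.predicate = predicate
        self.obj = obj

    def __repr__(self):
        return "Triple(%r, %r, %r)" % (self.subject, self.predicate, self.obj)


class GraphPattern(object):
    """Hypothetical container holding a list of triples."""

    def __init__(self, triples):
        self.triples = list(triples)

    def __repr__(self):
        # %r on the list calls Triple.__repr__ for each member
        return "GraphPattern(%r)" % (self.triples,)


class SelectQuery(object):
    """Hypothetical top-level object."""

    def __init__(self, variables, where):
        self.variables = list(variables)
        self.where = where

    def __repr__(self):
        # Delegates to GraphPattern.__repr__, which delegates to Triple
        return "SelectQuery(variables=%r, where=%r)" % (self.variables, self.where)


query = SelectQuery(["?s"], GraphPattern([Triple("?s", "?p", "?o")]))
print(repr(query))
# SelectQuery(variables=['?s'], where=GraphPattern([Triple('?s', '?p', '?o')]))
```

Each __repr__ formats only its own level and leaves the rest to its children, so reading the methods from the top down also shows which attributes hold which parts of the query.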

These methods could probably be re-written to echo the SPARQL query right back as a way to

  • Test round-tripping of SPARQL queries
  • Create SPARQL queries by instantiating the rdflib.sparql.bison.* objects and converting them to strings
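A minimal sketch of the second idea, building a query from objects and converting it to a string, might look like the following. TriplePattern and Select are hypothetical classes made up for the example, not the parser's real object model, and only the simplest SELECT form is handled.

```python
class TriplePattern(object):
    """Hypothetical stand-in for a parsed triple pattern."""

    def __init__(self, subject, predicate, obj):
        self.terms = (subject, predicate, obj)

    def __str__(self):
        # Emit the pattern in SPARQL syntax, terminated by a dot
        return "%s %s %s ." % self.terms


class Select(object):
    """Hypothetical stand-in for a parsed SELECT query."""

    def __init__(self, variables, patterns):
        self.variables = list(variables)
        self.patterns = list(patterns)

    def __str__(self):
        body = " ".join(str(p) for p in self.patterns)
        return "SELECT %s WHERE { %s }" % (" ".join(self.variables), body)


query = Select(
    ["?name"],
    [TriplePattern("?person", "<http://xmlns.com/foaf/0.1/name>", "?name")],
)
text = str(query)
print(text)
# SELECT ?name WHERE { ?person <http://xmlns.com/foaf/0.1/name> ?name . }
```

With string conversion like this in place on the real objects, a round-trip test would amount to parsing the emitted text again and checking that the second parse serializes to the same string.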

It's still a work in progress, but I think it's far enough through the test cases that it can handle most of the more common syntax.

Working with BisonGen was a good experience for me, as I hadn't done any real work with parser generators since my days at the University of Illinois (class of '99). There are plenty of good references online for the Flex pattern format as well as Bison itself. I also got some good pointers from AndyS and EricP on #swig.

It was also an excellent way to get familiar with the SPARQL syntax from top to bottom, since every nuance of the grammar that may not be evident from the specification had to be addressed. It also generated some comments on inconsistencies in the specification grammar that I've since redirected to public-rdf-dawg-comments.

Chimezie Ogbuji

via Copia