Why JSON vs XML is a yawn

Strange spate of discussion recently about XML vs. JSON. On his weblog, M. David Peterson states what I think is obvious: there is no serious battle between XML and JSON. They're entirely complementary. Mike Champion responds:

The same quite rational response could be given about the "war" between WS-* and REST, but that has caused quintillions of electrons to change state in vain for the last 5 years or so. The fact remains that some people with a strong attachment to a given technology howl when it is declared to be less than universal. I completely agree that the metaphor of "keep a healthy tool chest and use the right one for the job at hand" is the appropriate response to all these "wars", but such boring pragmatism doesn't get Diggs or Pagerank.

If I may be so bold as to assume that "pragmatism" includes in some aspect of its definition "that which works", I see a bit of a "one of these things is not like the other" game (sing along, Sesame Street kids) in Mike's comparison.

  • XML - works
  • JSON - works
  • REST - works
  • WS-Kaleidoscope - are you kidding me?

Some people claim that the last entry works, but I've never seen any evidence beyond the "it happened to my sister's boyfriend's roommate's cousin" variety. On the other hand, by the time you click through any ten Web sites you probably have hard evidence that the other three work, and in the case of REST you have that evidence by the time you get to your first site.

For my part, I'm a big XML cheerleader, but JSON is great because it gives people a place to go when XML isn't really right. There are many such places, as I've often said ("Should Python and XML Coexist?", "Alternatives to XML", etc.) Folks tried from the beginning to make XML right for data as well as documents, and even though I think the effort made XML more useful than its predecessors, I think it's clear folks never entirely succeeded. XML is much better suited to documents and text than records and data. The effort to make it more suitable for data leads to the unfortunate likes of WXS (good thing there's RELAX NG) and RDF/XML (good thing there's Turtle). Just think about it: XQuery for JSON. Wouldn't it be so much simpler and cleaner than our XQuery for XML? Heck, wouldn't it be pretty much...SQL?

That having been said, there is one area where I see some benefit to XQuery. Mixed-mode data/document storage is inevitable given XML's impressive penetration. XQuery could be a thin layer for extracting data feeds from these mixed-mode stores, which can then be processed using JSON. If the XQuery layer could be kept thin enough, and I think a good architect can ensure this, the result could be a very neat integration. If I had ab initio control over such a system my preference would be schema annotations and super-simple RDF for data/document integration. After all, that's a space I've been working in for years now, and it is what I expect to focus on at Kadomo. But I don't expect to always be so lucky. Then again, DITA is close enough to that vision that I can be hopeful people are starting to get it, just as I'm grateful that the development of GRDDL means that people in the Semantic Web community are also starting to get it.
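As a rough illustration of the thin extraction layer described above, here is a minimal Python sketch. It is only a sketch under assumptions: the `report` document and its `sale` elements are invented for the example, and ElementTree's limited XPath support stands in for XQuery, which the standard library does not provide.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mixed-mode document: prose with record-like data embedded in it.
MIXED_DOC = """
<report>
  <para>Quarterly summary for the <em>northern</em> region.</para>
  <sale region="north" amount="1200"/>
  <para>Figures for the south arrived late.</para>
  <sale region="south" amount="950"/>
</report>
"""

def extract_feed(xml_text):
    """Pull only the record-like elements out of a mixed document."""
    root = ET.fromstring(xml_text)
    # ElementTree's XPath subset plays the role of the thin query layer here.
    return [
        {"region": sale.get("region"), "amount": int(sale.get("amount"))}
        for sale in root.findall(".//sale")
    ]

# The feed leaves the extraction layer as JSON.
feed_json = json.dumps(extract_feed(MIXED_DOC))

# Downstream code works with plain JSON data and never touches the XML.
total = sum(item["amount"] for item in json.loads(feed_json))
print(feed_json, total)
```

The point of the design is that the document stays in XML, where it is comfortable, while the data consumers only ever see simple JSON records.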

On the running code front I've been working on practical ways of working nicely with XML and JSON in tandem. The topic has pervaded several aspects of my professional work all at once in the past few months, and I expect to have a lot of code examples and tools to discuss here on Copia soon.

[Uche Ogbuji]

via Copia

Patterns and Optimizations for RDF Queries over Named Graph Aggregates

In a previous post I used the term 'Conjunctive Query' to refer to a kind of RDF query pattern over an aggregation of named graphs. However, the term (apparently) already has established roots in database querying and has a different meaning than what I intended. It's a pattern I have come across often and is for me a major requirement for an RDF query language, so I'll try to explain by example.

Consider two characters, King (Wen) and his heir / son (Wu) of the Zhou Dynasty. Let's say they each have a FOAF graph about themselves and the people they know within a larger database which holds the FOAF graphs of every historical character in literature.

The FOAF graphs for Wen and Wu, each preceded by the name of its graph, are:

<urn:literature:characters:KingWen>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wen_of_Zhou> a :Person;
    :name "King Wen";
    :mbox <mailto:kingWen@historicalcharacter.com>;
    rel:parentOf [ a :Person; :mbox <mailto:kingWu@historicalcharacter.com> ].

<urn:literature:characters:KingWu>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wu_of_Zhou> a :Person;
    :name "King Wu";
    :mbox <mailto:kingWu@historicalcharacter.com>;
    rel:childOf [ a :Person; :mbox <mailto:kingWen@historicalcharacter.com> ].

In each case, Wikipedia URLs are used as identifiers for each historical character. There are better ways for using Wikipedia URLs within RDF, but we'll table that for another conversation.

Now let's say a third party reads a few stories about “King Wen” and finds out he has a son, but doesn't know the son's name or the URL of either King Wen or his son. If this person wants to use the database to find out about King Wen's son with a reasonable response time, he/she has a few things going for him/her:

  1. foaf:mbox is an owl:InverseFunctionalProperty and so can be used for uniquely identifying people in the database.
  2. The database is organized such that all the outgoing relationships between foaf:Persons (foaf:knows, rel:childOf, rel:parentOf, etc.) of the same person are asserted in the FOAF graph associated with that person and nowhere else.
    So, the relationship between King Wen and his son, expressed with the term rel:parentOf, will only be asserted in
    urn:literature:characters:KingWen.

Yes, the idea of a character from an ancient civilization with an email address is a bit cheeky, but foaf:mbox is the only inverse functional property in FOAF to use with this example, so bear with me.

Now, both Versa and SPARQL support restricting queries with the explicit name of a graph, but there are no constructs for determining all the contexts of an RDF triple, that is:

The names of all the graphs in which a particular statement (or statements matching a specific pattern) are asserted.

This is necessary for a query plan that wishes to take advantage of [2]. Once we know the name of the graph in which all statements about King Wen are asserted, we can limit all subsequent queries about King Wen to that same graph without having to query across the entire database.

Similarly, once we know the email of King Wen's son we can locate the other graphs with assertions about this same email address (knowing they refer to the same person [1]) and query within them for the URL and name of King Wen's son. This is a significant optimization opportunity and key to this query pattern.

I can't speak for other RDF implementations, but RDFLib has a mechanism for this at the API level: a method called quads((subject, predicate, object)), which takes a triple pattern of three terms and yields tuples of size 4, one for each triple across the database that matches the pattern, together with the graph in which that triple is asserted:

# the pattern is passed as a single (s, p, o) tuple, matching the signature above
for subj, pred, obj, containingGraph in aConjunctiveGraph.quads((s, p, o)):
    ...  # do something with containingGraph
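To make the behavior concrete without depending on any particular RDFLib version, here is a minimal pure-Python sketch of the same idea. It is not RDFLib's implementation: the `QUADS` list, the shortened subject names and the use of `None` as a wildcard are all assumptions made for illustration.

```python
# Each statement is stored as (subject, predicate, object, graph_name).
# The subjects "wen" and "wu" are shortened, made-up identifiers.
QUADS = [
    ("wen", "foaf:name", "King Wen", "urn:literature:characters:KingWen"),
    ("wen", "foaf:mbox", "mailto:kingWen@historicalcharacter.com",
     "urn:literature:characters:KingWen"),
    ("wu", "foaf:name", "King Wu", "urn:literature:characters:KingWu"),
]

def quads(pattern, store=QUADS):
    """Yield every (s, p, o, graph) whose triple part matches the pattern.

    A None term in the pattern matches anything.
    """
    for quad in store:
        # zip stops after three terms, so the graph name is never compared.
        if all(want is None or want == got for want, got in zip(pattern, quad)):
            yield quad

# Which graphs assert that something is named "King Wen"?
graphs = {graph for _, _, _, graph in quads((None, "foaf:name", "King Wen"))}
print(graphs)
```

A linear scan like this is obviously not how a real quad store would answer the question, but it shows the shape of the result: matching triples paired with their containing graph.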

It's likely that most other QuadStores have similar mechanisms and given the great value in optimizing queries across large aggregations of named RDF graphs, it's a strong indication that RDF query languages should provide the means to express such a mechanism.

Most of what is needed is already there (in both Versa and SPARQL). Consider a SPARQL extension function which returns a boolean indicating whether the given triple pattern is asserted in a graph with the given name:

rdfg:AssertedIn(?subj,?pred,?obj,?graphIdentifier)
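One way such a function might be backed by a store is an index from each triple to the graphs that assert it. The sketch below is only an assumption about how that could look; `ContextIndex` and its method names are hypothetical and are not part of RDFLib, Versa or SPARQL.

```python
from collections import defaultdict

class ContextIndex:
    """Maps each triple to the set of graph names that assert it."""

    def __init__(self):
        self._graphs_by_triple = defaultdict(set)

    def add(self, s, p, o, graph):
        self._graphs_by_triple[(s, p, o)].add(graph)

    def contexts(self, s, p, o):
        """All graph names in which the exact triple is asserted."""
        return set(self._graphs_by_triple.get((s, p, o), ()))

    def asserted_in(self, s, p, o, graph):
        """Boolean test in the spirit of the proposed rdfg:AssertedIn."""
        return graph in self._graphs_by_triple.get((s, p, o), ())

index = ContextIndex()
index.add("ex:wen", "foaf:name", "King Wen", "urn:literature:characters:KingWen")

print(index.asserted_in("ex:wen", "foaf:name", "King Wen",
                        "urn:literature:characters:KingWen"))
print(index.asserted_in("ex:wen", "foaf:name", "King Wen",
                        "urn:literature:characters:KingWu"))
```

With an index like this, both the boolean test and the "all contexts of a triple" lookup are single dictionary accesses rather than scans of the whole database.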

We can then get the email of King Wen's son efficiently with:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?mbox
WHERE {
    GRAPH ?foafGraph {
      ?kingWen :name "King Wen";
               rel:parentOf [ a :Person; :mbox ?mbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWen, :name, "King Wen", ?foafGraph)) .
}

Now, it is worth noting that this mechanism can be supported explicitly by asserting provenance statements associating the people the graphs are about with the graph identifiers themselves, such as:

<urn:literature:characters:KingWen> 
  :primaryTopic <http://en.wikipedia.org/wiki/King_Wen_of_Zhou>.

However, I think that the relationship between an RDF triple and the graph in which it is asserted, although currently outside the scope of the RDF model, should have its semantics outlined in the RDF abstract syntax instead of using terms in an RDF vocabulary. The demonstrated value in RDF query optimization makes for a strong argument:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?kingWu ?sonName
WHERE {
    GRAPH ?wenGraph {
      ?kingWen :name "King Wen";
               :mbox ?wenMbox;
               rel:parentOf [ a :Person; :mbox ?wuMbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWen, :name, "King Wen", ?wenGraph)) .
    GRAPH ?wuGraph {
      ?kingWu :name ?sonName;
              :mbox ?wuMbox;
              rel:childOf [ a :Person; :mbox ?wenMbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWu, :name, ?sonName, ?wuGraph)) .
}

Generally, this pattern is any two-part RDF query across a database (a collection of multiple named graphs) in which the first part is scoped to the entire database and identifies terms that are local to a specific named graph, and the second part is scoped to that named graph.
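The two-part pattern can be sketched in plain Python over the King Wen data. This is a toy model and not a query engine: the `DATABASE` dictionary, the shortened predicate names, the blank node labels and the helper functions are all assumptions made for the example.

```python
WEN = "http://en.wikipedia.org/wiki/King_Wen_of_Zhou"
WU = "http://en.wikipedia.org/wiki/King_Wu_of_Zhou"
WEN_MBOX = "mailto:kingWen@historicalcharacter.com"
WU_MBOX = "mailto:kingWu@historicalcharacter.com"

# graph name -> set of (subject, predicate, object); "_:b1" and "_:b2" are blank nodes
DATABASE = {
    "urn:literature:characters:KingWen": {
        (WEN, "name", "King Wen"), (WEN, "mbox", WEN_MBOX),
        (WEN, "parentOf", "_:b1"), ("_:b1", "mbox", WU_MBOX),
    },
    "urn:literature:characters:KingWu": {
        (WU, "name", "King Wu"), (WU, "mbox", WU_MBOX),
        (WU, "childOf", "_:b2"), ("_:b2", "mbox", WEN_MBOX),
    },
}

def match(triples, s=None, p=None, o=None):
    """Triples in one graph matching the pattern; None is a wildcard."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

def graphs_asserting(s=None, p=None, o=None):
    """Database-wide step: names of graphs holding a matching triple."""
    return [name for name, triples in DATABASE.items() if match(triples, s, p, o)]

def find_sons(parent_name):
    """Yield (url, name) for each child of the person with the given name."""
    for parent_graph in graphs_asserting(p="name", o=parent_name):
        local = DATABASE[parent_graph]  # every later lookup stays in this graph
        for parent, _, _ in match(local, p="name", o=parent_name):
            for _, _, child in match(local, s=parent, p="parentOf"):
                for _, _, mbox in match(local, s=child, p="mbox"):
                    yield from describe_by_mbox(mbox)

def describe_by_mbox(mbox):
    """Locate graphs mentioning the mbox, then query only inside them."""
    for graph in graphs_asserting(p="mbox", o=mbox):
        for subj, _, _ in match(DATABASE[graph], p="mbox", o=mbox):
            for _, _, name in match(DATABASE[graph], s=subj, p="name"):
                yield subj, name

print(list(find_sons("King Wen")))
```

Only `graphs_asserting` ever looks across the whole database; everything else runs against a single named graph, which is where the optimization comes from.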

Chimezie Ogbuji

via Copia

Comprehensive RDF Query APIs for RDFLib

[by Chimezie Ogbuji]

RDFLib's support for SPARQL has come full circle and I wasn't planning on blogging on the developments until they had settled some, and they have. In particular, the last piece was finalizing a set of APIs for querying and result processing that fit well within the framework of RDFLib's various Graph APIs. The other issue was for the query APIs to accommodate eventual support for other querying languages (Versa, for instance) that are capable of picking up the slack where SPARQL is wanting (transitive closures, for instance: try composing a concise SPARQL query for calculating the transitive closure of a given node along the rdfs:subClassOf property and you'll immediately see what I mean).
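For readers unfamiliar with the transitive closure problem mentioned above, here is a small sketch of what the computation involves, written in plain Python rather than in any query language. The `ex:` class names and the `TRIPLES` list are invented for the example.

```python
from collections import deque

SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical class hierarchy expressed as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:Poodle", SUBCLASS_OF, "ex:Dog"),
    ("ex:Dog", SUBCLASS_OF, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
    ("ex:Cat", SUBCLASS_OF, "ex:Mammal"),
]

def transitive_closure(triples, start, predicate):
    """Return every node reachable from start by following predicate."""
    # Index the direct links once so each step is a dictionary lookup.
    direct = {}
    for s, p, o in triples:
        if p == predicate:
            direct.setdefault(s, set()).add(o)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in direct.get(node, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(transitive_closure(TRIPLES, "ex:Poodle", SUBCLASS_OF)))
```

The loop follows the property an unbounded number of steps, which is exactly what a fixed-size graph pattern cannot express.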

Querying

Every Graph instance now has a query method through which RDF queries can be dispatched:

def query(self, strOrQuery, initBindings={}, initNs={}, DEBUG=False, processor="sparql"):
    """
    Executes a SPARQL query (eventually will support Versa queries with same method) against this Conjunctive Graph
    strOrQuery - Is either a string consisting of the SPARQL query or an instance of rdflib.sparql.bison.Query.Query
    initBindings - A mapping from variable name to an RDFLib term (used for initial bindings for SPARQL query)
    initNs - A mapping from a namespace prefix to an instance of rdflib.Namespace (used for SPARQL query)
    DEBUG - A boolean flag passed on to the SPARQL parser and evaluation engine
    processor - The kind of RDF query (must be 'sparql' until Versa is ported)
    """

The first argument is either a query string or a pre-compiled query object (compiled using the appropriate BisonGen mechanism for the target query language). Pre-compilation can be useful for avoiding redundant parsing overhead for queries that need to be evaluated repeatedly:

from rdflib.sparql.bison import Parse
queryObject = Parse(sparqlString)
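The benefit of pre-compilation can be illustrated with a generic caching sketch. Note the assumptions: `compile_query` below is a hypothetical stand-in for an expensive parser and does not call RDFLib or BisonGen at all; it only splits the string so that the example runs anywhere.

```python
from functools import lru_cache

PARSE_CALLS = 0  # counts how often the "expensive" parse actually runs

@lru_cache(maxsize=128)
def compile_query(query_text):
    """Hypothetical stand-in for a costly parse step; returns a reusable object."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return tuple(query_text.split())

query_text = "SELECT ?s WHERE { ?s ?p ?o }"
first = compile_query(query_text)
second = compile_query(query_text)  # served from the cache, no second parse

print(PARSE_CALLS, first is second)
```

Holding on to the compiled object yourself, as in the snippet above, achieves the same thing without a cache; the cache just makes it automatic for queries that arrive as strings.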

The next argument (initBindings) is a dictionary that maps variables to their values. Though variables are common to both languages, SPARQL variables differ from Versa variables in that they are string terms of the form “?varName”, whereas in Versa variables are QNames (the same as in XPath). For SPARQL queries the dictionary is expected to be a mapping from variables to RDFLib terms. This is passed on to the SPARQL processor as initial variable bindings.

initNs is yet another top-level parameter for the query processor: a namespace mapping from prefixes to namespace URIs.
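Conceptually, the two mappings play different roles: initNs resolves prefixes to full URIs, while initBindings fixes the values of some variables before evaluation starts. The sketch below shows that distinction with plain strings and dictionaries; the helper names are hypothetical and this is not how RDFLib's processor is implemented.

```python
def expand_qname(qname, init_ns):
    """Resolve 'prefix:local' to a full URI using a prefix mapping."""
    prefix, _, local = qname.partition(":")
    return init_ns[prefix] + local

def apply_bindings(pattern, init_bindings):
    """Replace any pre-bound variables in a triple pattern with their values."""
    return tuple(init_bindings.get(term, term) for term in pattern)

init_ns = {"foaf": "http://xmlns.com/foaf/0.1/"}
init_bindings = {"?person": "http://en.wikipedia.org/wiki/King_Wen_of_Zhou"}

pattern = ("?person", expand_qname("foaf:name", init_ns), "?name")
print(apply_bindings(pattern, init_bindings))
```

After both steps, only `?name` remains as a free variable for the processor to solve.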

The DEBUG flag is pretty self-explanatory. When True, it will cause additional print statements to appear for the parsing of the query (triggered by BisonGen) as well as the patterns and constraints passed on to the processor (for SPARQL queries).

Finally, the processor argument specifies which kind of processor to use to evaluate the query: 'versa' or 'sparql'. Currently (with emphasis on 'currently'), only SPARQL is supported.

Result formats

SPARQL has two result formats (JSON and XML). Thanks to Ivan Herman's recent contribution the SPARQL processor now supports both formats. The query method (above) returns instances of QueryResult, a common class for RDF query results which defines the following method:

def serialize(self,format='xml')

The format argument determines which result format to use. For SPARQL queries, the allowable values are 'graph' (for CONSTRUCT / DESCRIBE queries, in which case a resulting Graph object is returned), 'json', or 'xml'. The resulting object also behaves as an iterator over the bindings for manipulation in the host language (Python).
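To give a feel for what the 'json' format looks like, here is a sketch that builds a document in the general shape of the W3C SPARQL JSON results format (a `head` with variable names and a `results` object with `bindings`). It is not RDFLib's serializer; `to_sparql_json` is a hypothetical helper, and guessing the term type from the string prefix is a simplification made only for this example.

```python
import json

URI_PREFIXES = ("http://", "mailto:", "urn:")  # crude heuristic, example only

def to_sparql_json(variables, rows):
    """Serialize rows (dicts of variable name -> string value) as JSON results."""
    bindings = []
    for row in rows:
        binding = {}
        for var in variables:
            value = row.get(var)
            if value is None:
                continue  # unbound variables are simply left out of the binding
            kind = "uri" if value.startswith(URI_PREFIXES) else "literal"
            binding[var] = {"type": kind, "value": value}
        bindings.append(binding)
    return json.dumps({"head": {"vars": list(variables)},
                       "results": {"bindings": bindings}})

rows = [
    {"mbox": "mailto:kingWu@historicalcharacter.com", "name": "King Wu"},
    {"name": "King Wen"},  # ?mbox is unbound in this solution
]
print(to_sparql_json(["mbox", "name"], rows))
```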

Versa has its own set of result formats as well. Primarily there is an XML result format (see: Versa by Example) as well as Python classes for the various internal datatypes: strings, resources, lists, sets, and numbers. So, the eventual idea is to use the same method signature for serializing Versa query results as XML as you would for SPARQL queries.

SPARQL Limitations

The known SPARQL query forms that aren't supported are:

  • DESCRIBE/CONSTRUCT (very temporary)
  • Nested Group Graph Patterns
  • Graph Patterns can only be used (once) by themselves or with OPTIONAL patterns
  • UNION patterns can only be combined with OPTIONAL patterns
  • Outer FILTERs which refer to variables within an OPTIONAL

A lot of the above limitations can be addressed with formal equivalence axioms of SPARQL semantics, such as those mentioned in the recent paper on the complexity and semantics of SPARQL. Since there is very little guidance on the semantics of SPARQL I was left with the option of implementing only those equivalences that seemed obvious (in order to support the patterns in the DAWG test suite):

1) { patternA { patternB } } => { patternA. patternB }
2) { basicGraphPatternA OPTIONAL { .. } basicGraphPatternB }
  =>
{ basicGraphPatternA+B OPTIONAL { .. }}
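The two rewrites above can be sketched as transformations on a toy representation of group graph patterns. Everything about the representation is an assumption for illustration: nested groups are Python lists, triple patterns are tuples, and `OptionalBlock` is a hypothetical class standing in for OPTIONAL { .. }. This is not the data structure the RDFLib processor uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OptionalBlock:
    """Stands in for OPTIONAL { ... }; patterns is a tuple of triple patterns."""
    patterns: tuple

def flatten_group(group):
    """Rewrite 1: { A { B } } => { A . B }, applied recursively."""
    flat = []
    for item in group:
        if isinstance(item, list):  # a nested group
            flat.extend(flatten_group(item))
        else:
            flat.append(item)
    return flat

def merge_around_optional(group):
    """Rewrite 2: gather the basic patterns ahead of the OPTIONAL blocks."""
    basic = [item for item in group if not isinstance(item, OptionalBlock)]
    optional = [item for item in group if isinstance(item, OptionalBlock)]
    return basic + optional  # relative order within each part is preserved

A = ("?s", "foaf:name", "?name")
B = ("?s", "foaf:mbox", "?mbox")
OPT = OptionalBlock((("?s", "foaf:knows", "?friend"),))

print(flatten_group([A, [B]]))
print(merge_around_optional([A, OPT, B]))
```

Whether these rewrites preserve results in every case depends on the SPARQL semantics in question, which is the caveat the surrounding text already raises.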

It's still a work in progress but has come quite a long way. CONSTRUCT and DESCRIBE are already supported by the underlying processor; I just need to find some time to hook them up to the query interfaces. Time is something I've been really short of lately.

Chimezie Ogbuji

via Copia

Corrections to RDF Query Language Comparison

I recently came upon this dated comparison of features in RDF querying languages. It predates SPARQL and as a result prompted Dan Connolly to attempt to demonstrate how SPARQL fares in this matrix. In reading it, I realized that some of the features marked as 'No' under Versa are marked incorrectly. So here is my attempt to demonstrate how Versa (per the current specification) would implement these requirements:

5 Quantification: Return the persons who are authors of all publications

The section name is misleading as it suggests the use of FOPL semantics to resolve a pattern that can be solved without FOPL semantics:

filter(
        'all()',
        'eq(length(difference(all() - dc:creator -> *, . - dc:creator -> *)),0)'
    )
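The set-difference idea behind this kind of query (a universal condition holds when the difference with "everything" is empty) can be shown in plain Python. The sketch below is not a translation of the Versa expression; the `AUTHORSHIP` pairs and names are invented, and it simply demonstrates how "author of all publications" reduces to an empty set difference.

```python
# Hypothetical (publication, creator) pairs, as dc:creator statements would give.
AUTHORSHIP = [
    ("pub1", "alice"),
    ("pub1", "bob"),
    ("pub2", "alice"),
]

def authors_of_all(pairs):
    """People for whom (all publications - their publications) is empty."""
    publications = {pub for pub, _ in pairs}
    by_person = {}
    for pub, person in pairs:
        by_person.setdefault(person, set()).add(pub)
    return {person for person, pubs in by_person.items()
            if len(publications - pubs) == 0}

print(authors_of_all(AUTHORSHIP))
```

No quantifier is needed: the test is just a length-zero check on a difference, which mirrors the eq(length(difference(...)), 0) shape of the query.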

10 Namespace: Return all resources whose namespace starts with "http://www.aifb.uni-karlsruhe.de/".

filter(
        'all()',
        'starts-with("http://www.aifb.uni-karlsruhe.de/")'
    )

14 Entailment: Return all instances that are members of the class Publication.

The current specification says (about rdf:type entailment):

Returns a list of all resources of a specified type, as defined by RDFS and optionally DAML schema specifications ...

So, the implementation has the option to account for entailment rules (as the 4Suite implementation does):

type(resource('#Publication'))

Chimezie Ogbuji

via Copia

Hooking up an IRC Agent to a Query Interface

Uche made an excellent suggestion: augment Emeka to work with Triclops. After finishing Triclops, I realized that most of the functionality Emeka provided was now redundant, since it could be performed using Triclops (with the added advantage of being able to diagram/navigate RDF graphs).

Unfortunately, Triclops URIs are very long for queries submitted through HTTP GET. This is mostly unavoidable because the parameters to the query service are Versa queries (which have their own syntax, including characters that may not be very friendly for URIs) and URIs. Submitting URIs to a service identified by a URI will probably always be problematic, and the problem might be as fundamental as submitting XML content through an XML-based messaging format. I'm digressing, but the solution that worked for me (which Uche suggested) was to make use of a URI 'shortening' service (such as MakeAShorterLink, which I used) to generate a 'hashed' or 'compacted' form of the URI from the parts submitted from, say, IRC.

For example, let's say I was interested in Daniel Krech's redfoot framework and wanted to navigate the ontology (available from the URL http://redfoot.net/2005/redfoot) as an RDF diagram in order to get an idea of the components and how they relate. In particular, let's say I knew to start with the redfoot:Code class. I could interactively ask Emeka to execute the query:

redfoot:Code

against the URL which returns the RDF/XML serialization of the redfoot ontology. I could also ask Emeka to have the result rendered as a JPEG diagram with an associated image map that allows me to navigate to other aspects of the ontology to learn more about redfoot. Below is how I would do so with the new .triclops service:

<chimezie> .triclops "redfoot:Code" http://redfoot.net/2005/redfoot jpeg
<Emeka> Shortened link to Triclops request: http://makeashorterlink.com/?I1F7262EB

Below is the help associated with the triclops service.

<chimezie> .help triclops
<Emeka> .triclops "..Versa Query.." <source graph> <triples|jpeg|svg|versa>
<Emeka> The first argument is quoted Versa query to execute.  The second is a URL which points to the RDF graph (N3 or RDF/XML) to query against
<Emeka> The third argument is one of "triples", "jpeg", or "svg" and specifies how to return the query result
<Emeka> "triples" - raw triples in a tabled-grid, "jpeg" or "svg" - as navigable RDF graphs, and "versa" - raw Versa datatypes (rendered as html)
<Emeka> The result is a uri (courtesy of http://makeashorterlink.com) which redirects to the appropriate Triclops request
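To show why the request URIs grow so quickly, here is a sketch of how the three parts of a .triclops command might be assembled into a GET request. The base URL and the parameter names (`query`, `source`, `render`) are assumptions made for the example and are not the actual Triclops interface; only the percent-encoding behavior, which comes from Python's standard library, is real.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TRICLOPS_BASE = "http://example.org/triclops"  # hypothetical service location

def triclops_request(versa_query, source_graph, render):
    """Build a GET URI; the parameter names here are made up for illustration."""
    params = {"query": versa_query, "source": source_graph, "render": render}
    # urlencode percent-escapes ':' and '/', so an embedded URI gets longer.
    return TRICLOPS_BASE + "?" + urlencode(params)

url = triclops_request("redfoot:Code", "http://redfoot.net/2005/redfoot", "jpeg")
print(len(url), url)

# The service side can recover the original parts unchanged.
print(parse_qs(urlsplit(url).query))
```

Even this tiny query produces a URI with an escaped URI inside it; a realistic Versa query with quotes, arrows and parentheses fares much worse, which is what makes a shortening service attractive in an IRC channel.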

[Uche Ogbuji]

via Copia