Where does the Semantic Web converge with the Computerized Patient Record?

I've been thinking a lot about the "Computer-based Patient Record" (CPR), an acronym as unlikely as GRDDL but, once again, a methodology expressed as an engineering specification. In both cases the methodology is a coherent architectural "style" that takes a mouthful of words to describe. Other examples of this:

  • Representational State Transfer
  • Rich Web Application Backplane
  • Problem-oriented Medical Record
  • Gleaning Resource Descriptions from Dialects of Languages

The term itself was coined (I think) by the Institute of Medicine [1]. If you are in healthcare and are motivated by the notion of using technology to make healthcare as effective and inexpensive as possible, you should do the Institute a favor and buy the book:

Institute of Medicine, The Computer-Based Patient Record: An Essential Technology for Health Care - Revised Edition., 1998, ISBN: 0309055326.

I've written some recent slides, which are on the W3C ESW 'wiki', that all have something to do with the idea in one way or another.

The nice thing about working in a W3C Interest Group is that the work you do is for the general public's benefit, so it is a manifestation of the W3C notion of the Semantic Web, which primarily involves a human social process.

Sorta like a technological manifestation of our natural Darwinian instinct.

That's how I think of the Semantic Web, anyway: as a very old, living thread of advancements in Knowledge Representation which intersected with an anthropological assessment of some recent web architecture engineering standards.

Technology is our greatest contribution, so it should only make sense that where we use it to better our health it should not come at a cost to us. The slides reference and include a suggested OWL-sanctioned vocabulary for essentially implementing the Problem-oriented Medical Record (a clinical methodology for problem solving).

I think the idea of a free (as in beer) vocabulary for people who need healthcare has an interesting intersection with the pragmatic parts of the Semantic Web vision (avoiding the double quotes). I have exercise-induced asthma (or was "diagnosed" as such when I was younger). I still ran track and field in high school and was okay after an initial period where my lungs had to work overtime. I wouldn't mind hosting RDF content about such a "finding" if it was to my personal benefit that a piece of software could do something useful for me in an automated, deterministic way.

"HL7 CDA" seems to be a freely avaiable, well-organized vocabulary for describing messages dispatched between hospital systems. And I recently wrote a set of XSLT templates which extract predicate logic statemnts about a CDA document using the POMR ontology and the other freely available "foundational ontologies" it coordinates. The CDA document on xml.coverpages.org has a nice concise description of the technological merits of HL7 CDA:

The HL7 Clinical Document Architecture is an XML-based document markup standard that specifies the structure and semantics of clinical documents for the purpose of exchange. Known earlier as the Patient Record Architecture (PRA), CDA "provides an exchange model for clinical documents such as discharge summaries and progress notes, and brings the healthcare industry closer to the realization of an electronic medical record. By leveraging the use of XML, the HL7 Reference Information Model (RIM) and coded vocabularies, the CDA makes documents both machine-readable (so they are easily parsed and processed electronically) and human-readable so they can be easily retrieved and used by the people who need them. CDA documents can be displayed using XML-aware Web browsers or wireless applications such as cell phones..."

The HL7 CDA was designed to "give priority to delivery of patient care. It provides cost effective implementation across as wide a spectrum of systems as possible. It supports exchange of human-readable documents between users, including those with different levels of technical sophistication, and promotes longevity of all information encoded according to this architecture. CDA enables a wide range of post-exchange processing applications and is compatible with a wide range of document creation applications."

A CDA document is a defined and complete information object that can exist outside of a messaging context and/or can be a MIME-encoded payload within an HL7 message; thus, the CDA complements HL7 messaging specifications.

If I could put up a CDA document describing the aspects of my medical history that were to my benefit to have freely available (at my discretion), I would do so in the event some piece of software could do some automated things for my benefit. Leveraging a vocabulary which essentially grounds an expressive variant of predicate logic in a transport protocol makes the chances of this happening very likely. The effect is as multiplicative as the human population.

The CPR specification is also very well engineered and much ahead of its time (it was written about 15 years ago). The only technological checkmark left is a uniform vocabulary. Consensus stands in the way of uniformity, so some group of people needs to be thinking about how the "pragmatic" and anthropological notions of the Semantic Web can be realized with a vocabulary about our personally controlled, public clinical content. Don't you think?

I was able to register the /cpr top-level PURL domain, and the URL http://purl.org/cpr/1.0/problem-oriented-medical-record.owl# resolves to the OWL ontology with commented-out imports of other very relevant OWL ontologies. Once I see a pragmatic demonstration of leaving owl:imports pointing at 'live' URLs, I'll remove the comments. It would be a shame if any Semantic Web vocabulary terms came into conflict with a legal mandate which controlled the use of a vocabulary.

Chimezie Ogbuji

via Copia

Why JSON vs XML is a yawn

Strange spate of discussion recently about XML vs. JSON. On M. David Peterson's Weblog he states what I think is the obvious: there is no serious battle between XML and JSON. They're entirely complementary. Mike Champion responds:

The same quite rational response could be given about the "war" between WS-* and REST, but that has caused quintillions of electrons to change state in vain for the last 5 years or so. The fact remains that some people with a strong attachment to a given technology howl when it is declared to be less than universal. I completely agree that the metaphor of "keep a healthy tool chest and use the right one for the job at hand" is the appropriate response to all these "wars", but such boring pragmatism doesn't get Diggs or Pagerank.

If I may be so bold as to assume that "pragmatism" includes in some aspect of its definition "that which works", I see a bit of a "one of these things is not like the other" game (sing along, Sesame Street kids) in Mike's comparison.

  • XML - works
  • JSON - works
  • REST - works
  • WS-Kaleidoscope - are you kidding me?

Some people claim that the last entry works, but I've never seen any evidence beyond the "it happened to my sister's boyfriend's roommate's cousin" variety. On the other hand, by the time you click through any ten Web sites you probably have hard evidence that the other three work, and in the case of REST you have that evidence by the time you get to your first site.

For my part, I'm a big XML cheerleader, but JSON is great because it gives people a place to go when XML isn't really right. There are many such places, as I've often said ("Should Python and XML Coexist?", "Alternatives to XML", etc.). Folks tried from the beginning to make XML right for data as well as documents, and even though I think the effort made XML more useful than its predecessors, I think it's clear folks never entirely succeeded. XML is much better suited to documents and text than records and data. The effort to make it more suitable for data leads to the unfortunate likes of WXS (good thing there's RELAX NG) and RDF/XML (good thing there's Turtle). Just think about it: XQuery for JSON. Wouldn't it be so much simpler and cleaner than our XQuery for XML? Heck, wouldn't it be pretty much...SQL?

That having been said there is one area where I see some benefit to XQuery. Mixed-mode data/document storage is inevitable given XML's impressive penetration. XQuery could be a thin layer for extracting data feeds from these mixed-mode stores, which can then be processed using JSON. If the XQuery layer could be kept thin enough, and I think a good architect can ensure this, the result could be a very neat integration. If I had ab initio control over such a system my preference would be schema annotations and super-simple RDF for data/document integration. After all, that's a space I've been working in for years now, and it is what I expect to focus on at Kadomo. But I don't expect to always be so lucky. Then again, DITA is close enough to that vision that I can be hopeful people are starting to get it, just as I'm grateful that the development of GRDDL means that people in the Semantic Web community are also starting to get it.

On the running code front I've been working on practical ways of working nicely with XML and JSON in tandem. The topic has pervaded several aspects of my professional work all at once in the past few months, and I expect to have a lot of code examples and tools to discuss here on Copia soon.
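
By way of illustration, here is a minimal sketch (Python standard library only; the <entries> structure and field names are invented for the example, not taken from any real feed) of flattening a document-ish XML result into plain records for the JSON side of such a pipeline:

import json
import xml.etree.ElementTree as ET

# Hypothetical result handed back by a mixed-mode (data + document) store
MIXED_MODE_RESULT = """
<entries>
  <entry id="1"><title>First post</title><updated>2006-12-10</updated></entry>
  <entry id="2"><title>Second post</title><updated>2006-12-11</updated></entry>
</entries>
"""

def xml_to_records(xml_text):
    # Flatten the markup into plain data records for the JSON consumer
    root = ET.fromstring(xml_text)
    return [{"id": entry.get("id"),
             "title": entry.findtext("title"),
             "updated": entry.findtext("updated")}
            for entry in root.findall("entry")]

print(json.dumps(xml_to_records(MIXED_MODE_RESULT), indent=2))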

[Uche Ogbuji]

via Copia

From Fourthought to Kadomo

I founded Fourthought in June, 1998 with three other friends from college. Eight and a half years doesn't sound that long when I say it, but the near-decade fills my rear view mirror so completely that I can scarcely remember having done anything before it. That's probably a good thing as it means I don't much remember the years of perfunctory consulting at places such as IBM Global Services and Sabre Decision Technologies prior to making the leap to relative independence. It was in part the typical entrepreneurial yen of the immigrant and in part the urge to chart my own high-tech career course that drove me to take the risk and endure the ups and downs of running a consultancy.

And I did say Fourthought is in the rear-view mirror. Last week I accepted a position at The Kadomo Group, a very young solutions company focused in the semantic Web space. Kadomo was founded by Eric Miller, former Semantic Web Activity Lead at the W3C. Eric and I have always looked for ways we could work together considering our shared interest in how strategic elements of the semantic Web vision can be brought to bear in practice. He and the other bright and energetic folks coming together under the Kadomo banner were a major part of my decision to join. It was also made clear to me that I would have a sizeable role in shaping all aspects of the company. I would be able, and in fact encouraged, to continue my leadership in open source projects and community specification development. Last but not least, the culture of the company is set up to suit my lifestyle very well, which was always one tremendous benefit of Fourthought.

Without a doubt we have the seeds at Kadomo to grow something much greater than Fourthought was ever likely to be. The company has not neglected resources for high-caliber business development, operations, or marketing. Committing to these resources was something we always had a hard time doing at Fourthought, and this meant that even though we had brilliant personnel, strong client references, and a market profile disproportionate to the resources we devoted to marketing, we were never able to grow to even a fraction of our potential. I've learned many of these lessons the hard way, and it seems clear to me that Kadomo is born to greater ambition. One good sign is that I'll just be Chief Technical Architect, allowed to focus primarily on the company's technology strategy. I will not be stranded juggling primary sales, operations, and lead consultant responsibilities. Another good sign is that product development is woven into the company's foundation, so I can look forward to greater leverage of small-company resources.

Considering my primary responsibility for technology strategy, it may seem strange to some that I'd join a semantic Web company, knowing that I have expressed such skepticism about the direction core semantic Web technology has taken lately. I soured on the heaping helping of gobbledygook that was laden on RDF in the post-2000 round of specs, and I soured on SPARQL as a query language when it became clear that it was to be as ugly and inelegant as XQuery. There have been some bright spots of lightweight goodness such as GRDDL and SKOS, but overall I've found myself more and more focused on XML schema and transform technology. My departure point for the past few years has been that a well-annotated syntactic Web can meet all the goals I personally have for the semantic Web. I've always been pretty modest in what I want from semantics on the Web. To put it bluntly, what interests me most is reducing the cost of screen-scraping. Of course, as I prove every day in my day job, even such an unfashionable goal leads to the sorts of valuable techniques that people prefer to buzz about using terms such as "enterprise mashups". Not that I begrudge folks their buzzwords, mind you.

I still think some simplified version or profile of RDF can be very useful, and I'll be doing what I can to promote a pragmatic approach to the semantic Web at Kadomo, building on the mountains of XML that vendors have winked and nodded into IT and the Web, much of it a hopeless congeries. There is a ton of problems in this space, and I believe, accordingly, a ton of opportunity. I think mixing in my somewhat diffractive view of the semantic Web will make for interesting discussion at Kadomo, and a lot of that will be reflected here on Copia, which, after all, I share with Chimezie, one of the most accomplished users of semantic Web technology to solve real-world problems.

One ongoing opportunity I don't plan to leave behind is my strong working relationship with the Web Platform Engineering group at Sun. With recent, hard-earned success in hand, and much yet to accomplish, we're navigating the paper trail to allow for a smooth transition from my services as a Fourthought representative to those as a Kadomo representative.

I hope some of you will consider contacting Kadomo to learn more about our services and solutions. We're just getting off the ground, but we have a surprising amount of structure in place for bringing focus to our service offerings, and we have some exciting products in development of which you'll soon be hearing more. If you've found my writings useful or examples of my work agreeable, do keep me in mind as I plough into my new role. Keep in touch.

Updated to reflect the final settling into Zepheira. Most other bits are still relevant.

[Uche Ogbuji]

via Copia

Atom Feed Semantics

Not a lot of people outside the core Semantic Web community actually want to create RDF, but extracting it from what's already there can be useful for a wide variety of projects. (RSS and Atom are first and relatively easy steps in that direction.)

Terminal dump

chimezie@Zion:~/devel/grddl-hg$ python GRDDL.py --debug --output-format=n3 --zone=https: --ns=aowl=http://bblfish.net/work/atom-owl/2006-06-06/# --ns=iana=http://www.iana.org/assignments/relation/ --ns=some-blog=http://example.org/2003/12/13/  https://sommer.dev.java.net/atom/2006-06-06/transform/atom-grddl.xml
binding foaf to http://xmlns.com/foaf/0.1/
binding owl to http://www.w3.org/2002/07/owl#
binding iana to http://www.iana.org/assignments/relation/
binding rdfs to http://www.w3.org/2000/01/rdf-schema#
binding wot to http://xmlns.com/wot/0.1/
binding dc to http://purl.org/dc/elements/1.1/
binding aowl to http://bblfish.net/work/atom-owl/2006-06-06/#
binding rdf to http://www.w3.org/1999/02/22-rdf-syntax-ns#
binding some-blog to http://example.org/2003/12/13/
Attempting a comprehensive glean of  https://sommer.dev.java.net/atom/2006-06-06/transform/atom-grddl.xml
@@fetching:  https://sommer.dev.java.net/atom/2006-06-06/transform/atom-grddl.xml
@@ignoring types: ('application/rdf+xml', 'application/xml', 'text/xml', 'application/xhtml+xml', 'text/html')
applying transformation https://sommer.dev.java.net/atom/2006-06-06/transform/atom2turtle_xslt-1.0.xsl
@@fetching:  https://sommer.dev.java.net/atom/2006-06-06/transform/atom2turtle_xslt-1.0.xsl
@@ignoring types: ('application/xml',)
Parsed 22 triples as Notation 3
Attempting a comprehensive glean of  http://www.w3.org/2005/Atom

Via atom2turtle_xslt-1.0.xslt and Atom OWL: The GRDDL result document:

@prefix aowl: <http://bblfish.net/work/atom-owl/2006-06-06/#>.
@prefix iana: <http://www.iana.org/assignments/relation/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix some-blog: <http://example.org/2003/12/13/>.
[ a aowl:Feed;
     aowl:author [ a aowl:Person;
             aowl:name "John Doe"];
     aowl:entry [ a aowl:Entry;
             aowl:id "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"^^<http://www.w3.org/2001/XMLSchema#anyURI>;
             aowl:link [ a aowl:Link;
                     aowl:rel iana:alternate;
                     aowl:to [ aowl:src some-blog:atom03]];
             aowl:title "Atom-Powered Robots Run Amok";
             aowl:updated "2003-12-13T18:30:02Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>];
     aowl:id "urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6"^^<http://www.w3.org/2001/XMLSchema#anyURI>;
     aowl:link [ a aowl:Link;
             aowl:rel iana:alternate;
             aowl:to [ aowl:src <http://example.org/>]];
     aowl:title "Example Feed";
     aowl:updated "2003-12-13T18:30:02Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>].

Planet Atom's feed

@prefix : <http://bblfish.net/work/atom-owl/2006-06-06/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix iana: <http://www.iana.org/assignments/relation/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[] a :Feed ;
:id "http://planetatom.net/"^^xsd:anyURI;
:title "Planet Atom" ;
:updated "2006-12-10T06:57:54.166890Z"^^xsd:dateTime;
:generator [ a :Generator;
            :uri <>;
            :generatorVersion "";
            :name """atomixlib"""];
 :entry [  a :Entry;
           :title "The Darfur Wall" ;
           :author [ a :Person; :name "James Tauber"] ;
           :link [ a :Link;
                     :rel iana:alternate ;
                     :to [ :src <http://jtauber.com/blog/2006/12/10/the_darfur_wall>;]          
           ];
:updated "2006-12-10T00:13:34Z"^^xsd:dateTime;
:published "2006-12-10T00:13:34Z"^^xsd:dateTime;
:id "http://jtauber.com/blog/2006/12/10/the_darfur_wall"^^xsd:anyURI; ]

[Uche Ogbuji]

via Copia

XML 2006 Synopsis: Are we there yet?

Well, XML 2006 came and went with a rather busy bang. My presentation on using XSLT to generate XForms (from XUL/XHTML) was well attended, and I hope it helped increase awareness of the importance and value of XForms, (perhaps) the only comprehensive vehicle by which XML can be brought to the web in the way proponents of XML have had in mind for some time. As Simon puts it:

XML pretty (much) completely missed its original target market. SGML culture and web developer culture seemed like a poor fit on many levels, and I can't say I even remember a concerted effort to explain what XML might mean to web developers, or to ask them whether this new vision of the Web had much relationship to what they were doing or what they had planned. SGML/XML culture and web culture never really meshed.

Most of the questions I received had to do with our particular choice of FormsPlayer (an Internet Explorer plugin) instead of other alternatives such as Orbeon, Mozilla, Chiba, etc. This was a bit unfortunate and an indication of a much larger problem in this particular area of innovation we lovingly coin 'web 2.0'. I will get back to this later.

I was glad to hear John Boyer tell me he was pleasantly surprised to see mention of the Rich Web Application Backplane W3C Note. Mark Birbeck and Micah Dubinko (fellow XForms gurus and visionaries in their own rights) didn't let this pass over their radar, either.

I believe the vision outlined in that note is much more lucid than a lot of the hype-centered notions of 'web 2.0' which seem more focused on painting a picture of scattered buzzwords ('mash-ups', AJAX etc..) than commonalities between concrete architectures.

Though this architectural style accommodates solutions based on scripting (AJAX) as well as more declarative approaches, I believe the primary value is in freeing web developers from the 80% of scripting that is a result of not having an alternative (READ: browser vendor monopolies) rather than being the appropriate solution for the job. I've jousted with Kurt Cagle before on this topic and Mark Birbeck has written extensively on this as well.

In writing the presentation, I sort of stumbled upon some interesting observations about XUL and XForms:

  • XUL relies on a static, inarticulate means of binding components to their behavior
  • XForms relies on XPath for doing the same
  • XUL relies completely on JavaScript to define the behavior of its widgets / components
  • A more complete mapping from XUL to XForms (than the one I composed for my talk) could be valuable to those more familiar with XUL as a bridge to XForms.

At the very least, it was a great way to familiarize myself with XUL.

In all, I left Boston feeling like I had experienced a very subtle anti-climax as far as innovation was concerned.
If I were to plot a graph of innovative progression over time, it would seem to me that the XML space has plateaued of late, and that political in-fighting and spec proliferation have overtaken truly innovative ideas. I asked Harry Halpin about this and his take on it was that perhaps "XML has won". I think there is some truth to this, though I don't think XML has necessarily made the advances that were hoped for in the web space (as Simon St. Laurent put it earlier).

There were a few exceptions, however.

XML Pipelines

I really enjoyed Norm Walsh's presentation on XProc; it was an example of scratching a very real itch: consensus on a vocabulary for XML processing workflows. Ironically, it probably wouldn't take much to implement in 4Suite, as support for most (if not all) of the pipeline operations is already there.

I did ask Norm whether XProc would support setting up XPath variables for operations that rely on them and was pleased to hear that they had that in mind. I also asked about support for non-standard XML operations such as XUpdate and was pleased to hear that they had that covered as well. It is worth noting that XUpdate by itself could make the viewport operation rather redundant.

The Semantic Web Contingent

There was noticeable representation by semantic web enthusiasts (myself, Harry Halpin, Bob DuCharme, Norm Walsh, Elias Torres, Eric Prud'hommeaux, Ralph Hodgson, etc.) and their presentations had somewhat subdued tones, (perhaps) so as not to incite ravenous bickering from narrow-minded enthusiasts. There was still some of that, however, as I was asked by someone why RDF couldn't be persisted natively as XML, queried via XQuery, and inferred over via extension functions! Um... right... There is some irony in that, as I have yet to find a legitimate reason myself to even use XQuery in the first place.

The common scenario is when you need to query across a collection of XML documents, but I've personally preferred to index XML documents with RDF content (extracted from a subset of the documents), match the documents via RDF, isolate a document, and evaluate an XPath against it, essentially bypassing the collection extension to XPath with a 'semantic' index. Of course, this only makes sense where there is a viable mapping from XML to RDF, but where there is one I've preferred this approach. But to each his/her own.
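
A rough sketch of what that 'semantic index' approach can look like (RDFLib plus lxml; the ex:sourceDocument predicate, the index file name, the topic URI, and the XPath are all hypothetical stand-ins, not part of any real vocabulary):

from rdflib import Graph, Namespace, URIRef
from lxml import etree

EX = Namespace("http://example.org/index#")

index = Graph()
index.parse("document-index.rdf")   # RDF previously extracted from the XML collection

# 1. Use the RDF index to isolate the relevant document(s)
topic = URIRef("http://example.org/topic/asthma")
for doc_location in index.objects(topic, EX.sourceDocument):
    # 2. Evaluate the XPath against just that document, bypassing the collection
    doc = etree.parse(str(doc_location))
    for node in doc.xpath("//section[@type='finding']"):
        print(etree.tostring(node, pretty_print=True).decode())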

Content Management API's

I was pleasantly surprised to learn from Joel Amousou that there is a standard (a datastore- and language-agnostic standard?) for CMS APIs called JSR-170. The 4Suite repository is the only content management system / API with a well thought-out architecture for integrating XML and RDF persistence and processing in a way that emphasizes their strengths with regard to content management. Perhaps there is some merit in investigating the possibility of porting (or wrapping) the 4Suite repository API as JSR-170? Joel seems to think so.

Meta-stylesheets

Michael Kay had a nice synopsis of the value of generating XSLT from XSLT – a novel mechanism I've been using for some time – and it was interesting to note that one of his prior client projects involved a pipeline that started with an XForm, was post-processed by XSLT, and was aggregated with results from an XQuery (also generated from XSLT).

Code generation is a valuable pattern with plenty of unrecognized value in the XML space, and I was glad to see Michael Kay highlight this. He had some choice words on when to use XSLT and when to use XQuery that I thought were on point: use XSLT for re-purposing and formatting, and use XQuery for querying your database.

GRDDL

Finally, I spent quite some time with Harry Halpin (chair of the GRDDL Working Group) helping him install and use the 4Suite / RDFLib client I recently wrote for use with the GRDDL test suite. You can take what I say with a grain of salt (as I am a member and loud, vocal supporter), but I think GRDDL will end up having a more influential impact on the semantic web vision (which I believe is much less important than the technological components it relies on to fulfill that vision) and on XML adoption on the web than any other spec, primarily because it allows content publishers to leverage the full spectrum of both XML and RDF technologies.
Within my presentation, I mention an architectural style I call 'modality segregation' that captures the value proposition of XSLT for drawing sharp, distinguishable boundaries (where there were once none) between:

  • content
  • presentation
  • meaning (semantics)
  • application behavior

I believe it's a powerful idiom for managing, publishing, and consuming data & knowledge (especially over the web).

Harry demonstrated how easy it is to extract review data, vocabulary mappings, and social networks (the primary topic of his talk) from XHTML that would ordinarily be dormant with regard to everything other than presentation.
We ran into a few snafus with 4Suite when we tried to run Norm Walsh's hCard2RDF.xslt against Dan Connolly's web site and Harry's home page. We also ran into problems with the client (which is mostly compliant with the Working Draft).

I also had the chance to set Harry up with my blazingly fast RETE-based N3 reasoner, which we used to test GRDDL-based identity consolidation by piping multiple GRDDL results (from XHTML with embedded XFN) into the reasoner, performing an OWL DL closure, and identifying duplicate identities via inverse functional properties (smushing).
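
For a sense of what that last step amounts to, here is a much-simplified sketch of 'smushing' over plain RDFLib graphs, treating foaf:mbox as the inverse functional property and reporting subjects that share a value for it (the actual run used the RETE reasoner and an OWL DL closure, not this shortcut):

from collections import defaultdict
from rdflib import Graph, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def smush_candidates(graph):
    # Group subjects by their foaf:mbox value; any group larger than one is a
    # set of resources an IFP-aware reasoner would identify with each other.
    by_mbox = defaultdict(set)
    for person, mbox in graph.subject_objects(FOAF.mbox):
        by_mbox[mbox].add(person)
    return {mbox: people for mbox, people in by_mbox.items() if len(people) > 1}

merged = Graph()
# merged.parse(...) each GRDDL result (FOAF extracted from XHTML/XFN) here
for mbox, people in smush_candidates(merged).items():
    print(mbox, "identifies", people)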

As a result of our 5+ hour hackathon, I ended up writing 3 utilities that I hope to release once I find a proper place for them:

  • FOAFVisualizer - A command-line tool for merging and rendering FOAF networks in a 'controlled' and parameterized manner
  • RDFPiedPipe - A command-line tool for converting between the syntaxes that RDFLib supports: N3, Ntriples, RDF/XML
  • Kaleidos - A library used by FOAFVisualizer to control every aspect of how an RDF graph (or any other network structure) is exported to a graphviz diagram via BGL-Python bindings.

In the final analysis, I feel as if we have reached a climax in innovation only to face a bigger challenge from politics than anything else:

  • RDFa versus eRDF
  • SPARQL without entailment versus SPARQL with OWL entailment
  • XHTML versus HTML5
  • Web Forms versus XForms
  • Web 2.0 versus Web 3.0
  • AJAX versus XForms
  • XQuery versus XSLT
  • XQuery over RDF/XML versus SPARQL over abstract RDF
  • XML 1.0 specifications versus the new 2.0 specifications

The list goes on. I expressed my concerns about the danger of technological camp warfare to Liam Quin (XML Activity Lead) and he concurred. We should spend less time arguing over whether or not my spec is more l33t than yours and more time asking the more pragmatic questions about what solutions work best for the problem(s) at hand.

[Uche Ogbuji]

via Copia

A Relational Model for FOL Persistence

A short while ago I was rather engaged in investigating the most efficient way to persist RDF on relational database management systems. One of the outcomes of this effort that I have yet to write about is a relational model for Notation 3 abstract syntax and a fully functioning implementation - which is now part of RDFLib's MySQL drivers.

It's written with Soft4Science's SciWriter and seems to render natively in Firefox alone (I haven't tried any other browser).

Originally, I kept coming at it from a pure computer science approach (programming and data structures) but eventually had to roll up my sleeves and get down to the formal logic level (i.e., the deconstructionist, computer engineer approach).

Partitioning the KR Space

The first method with the most impact was separating assertional box statements (statements of class membership) from the rest of the knowledge base. When I say knowledge base, I mean a 'named' aggregation of all the named graphs in an RDF database. Partitioning the table space has the universal effect of shortening indices and reducing the average number of rows that need to be scanned, even in the worst-case scenario for a SQL optimizer. The nature of RDF data (at the syntactic level) is a major factor: RDF is a Description Logics-oriented representation and thus relies heavily on statements of class membership.

The relational model is all about representing everything as specific relations, and the 'instantiation' relationship is a perfect candidate for a database table.

Eventually, it made sense to create additional table partitions for:

  • RDF statements between resources (where the object is not an RDF Literal).
  • RDF's equivalent to EAV statements (where the object is a value or RDF Literal).

Matching triple patterns against these partitions can be expressed using a decision tree which accommodates every combination of RDF terms. For example, the triple pattern:

?entity foaf:name "Ikenna"

Would only require a scan through the indices for the EAV-type RDF statements (or the whole table if necessary - but that decision is up to the underlying SQL optimizer).
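
A hedged sketch of that decision tree (the partition names are mine, and this is illustrative Python rather than the actual driver code):

from rdflib import Literal, URIRef, BNode
from rdflib.namespace import RDF

ABOX = "associativeBox"                   # class-membership statements
RELATIONS = "relations"                   # resource-to-resource statements
LITERAL_PROPERTIES = "literalProperties"  # EAV-style statements

def target_partitions(predicate, obj):
    # None stands for an unbound term in the triple pattern
    if predicate == RDF.type:
        return [ABOX]
    if isinstance(obj, Literal):
        return [LITERAL_PROPERTIES]
    if isinstance(obj, (URIRef, BNode)):
        # non-literal object: class membership is possible only if the predicate is unbound
        return [RELATIONS] if predicate is not None else [ABOX, RELATIONS]
    # object unbound: several partitions may hold matches
    return [RELATIONS, LITERAL_PROPERTIES] if predicate is not None \
        else [ABOX, RELATIONS, LITERAL_PROPERTIES]

# The pattern ?entity foaf:name "Ikenna" only touches the literal partition:
print(target_partitions(URIRef("http://xmlns.com/foaf/0.1/name"), Literal("Ikenna")))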

Using Term Type Enumerations

The second method involves the use of the enumeration of all the term types as an additional column whose indices are also available for a SQL query optimizer. That is:

ANY_TERM = ['U','B','F','V','L']

The terms can be partitioned into the exact allowable set for certain kinds of RDF terms:

ANY_TERM = ['U','B','F','V','L']
CONTEXT_TERMS   = ['U','B','F']
IDENTIFIER_TERMS   = ['U','B']
GROUND_IDENTIFIERS = ['U']
NON_LITERALS = ['U','B','F','V']
CLASS_TERMS = ['U','B','V']
PREDICATE_NAMES = ['U','V']

NAMED_BINARY_RELATION_PREDICATES = GROUND_IDENTIFIERS
NAMED_BINARY_RELATION_OBJECTS    = ['U','B','L']

NAMED_LITERAL_PREDICATES = GROUND_IDENTIFIERS
NAMED_LITERAL_OBJECTS    = ['L']

ASSOCIATIVE_BOX_CLASSES    = GROUND_IDENTIFIERS

For example, the Object term of an EAV-type RDF statement doesn't need an associated column for the kind of term it is (the relation is explicitly defined as those RDF statements where the Object is a Literal - L).

Efficient Skolemization with Hashing

Finally, thanks to Benjamin Nowack's related efforts with ARC (a PHP-based implementation of an RDF / SPARQL storage system), Mark Nottingham's suggestion, and an earlier paper by Stephen Harris and Nicholas Gibbins (3store: Efficient Bulk RDF Storage), a final method was employed: using a half-hash (half of an MD5 hash) of the RDF identifiers in the 'statement' tables. The statement tables each used an unsigned MySQL BIGINT to encode the half-hash in base 10 and used it as a foreign key into two separate tables:

  • A table for identifiers (with a column that enumerated the kind of identifier it was)
  • A table for literal values

The key to both tables was the 8-byte unsigned integer which represented the half-hash.

This of course introduces a possibility of collision (due to the reduced hash size), but hashing the identifier along with the term type further dilutes the lexical space and reduces the collision risk. This latter part is still a theory I haven't formally proven (or disproven) but hope to. At the maximum volume (around 20 million RDF assertions) I can resolve a single triple pattern in 8 seconds on an SGI machine and there are no collisions - the implementation includes a (disabled by default) collision detection mechanism.
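
A sketch of that hashing scheme (illustrative only - the exact normalization of terms in the real driver may differ):

import hashlib

def half_md5_key(lexical_form, term_type):
    # term_type is one of the single-letter codes enumerated above: U, B, F, V, L.
    # Hashing the term type along with the lexical form dilutes the lexical space.
    digest = hashlib.md5((term_type + lexical_form).encode("utf-8")).hexdigest()
    # Keep the first 8 bytes (half the MD5) so the key fits an unsigned BIGINT,
    # which MySQL stores and compares as a base-10 integer.
    return int(digest[:16], 16)

print(half_md5_key("http://xmlns.com/foaf/0.1/name", "U"))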

The implementation includes all the magic needed to generate SQL statements to create, query, and manage indices for the tables in the relational model. It does this from a Python model that encapsulates the relational model and methods to carry out the various SQL-level actions needed by the underlying DBMS.

For me, it has satisfied my need for an open-source, maximally efficient RDBMS-backed store upon which large volumes of RDF can be persisted, within named graphs, with the ability to persist Notation 3 formulae in a separate manner (consistent with Notation 3 semantics).

I called the Python module FOPLRelationModel because, although it is specifically a relational model for Notation 3 syntax, it covers much of the requirements for the syntactic representation of First Order Logic in general.

Chimezie Ogbuji

via Copia

What does GRDDL have to do with Intelligent Agents?

GRDDL. What is it? Why the long name? It does something very specific that requires a long name to describe it. The etymology of biological names includes examples of the same phenomenon in a different discipline. I started writing on this weblog mainly as a way to regularly exercise my literary expression, so (to that end) I'm going to try to explain GRDDL in as few words as I can while simultaneously embellishing.

It is a language (or dialect) translator. It Gleans (gathers or harvests) Resource Descriptions. Resource Descriptions can be thought of as referring to the use of constructs in Knowledge Representation. These constructs are often used to make assertions about things in sentence form - from which additional knowledge can be inferred. However, it is also the 'Resource Description' in RDF (no coincidence there). RDF is the target dialect. GRDDL acts as an intelligent agent (more on this later) that performs translations from specific (XML) vocabularies, or Dialects of Languages, to the abstract RDF syntax.

Various languages can be used but there is a natural emphasis on a language (XSLT) with a native ability to process XML.
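
To make the mechanics concrete, here is a bare-bones illustration of the gleaning step (nothing like a complete GRDDL client: it only looks for a grddl:transformation attribute on the root element of an XML document, assumes the transform emits RDF/XML, and skips profiles, base-URI handling, and error handling):

from urllib.request import urlopen
from lxml import etree
from rdflib import Graph

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"

def glean(source_url):
    doc = etree.parse(urlopen(source_url))
    # The root element names its transforms in the grddl:transformation attribute
    transforms = doc.getroot().get("{%s}transformation" % GRDDL_NS, "").split()
    result = Graph()
    for xslt_url in transforms:
        xslt = etree.XSLT(etree.parse(urlopen(xslt_url)))
        rdf_xml = etree.tostring(xslt(doc))
        result.parse(data=rdf_xml, format="xml")  # assumes an RDF/XML result
    return result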

GRDDL is an XML & RDF formalism in what I think is a hidden pearl of web architecture: a well-engineered environment for distributed processing by intelligent agents. It's primarily the well-engineered nature of web architecture that lends the necessary autonomy that intelligent agents require. Though hidden, there is much relevance with contemporaries, predecessors, and distant cousins.

It earns its keep mostly with small, well-designed XML formats. As a host language for XSLT it sets out to be (perhaps) a bridge across the great blue and red divide of XML & RDF. To quote a common parlance: watch this space.

 

Chimezie Ogbuji

via Copia

Patterns and Optimizations for RDF Queries over Named Graph Aggregates

In a previous post I used the term 'Conjunctive Query' to refer to a kind of RDF query pattern over an aggregation of named graphs. However, the term (apparently) has already-established roots in database querying and has a different meaning than what I intended. It's a pattern I have come across often and it is, for me, a major requirement for an RDF query language, so I'll try to explain by example.

Consider two characters, King (Wen) and his heir / son (Wu) of the Zhou Dynasty. Let's say they each have a FOAF graph about themselves and the people they know within a larger database which holds the FOAF graphs of every historical character in literature.

The FOAF graphs for both Wen and Wu are shown below (each preceded by the name of its graph):

<urn:literature:characters:KingWen>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wen_of_Zhou> a :Person;
    :name "King Wen";
    :mbox <mailto:kingWen@historicalcharacter.com>;
    rel:parentOf [ a :Person; :mbox <mailto:kingWu@historicalcharacter.com> ].

<urn:literature:characters:KingWu>

@prefix : <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.

<http://en.wikipedia.org/wiki/King_Wu_of_Zhou> a :Person;
    :name "King Wu";
    :mbox <mailto:kingWu@historicalcharacter.com>;
    rel:childOf [ a :Person; :mbox <mailto:kingWen@historicalcharacter.com> ].

In each case, Wikipedia URLs are used as identifiers for each historical character. There are better ways for using Wikipedia URLs within RDF, but we'll table that for another conversation.

Now let's say a third party reads a few stories about "King Wen" and finds out he has a son; however, he/she doesn't know the son's name or the URL of either King Wen or his son. If this person wants to use the database to find out about King Wen's son by querying it with a reasonable response time, he/she has a few things going for him/her:

  1. foaf:mbox is an owl:InverseFunctionalProperty and so can be used for uniquely identifying people in the database.
  2. The database is organized such that all the out-going relationships (between foaf:Persons – foaf:knows, rel:childOf, rel:parentOf, etc..) of the same person are asserted in the FOAF graph associated with that person and nowhere else.
    So, the relationship between King Wen and his son, expressed with the term ref:parentOf, will only be asserted in
    urn:literature:characters:KingWen.

Yes, the idea of a character from an ancient civilization with an email address is a bit cheeky, but foaf:mbox is the only inverse functional property in FOAF to use with this example, so bear with me.

Now, both Versa and SPARQL support restricting queries with the explicit name of a graph, but there are no constructs for determining all the contexts of an RDF triple, or:

The names of all the graphs in which a particular statement (or statements matching a specific pattern) are asserted.

This is necessary for a query plan that wishes to take advantage of [2]. Once we know the name of the graph in which all statements about King Wen are asserted, we can limit all subsequent queries about King Wen to that same graph without having to query across the entire database.

Similarly, once we know the email of King Wen's son we can locate the other graphs with assertions about this same email address (knowing they refer to the same person [1]) and query within them for the URL and name of King Wen's son. This is a significant optimization opportunity and key to this query pattern.

I can't speak for other RDF implementations, but RDFLib has a mechanism for this at the API level: a method called quads((subject, predicate, object)) which takes a triple pattern and returns tuples of size 4, corresponding to all the triples (across the database) that match the pattern, along with the graph in which each triple is asserted:

for s, p, o, containingGraph in aConjunctiveGraph.quads((s, p, o)):
  ... do something with containingGraph ..

It's likely that most other QuadStores have similar mechanisms, and given the great value in optimizing queries across large aggregations of named RDF graphs, it's a strong indication that RDF query languages should provide the means to express such a mechanism.
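
Here is a sketch of the two-part pattern over RDFLib's API, mirroring the King Wen example (graph loading is elided; only the control flow matters, and the exact behavior depends on the store backing the ConjunctiveGraph):

from rdflib import ConjunctiveGraph, Namespace, Literal

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
REL = Namespace("http://purl.org/vocab/relationship/")

db = ConjunctiveGraph()   # the aggregate of all the character FOAF graphs
# db.get_context(...).parse(...) each named graph here

# Part 1: one scan across the whole database to learn which graph describes King Wen
for kingWen, _, _, wenGraph in db.quads((None, FOAF.name, Literal("King Wen"))):
    # Part 2: every further question about King Wen stays inside wenGraph
    for _, _, child in wenGraph.triples((kingWen, REL.parentOf, None)):
        print("King Wen's son has mbox", wenGraph.value(child, FOAF.mbox))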

Most of what is needed is already there (in both Versa and SPARQL). Consider a SPARQL extension function which returns a boolean indicating whether the given triple pattern is asserted in a graph with the given name:

rdfg:AssertedIn(?subj,?pred,?obj,?graphIdentifier)

We can then get the email of King Wen's son efficiently with:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?mbox
WHERE {
    GRAPH ?foafGraph {
      ?kingWen :name "King Wen";
               rel:parentOf [ a :Person; :mbox ?mbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWen,:name,"King Wen",?foafGraph)).
}

Now, it is worth noting that this mechanism can be supported explicitly by asserting provenance statements associating the people the graphs are about with the graph identifiers themselves, such as:

<urn:literature:characters:KingWen> 
  :primaryTopic <http://en.wikipedia.org/wiki/King_Wen_of_Zhou>.

However, I think that the relationship between an RDF triple and the graph in which it is asserted, although currently outside the scope of the RDF model, should have its semantics outlined in the RDF abstract syntax instead of relying on terms in an RDF vocabulary. The demonstrated value in RDF query optimization makes for a strong argument:

PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

SELECT ?kingWu ?sonName
WHERE {
    GRAPH ?wenGraph {
      ?kingWen :name "King Wen";
               :mbox ?wenMbox;
               rel:parentOf [ a :Person; :mbox ?wuMbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWen,:name,"King Wen",?wenGraph)).
    GRAPH ?wuGraph {
      ?kingWu :name ?sonName;
              :mbox ?wuMbox;
              rel:childOf [ a :Person; :mbox ?wenMbox ].
    }
    FILTER (rdfg:AssertedIn(?kingWu,:name,?sonName,?wuGraph)).
}

Generally, this pattern is any two-part RDF query across a database (a collection of multiple named graphs) where the first part is scoped to the entire database and identifies terms that are local to a specific named graph, and the second part is scoped to that named graph.

Chimezie Ogbuji

via Copia

What Do Closed Systems Have to Gain From SW Technologies?

Aaron Swartz asked that I elaborate on a topic that is dear to me and I didn't think a blog comment would do it justice, so here we are :)

The question is what do single-purpose (closed) databases have to gain from SW technologies. I think the most important misconception to clear up first is the idea that XML and Semantic Web technologies are mutually exclusive.
They are most certainly not.

It's not that I think Aaron shares this misconception, but I think the main reason why the alternative approach to applying SW technologies that he suggests isn't very well spoken for is that quite a few people on the opposing sides of the issue assume that XML (and its whole strata of protocols and standards) and RDF/OWL (the traditionally celebrated components of the SW) are mutually exclusive. There are other misconceptions that hamper this discussion, such as the assumption that the SW is an all-or-nothing proposition, but that is a whole other thread :)

As we evolve towards a civilization where the value in information and its synthesis is of increasing importance, 'traditional' data mining, expressiveness of representation, and portability become more important for most databases (single-purpose or not).

These are areas that these technologies are meant to address, precisely because "standard database" software / technologies are simply not well suited to these specific requirements. Not all databases are alike, and so it follows that not all databases will have these requirements: consider databases whose primary purpose is the management of financial transactions.

Money is money, arithmetic is arithmetic, and the domain of money exchange and management is for the most part static, so traditional / standard database technologies will suffice. Sure, it may be useful to be able to export a bank statement in a portable (perhaps XML-based) format, but inevitably the value in using SW-related technologies is very minimal.

Of course, you could argue that online banking systems have a lot to gain from these technologies, but the example was of pure transaction management; the portal that manages the social aspects of money management is a layer on top.

However, where there is a need to leverage:

  • More expressive mechanisms for data collection (think XForms)
  • (Somewhat) unambiguous interpretation of content (think FOL and DL)
  • Expressive data mining (think RDF querying languages)
  • Portable message / document formats (think XML)
  • Data manipulation (think XSLT)
  • Consistent addressing of distributed resources (think URLs)
  • General automation of data management (think Document Definitions and GRDDL)

These technologies will have an impact on how things are done. It's worth noting that these needs aren't restricted to distributed databases (which is the other assumption about the Semantic Web - that it only applies within the context of the 'Web'). Consider the Wiki example and the advantages that Semantic Wikis have over them:

  • Much improved possibility of data mining from a more formal representation of content
  • 'Out-of-the-box' interoperability with tools that speak in SW dialects
  • The possibility of a certain amount of automation from the capabilities that interpretation brings

It's also worth noting that the Semantic Wiki project recently introduced mechanisms for using other vocabularies for 'marking up' content (FOAF being the primary vocabulary highlighted).

It's doubly important in that 1) it demonstrates the value in incorporating well-established vocabularies with relative ease and 2) the policed way in which these additional vocabularies can be used demonstrates precisely the middle ground between the very liberal, open-world-assumption approach to distributed data in the SW and the controlled, closed, (single-purpose) systems approach.

Such constraints can allow for some level of uniformity that can have very important consequences in very different areas: XML as a messaging interlingua and extraction of RDF.

Consider the value in developing a closed vocabulary with its semantics spelled out very unambiguously in RDF/RDFS/OWL, a uniform XML representation of its instances, and an accompanying XSLT transform (something the AtomOWL project is attempting to achieve).

What do you gain? For one thing, XForms-based data entry for the uniform XML instances and a direct, (relatively) unambiguous mapping to a more formal representation model – each of which has its own very long list of advantages that it brings by itself, much less in tandem!

Stand-alone databases (where their needs intersect with the value in SW technologies) stand to gain: portable, declarative data entry mechanisms; interoperability; much improved capabilities for interpretation and synthesis of existing information; increased automation of data management (by closing the system, certain operations become much more predictable); and additional possibilities for alternative reasoning heuristics that take advantage of closed-world assumptions.

Chimezie Ogbuji

via Copia

Mapping Rete algorithm to FOL and then to RDF/N3

Well, I was hoping to hold off on writing about the ongoing work I've been doing with FuXi until I could get a decent test suite working, but I've been engaged in several (older) threads that left me wanting to elaborate a bit.

There is already a well-established precedent in Python/N3/RDF reasoners (Euler, CWM, and Pychinko). FuXi used to rely on Pychinko, but I decided to write a Rete implementation for N3/RDF from scratch - trying to leverage the host language's idioms (hashing, mappings, containers, etc.) as much as possible in areas where they could make a difference in rule evaluation and compilation.

What I have so far is more Rete-based than a pure Rete implementation, but the difference comes mostly from the impedance between the representation components in the original algorithm (which are very influenced by FOL and Knowledge Representation in general) and those in the semantic web technology stack.

Often with RDF/OWL there is more talk than necessary, so I'll get right to the meat of the semantic mapping I've been using. This assumes some familiarity with the original algorithm.

Tokens

The working memory of the network is fed by an N3 graph. Tokens represent the propagation of RDF triples (no variables or formula identifiers) from the source graph through the rule network. These can represent token additions (where triples are added to the graph - in which case the live network can be associated with a live RDF graph) or token removals (where triples are removed from the source graph). When tokens pass an alpha node's intra-element test (see below), they are passed on with a substitution / mapping of variables in the pattern to the corresponding terms in the triples. This variable substitution is used to check for consistent variable bindings across beta nodes.

ObjectType Nodes and Working Memory

ObjectType nodes can be considered equivalent to the test for concept subsumption (in Description Logics), and therefore equivalent to the alpha node RDF pattern:

?member rdf:type ?klass.

'Classic' Alpha node patterns (the one below is taken directly from the original paper) map to multiple RDF-triple alpha node patterns:

(Expression ^Op X ^Arg2 Y)

Would be equivalent to the following triple patterns (the multiplicative factor comes from the fact that RDF assertions are limited to binary predicates):

  • ?Obj rdf:type Expression
  • ?Obj Op ?X
  • ?Obj Arg2 ?Y

Alpha Nodes

Alpha nodes correspond to patterns in rules, which can be

  1. Triple patterns in N3 rules
  2. N-ary functions.

Alpha node intra-element tests have a 'default' mechanism for matching triple patterns, or they exhibit the behavior associated with a registered set of N-ary functions - the core set coincides with those used by CWM/Euler/Pychinko (often called N3 built-ins). FuXi will support an extension mechanism for registering additional N-ary N3 functions by name, associating them with a Python function that implements the constraint in a fashion similar to SPARQL extension functions. N-ary functions are automatically propagated through the network so they can participate in beta node activation (inter-element testing in Rete parlance) with regular triple patterns, using the bindings to determine the arguments for the functions.

The default mechanism is for equality of non-variable terms (URIs).
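
A toy version of that intra-element test (illustrative only, not FuXi's actual data structures): ground terms must match exactly, while variables yield the substitution that is propagated onward.

from rdflib import Namespace, URIRef, Variable
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def alpha_match(pattern, triple):
    # Return a variable-binding dict if the triple passes the test, else None
    bindings = {}
    for patternTerm, term in zip(pattern, triple):
        if isinstance(patternTerm, Variable):
            bindings[patternTerm] = term       # propagate the substitution
        elif patternTerm != term:              # default test: term equality
            return None
    return bindings

pattern = (Variable("member"), RDF.type, FOAF.Person)
triple = (URIRef("http://example.org/KingWen"), RDF.type, FOAF.Person)
print(alpha_match(pattern, triple))  # binds ?member to <http://example.org/KingWen>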

Beta Nodes

Beta nodes are pretty much verbatim Rete, in that they check for consistent variable substitution between their left and right memories. This can be considered similar to the unification routine common to both forward- and backward-chaining algorithms which make use of the Generalized Modus Ponens rule. The difference is that the sentences aren't being made to look the same; rather, the existing variable substitutions are checked for consistency. Perhaps there is some merit in this similarity that would make using a Rete network to facilitate backward chaining and proof generation an interesting possibility, but that has yet to be seen.
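
The corresponding inter-element check, again only a sketch of the idea: a beta node joins a binding from its left memory with one from its right memory only when the two substitutions agree on every shared variable.

def consistent(left, right):
    # The substitutions must agree wherever they bind the same variable
    return all(right[var] == val for var, val in left.items() if var in right)

def join(left_memory, right_memory):
    # Yield the merged substitutions that survive the consistency check
    for left in left_memory:
        for right in right_memory:
            if consistent(left, right):
                merged = dict(left)
                merged.update(right)
                yield merged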

Terminal Nodes

These correspond to the end of the LHS of an N3 rule and are associated with the RHS. When 'activated', they 'fire' the rule, apply the propagated variable substitution, and add the newly inferred triples to the network and to the source graph.

Testing and Visualizing RDF/N3 Rete Networks

I've been able to adequately test the compilation process (the first of two parts in the original algorithm) using a visual aid. I've been developing a library for generating Boost Graph Library DiGraphs from RDFLib graphs, called Kaleidos. The value is in generating GraphViz diagrams, as well as in access to a whole slew of graph heuristics / algorithms that could be infinitely useful for RDF graph analysis and N3 rule network analysis:

  • Breadth First Search
  • Depth First Search
  • Uniform Cost Search
  • Dijkstra's Shortest Paths
  • Bellman-Ford Shortest Paths
  • Johnson's All-Pairs Shortest Paths
  • Kruskal's Minimum Spanning Tree
  • Prim's Minimum Spanning Tree
  • Connected Components
  • Strongly Connected Components
  • Dynamic Connected Components (using Disjoint Sets)
  • Topological Sort
  • Transpose
  • Reverse Cuthill Mckee Ordering
  • Smallest Last Vertex Ordering
  • Sequential Vertex Coloring

Using Kaleidos, I'm able to generate visual diagrams of Rete networks compiled from RDF/OWL/N3 rule sets.

However, the heavy cost of using BGL is the compilation process for BGL and BGL-Python, which is involved if done from source.

Chimezie Ogbuji

via Copia