Using Amara's pushtree for heavyweight XML processing in GRDDL and SPARQL querying

I’ve been using Amara to address my high-throughput needs for Extract Transform Load (ETL), querying, and processing of large amounts of RDF. In one particular part of the larger process, I needed to be able to stream very large XML documents in a particular dialect into RDF/XML. I sent an email to the Akara Google group describing the challenges and my thoughts behind wanting to use a streaming XML paradigm rather than XSLT.

I basically want to leverage Amara’s pushtree and its use of coroutines as a minimal-overhead pipeline for dispatching events triggered by elements in the source XML, where the source XML is a GRDDL source document and the pushtree coroutine is the transformation property. That task is still a work in progress; in the interest of expedience I went forward with XSLT, but I still need to try out some of what Uche suggested.

The other part, where I have made much more progress, is in streaming the results of SPARQL queries (against a SPARQL service) into a CSV file from the command line with minimal overhead (also using Amara, pushtree, and coroutines). A recent set of changes to layercake-python modified the sparqler command-line to add an --endpoint option, which takes a SPARQL service URL. Other changes were made to the remote SPARQL service store to support this.

Also added was a new sparqlcsv script:

$ sparqlcsv --help
Usage: sparqlcsv [options] [SPARQLXMLFilePath]
Options:
 -h, --help            show this help message and exit
 -q QUOTECHAR, --quoteChar=QUOTECHAR
                       The quote character to use
 -c, --count           Just count the results, do not serialize to CSV
 -d DELIMITER, --delimiter=DELIMITER
                       The delimiter to use
 -o FILEPATH, --output=FILEPATH
                       The path where to write the resulting CSV file

This script takes a SPARQL XML file either from the file indicated as the first argument or from STDIN if none is specified and writes out a CSV file to STDOUT or to a file. The general architectural idea is to build a bash pipeline from the SPARQL service to a CSV file (and eventually into a relational database for more sophisticated analysis) or to STDOUT for subsequent processing along the pipeline.

So, now I can run a query against Virtuoso and stream the CSV results into a file (with minimal processing overhead):

$ sparqler --owl=..OWL file.. --ns=..prefix..=..URL.. \
           --endpoint=..SPARQL service URL.. \
"SELECT ... { ... }" | sparqlcsv | .. subsequent processong ..

Where the namespaces in the OWL/RDF file (provided by the --owl option) and those given explicitly via the --ns option are added as namespace prefix definitions at the top of the SPARQL query that is dispatched to the remote SPARQL service located at the URL provided to the --endpoint option. Alternatively, the -o option can be used to specify a filename to which the CSV content is written.

The sparqlcsv script uses a pushtree coroutine to stream XML content into a CSV file in this way:

# Note: U and pushtree are Amara utilities; coroutine, Counter, and
# entity_base are helpers defined elsewhere in the sparqlcsv script.
import sys

def produce_csv(doc, csvWriter, justCount):
    cnt = Counter()
    @coroutine
    def receive_nodes(cnt):
        # Receives one 'result' element at a time via target.send(..)
        while True:
            node = yield
            if justCount:
                cnt.counter += 1
            else:
                rt = []
                badChars = False
                for binding in node.binding:
                    try:
                        rt.append(U(binding).encode('ascii'))
                    except UnicodeEncodeError:
                        # Fall back to a lossy ASCII encoding and note the loss
                        rt.append(U(binding).encode('ascii', 'ignore'))
                        badChars = True
                        print >> sys.stderr, "Skipping character", U(binding)
                if badChars:
                    cnt.skipCounter += 1
                csvWriter.writerow(rt)
    target = receive_nodes(cnt)
    # Dispatch every 'result' element in the SPARQL XML results to the coroutine
    pushtree(doc, u'result', target.send, entity_factory=entity_base)
    target.close()
    return cnt

Where doc is the SPARQL XML result content (as a string), csvWriter is a csv writer instance, and the last parameter indicates whether only the size of the solution sequence is counted rather than serializing the results to CSV.
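For illustration only (this is not the script's actual option handling), the function above can be wired into a pipeline roughly as follows, assuming the helpers from the listing are in scope:

import csv
import sys

if __name__ == '__main__':
    sparqlXML = sys.stdin.read()                  # SPARQL XML results arriving over the pipe
    writer = csv.writer(sys.stdout, delimiter=',', quotechar='"')
    produce_csv(sparqlXML, writer, justCount=False)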

My Thoughts after 7 Years of Being a Web-based Patient Registry Architect

Monday January 17th, 2011 is my last day at the Cleveland Clinic, where I was Lead Systems Analyst for 7 years working on a very exciting project whose goal was to replace a relational Cardiovascular Information Registry that supported research in cardiovascular medicine and surgery, and to consolidate data management for research purposes in the surgery department.  Eventually, our Clinical Investigation unit became part of the Heart and Vascular Institute.

The long-term goal was to create a framework for context-free data management systems in which expert-provided, domain-specific knowledge is used to control all aspects of data entry, storage, display, retrieval, communication, and formatting for external systems.  By ‘context-free’, I mean that the framework can be used for any domain (even outside of medicine) and nothing about the domain is assumed or hardcoded.  The use of metadata was envisioned as key to facilitating this capability and to this end the use of RDF was effective as a web and logic-based knowledge representation.

At the time, I was unemployed soon after the post-9/11 dot-com economic and innovation bubble burst, a period of great risk aversion toward the emerging technologies of the time: XML, RDF, XSLT, Python, REST, etc.  I was lucky that in the city where I was born, a couple of miles (literally) from where I was born and where my mother worked, there was a great job opportunity under a mandate from our director for a de novo, innovative architecture.  I went for the interview and was fortunate to get the job.

Four years later (in the fall of 2007), it was deployed for production use by nurses, coders, and researchers, built on top of the predecessor of Akara, an XML & RDF content repository:

The diagram above documents the application server and services stack that is used by the various components that interface with the repository.  The web services paradigm was never used; most of the service-oriented architecture was pure HTTP with XML (i.e., POX/HTTP).

Mozilla’s Firefox with the XForms extension was used for about 5 years for the complete collection of longitudinal patient record content for a little over 200,000 patients who had operations with cardiac(-thoracic) surgical components in a registry.  The architectural methodology was such that the use and deployment of infrastructure (entire data collection screens, XML and RDF schemas, data transformations, etc.) was significantly automated.  W3C document management and semantic web representation standards (HTTP, RDF, XML, N3, OWL, and SPARQL) were used to ensure interoperability, extensibility, and automation, in particular as infrastructure for both (certified) quality measure reporting and a clinical research repository.

I wanted to take the time to coalesce my experience in working on this project and share some of the key desiderata, architectural constraints (that perhaps comprise an architectural style), and opportunities in using these emerging technologies to address the primary engineering challenges of clinical research repositories and patient registries.  The following functionalities stood out as very implementable for such an approach: 

  • patient record abstraction (data entry via knowledge-generated input screens)
  • workflow management
  • data quality management
  • identification of patient cohorts for study
  • data export for statistical analysis

Patient Record Abstraction

XForms brings a rich history of browser-based data entry to bear as a comprehensive, declarative syntax that is part of a new architectural paradigm and works well with the architectural style of the World Wide Web: REST.  It abstracts widgets, controls, data bindings, logic, remote data management, integration into a host language, and other related rich internet application requirements.

I had to check the Wayback Machine, but I found the abstract of the 2006 presentation: “The Essence of Declarative, XML-based Web Applications: XForms and XSLT”.  There I discussed best practices, common patterns, and pitfalls in using XSLT as a host language for generating web-based user interfaces expressed in XForms.  The XForms-generating infrastructure was quite robust: everything from screen placement and behavior to range checking, drop-down lists, and data-field dependencies was described in an XML document written using a microformat designed for templating user interfaces for patient registry data collection.

All in all (and I hope John writes about this at some point, since he was the primary architect of the most advanced manifestation of this approach), the use of a microformat for documenting and generating an XForms framework for controlled, validated, form-based data collection proved more than adequate, allowing the (often unpredictable) requirements of national registry reporting and our clinical studies to dictate automatically deployed changes to a secure (access-controlled) patient record web portal and repository.

Regarding quality management, in December 2007 John and I were supposed to present at XML 2007 about how

XForms provides a direct interface to the power of XML validation tools for immediate and meaningful feedback to the user about any problems in the data. Further, the use of validation components enables and encourages the reuse of these components at other points of entry into the system, or at other system boundaries.

I also found the details of the session (from the online conference schedule) on the Wayback Machine from the XML 2007 conference site.  The presentation was titled: “Analysis of an architecture for data validation in end-to-end XML processing systems”.

Client-side XForms constraint mechanisms as well as server-side Schematron validation were the basis for quality management at the point of data entry (historically the most robust place to address errors in medical record content).  Altogether, the Mozilla browser platform presents quite an opportunity to offload sophisticated content management capabilities from the server to the client, and XML processing and pipelining played a major role in this regard.

Workflow Management

See: A Role for Semantic Web Technologies in Patient Record Data Collection, where I described my chapter in the Linked Enterprise Data book regarding the use of semantic web technologies to facilitate patient record data collection workflow.  In the Implementation section, I also describe how declarative AJAX frameworks such as Simile Exhibit were integrated for use in faceted browsing and visualization of patient record content.  RDF worked well as the state machine of a workflow engine.  The digital artifacts involved in the workflow application and the messages sent back and forth between browser and server are XML and JSON documents.  The content in the documents is mirrored into an RDF dataset describing the state of each workflow task, which can be queried when listing the workflow tasks associated with the currently logged-in user (for instance).
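A minimal sketch of that pattern using RDFLib, with a hypothetical workflow vocabulary standing in for the registry's actual terms:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

WF = Namespace("http://example.org/workflow#")   # hypothetical vocabulary

g = Graph()
task = URIRef("http://example.org/tasks/42")
g.add((task, RDF.type, WF.AbstractionTask))
g.add((task, WF.assignedTo, Literal("jdoe")))
g.add((task, WF.state, WF.InProgress))

# List the in-progress tasks assigned to the currently logged-in user
for row in g.query("""
    PREFIX wf: <http://example.org/workflow#>
    SELECT ?task WHERE {
        ?task wf:assignedTo "jdoe" ;
              wf:state wf:InProgress .
    }"""):
    print(row.task)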

This is an archetype of an emerging pattern where XML is used as the document and messaging syntax and a semantics-preserving RDF rendering of the XML content (mirrored persistently in an RDF dataset) is used as the knowledge representation for inference and querying.  More on this archetype later.  

Identification of Patient Cohorts for Study

Eric Prud’hommeaux has been doing a lot of excellent infrastructure work (see: SWObjects) in the Semantic Web for Healthcare and Life Sciences Interest Group around the federated use of SPARQL for querying structured, discrete EHR data in a meaningful way and in a variety of contexts: translational research, clinical observations interoperability, etc.  His positive experience has mirrored ours in the use of SPARQL as the protocol for querying a patient outcome registry and clinical research database.

However, we had the advantage of already having the data natively available as a collection of RDF graphs (each of which describes a longitudinal patient record) from the registry.  A weekly ETL-like process recreates an RDF dataset from the XML collection and serves as the snapshot data warehouse for the operational registry, which relies primarily on XML payloads for its documentation, data collection, and inter-system messaging needs.  Other efforts have been more focused on making extant relational EHR data available via SPARQL.

This was the backdrop to the research we did with Case Western Reserve University Ph.D. students in the summer and fall of 2008 on the efficient use of relational algebra to evaluate the SPARQL language in its entirety.

GRDDL Mechanisms and Dual Representation

Probably the most substantive desideratum, or opportunity, in patient record systems (and registries) of the future is as hard to articulate as it is to (frankly) appreciate and understand.  However, I recently rediscovered the GRDDL use case that does a decent job of describing how GRDDL mechanisms can be used to address the dual syntactic and semantic interoperability challenges in patient registry and record systems.

XML is the ultimate structured document and messaging format, so it is no surprise that it is the preferred serialization syntax of the de facto messaging and clinical documentation standard in healthcare systems.  There are other lightweight alternatives (JSON), but it is still the case that XML is the most pervasive standard in this regard.  However, XML is challenged in its ability to specify the meaning of its content in a context-free way (i.e., semantic interoperability).  Viewing RDF content as a rendering of this meaning (a function of the structured content) and viewing an XML vocabulary as a special-purpose syntax for RDF content is a very useful way to support (all in the same framework):

  • structured data collection
  • standardized system-to-system messaging
  • automated deductive analysis and transformation
  • structural constraint validation

In short, XML facilitates the separation of presentation from content and semantic-preserving RDF renderings of XML facilitate the separation of syntax from semantics.  

The diagram above demonstrates how this separation acts as the modern version of the traditional boundary between transactional systems and their data warehouses in relational database systems.  In this updated paradigm, however, XML processing is the framework for the former, RDF processing is the framework for the latter, and XSLT (or any similar transform algorithm) is the ETL framework.

In the end, I think it is safe to say that RDF and XSLT are very useful for facilitating semantic and syntactic automation in information systems.  I wish I had presented or written more on this aspect, but I did find some older documents on the topic along with the abstract of the presentation I gave with William Stelhorn during the first Semantic Technologies Conference in 2005:


The diagram above illustrates how a domain model (described in an RDF vocabulary) can be used by an automated component to generate semantic and syntactic schemas (OWL and RELAX NG, respectively) as well as GRDDL transforms (as XSLT) that faithfully render the structured content in a machine-understandable knowledge representation (RDF).  In this way, much of the data management tooling and infrastructure can be managed by tweaking declarative documentation (not programming code) and automatically generating the infrastructure compiled specifically for a particular domain.

The Challenges

I have mostly mentioned where things worked well.  There were certainly significant challenges to our project, but most of them were not technical in nature.  This seems to be a recurring theme in medical informatics.

The main technical challenge has to do with support in RDF databases, or triplestores, for an appreciable amount of write as well as read operations.  This shortcoming will be particularly emphasized as the emerging SPARQL 1.1 specifications covering write operations become W3C recommendations and are inevitably adopted.  Most contemporary relational schemas for RDF storage are highly normalized star schemas where there isn't a single fact table; rather (in the most advanced cases) the space is partitioned into separate tables, divided along the general, distinct kinds of statements you can have in an RDF graph: statements where the predicate is rdf:type, statements where the object is an RDF literal, and statements where the object is an RDF URI.  So it is more like a conjoined star schema, for lack of a better term.

This minimizes self-joins since the data space is partitioned amongst multiple tables and a query such as this:

SELECT ?MRN { ?PATIENT hasMedicalRecordNumber ?MRN }  

would only require (assuming we knew a priori that the range of the hasMedicalRecordNumber property is RDF literals) using the single table for RDF statements where the object is an RDF literal.  If an additional triple pattern is added that matches RDF statements where the object is a URI (or blank node), then this results in a join between the two tables rather than a self-join on a single massive table holding all the RDF statements.

In such an architecture, there are separate lookup tables that map internal identifiers (integers, perhaps) to their full lexical forms as URIs or RDF literals.  This enforces the reuse of terms so that space is not wasted when two RDF statements refer to the same URI: internally, they will refer to the same row in the RDF term lookup table.  This is often called string interning.
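A small sqlite3 sketch of the kind of partitioned layout described above (the table and column names are made up, not those of any particular store); note that the single-pattern query touches only the literal-statements table plus the term lookup table:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- interned RDF terms (URIs, blank nodes, literals) keyed by integer id
    CREATE TABLE terms (id INTEGER PRIMARY KEY, lexical TEXT UNIQUE);

    -- the "conjoined star schema": statements partitioned by kind
    CREATE TABLE type_stmts    (subject INTEGER, klass INTEGER);
    CREATE TABLE literal_stmts (subject INTEGER, predicate INTEGER, object INTEGER);
    CREATE TABLE uri_stmts     (subject INTEGER, predicate INTEGER, object INTEGER);
""")

# SELECT ?MRN { ?PATIENT hasMedicalRecordNumber ?MRN } needs only literal_stmts,
# joined against the lookup table to recover lexical forms -- no self-join.
cur.execute("""
    SELECT mrn.lexical
    FROM literal_stmts s
    JOIN terms p   ON p.id = s.predicate
    JOIN terms mrn ON mrn.id = s.object
    WHERE p.lexical = 'http://example.org/hasMedicalRecordNumber'
""")
print(cur.fetchall())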

However, the downside to this kind of normalization, which is heavily optimized for use cases that primarily involve read access (i.e., data warehouse or OLAP scenarios), is that it does not cope well when updates need to be made to part of an RDF graph in an RDF dataset, or to the entire graph.  If an RDF statement is removed and it is the only one that references a particular RDF URI, that URI will need to be garbage collected, or removed from the RDF term lookup table.  It is such optimizations for OLAP behavior that almost require that high-volume updates to RDF content happen as massive ETL jobs, where the entire RDF collection of patient record content is replaced, rather than doing so one patient record graph at a time.

This fundamental challenge is the same reason why (some time back) the content repository underlying SemanticDB switched from using RDF to manage the state of the repository (file modification dates, internet media types associated with artifacts, etc.) to using cached, in-memory, pre-parsed XML.  It is also the same reason why we didn't use this capability to allow modifications to patient record documents (via XForms data collection screens) to be immediately reflected into the underlying RDF dataset.  In both cases, writing to the RDF dataset became the prominent bottleneck as the RDF database was not able to support Online Transactional Processing (OLTP).

There was an additional semi-technical challenge that had (and still has) to do with the lack of a small, broad, uniform, upper ontology of clinical medicine.  It certainly needs to be much smaller than SNOMED-CT while still able to cover the same material, hence my description of it as an upper ontology.  There are at least two projects I know of regarding this:

The latter is my project and was motivated (from the beginning) by this exact problem.  The other problems are massive but not technical.  The two primary ones (in my opinion) are, on the one hand, the lack of understanding of how to conceive and develop a business strategy around open source, open communities, open standards, and open data that relies more on services and the competitive advantage of using emerging technologies where appropriate, and that leverages the catalyzing power of web 2.0 / 3.0 technology and social media; and, on the other hand, the need for better communication of the merits of semantic web technologies in addressing the engineering and infrastructure challenges of healthcare information systems.  Most of the people who could benefit from better communication in this regard are significantly risk-averse to begin with (another problem in its own right).

An example of this is that, in my opinion, the only new innovation that semantic web technologies bring to the table is the symbiotic combination of the architecture of the World Wide Web with knowledge representation.  The use of logic and ontologies to address semantic interoperability challenges predates the semantic web (and even predates Description Logic).  By being precise in describing this difference you can also be precise about describing the value, with very little risk of overstating it.

Any problem domain where the meaning of data plays a major role will benefit from traditional deductive databases and expert systems (such as Prolog or business rule systems, respectively) as easily as it would from semantic web technologies.  However, a problem domain where linking data, identifying concepts and digital artifacts in a universal and reusable way, and leveraging web-based infrastructure is a major factor will benefit from semantic web technologies in a way that it wouldn't from the traditional alternatives.  This simplification of the value proposition message (for consumption by risk-averse laypeople) also helps to sharpen the distinctions between the markets that a business strategy can target, as well as the engineering problems these emerging technologies should (and should not) attempt to address.  A sharper, more straightforward message is needed to break the political and generational barriers that retard the use of potentially transformational technologies in a field that is a major contributor to the economic instability of this country.

Many of these technological opportunities transfer directly over for use in Patient Controlled Health Records (PCHR) systems.  I also think much of the risk aversion associated with the atmosphere I found myself in after leaving Fourthought (for instance) and generally in large institutions contributes to why the very evident opportunities in leveraging rich web application (“web 2.0”) and semantic web infrastructure (“web 3.0”) have not had as much penetration in healthcare information technology as one would expect.

My new job is as a Senior Research Associate with the Center for Clinical Investigation in the Case Western School of Medicine.  I will be managing Clinical and Translational Science Collaboration (CTSC) clinical, biomedical, and administrative informatics projects as well as designing and developing the informatics infrastructure that supports them.  Coupled with being a part-time Ph.D. student, I will still essentially be doing clinical research informatics (i.e., the use of informatics to facilitate biomedical and health research); however, the focus will be on translational research: translating the findings of basic research more quickly and efficiently into medical practice and meaningful health outcomes, whether physical, mental, or social.  So, I imagine the domain will be closer to the biology end of the spectrum and there will be more of an explicit emphasis on collaboration.

I have thoroughly enjoyed my time working on such a challenging project, in such a world-renowned institution, and under such a visionary mandate.  I was privileged to be able to represent the Cleveland Clinic in the W3C through the various working groups developing standards relevant to the ways in which we were leveraging semantic web technologies:

  • Semantic Web for Healthcare and Life Sciences Interest Group
  • Data Access Working Group
  • GRDDL Working Group 

If not for the exposure at CCF to the great challenges in medical informatics and the equally great opportunities to address them, I would probably never have considered going back to seek a Ph.D. in this field.  Although I will be working with a different institution, it will still essentially be in the University Circle area and only about a 20-minute walk from where I was at the Cleveland Clinic.  I'm very proud of what we were able to do and I'm looking forward to the future.

A Content Repository API for Rich, Semantic Web Applications?

[by Chimezie Ogbuji]

I've been working with roll-your-own content repositories long enough to know that open standards are long overdue.

The slides for my Semantic Technology Conference 2007 session are up: "Tools for the Next Generation of CMS: XML, RDF, & GRDDL" (OpenOffice) and (PowerPoint).

This afternoon, I merged documentation of the 4Suite repository from old bits (and some new) into a Wiki that I hope to contribute to (every now and then).
I think there is plenty of mature, supporting material upon which a canon of best practices for XML/RDF CMSes can be written, with normative dependencies on:

  • GRDDL
  • XProc
  • Architecture of the World Wide Web, Volume One
  • URI RFCs
  • Rich Web Application Backplane
  • XML / XML Base / XML Infoset
  • RDDL
  • XHTML 1.0
  • SPARQL / Versa (RDF querying)
  • XPath 2.0 (JSR 283 restriction) with 1.0 'fallback'
  • HTTP 1.0/1.1, ACL-based HTTP Basic / Digest authentication, and a convention for Web-based XSLT invocation
  • Document/graph-level ACL granularity

The things that are missing:

  • RDF equivalent of DOM Level 3 (transactional, named graphs, connection management, triple patterns, ... ) with events.
  • A mature RIF (there is always SWRL, Notation 3, and DLP) as a framework for SW expert systems (and sentient resource management)
  • A RESTful service description to complement the current WSDL/SOAP one

For a RESTful service description, RDF Forms can be employed to describe transport semantics (to help with Agent autonomy), or a mapping to the Atom Publishing Protocol (and thus a subset of GData) can be written.

In my session, I emphasized how closely JSR 283 overlaps with the 4Suite Repository API.

The delta between them mostly has to do with RDF, other additional XML processing specifications (XUpdate, XInclude, etc.), ACL-based HTTP authentication (basic, and sessions), HTTP/XSLT bindings, and other miscellaneous bells and whistles.

Chimezie Ogbuji

via Copia

Planet Atom's Information Pipeline

The hosting of Planet Atom has moved (perhaps temporarily) over to Athena: Christopher Schmidt's excellent hosting service.
Metacognition is hosted there. In addition, the planetatom source was extended to support additional representational modes for the aggregated Atom feed: GRDDL, RDFa, Atom OWL RDF/XML (via content negotiation), and Exhibit.

The latter was the subject of my presentation at XTech 2007. As I mentioned during my session, you can go to http://planetatom.net/exhibit to see the live faceted-browsing of the aggregated Atom feed. An excerpt from the Planet Atom front page describes the nature of the project:

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, John L. Clark, and Chimezie Ogbuji.

I wrote previously (in my XML 2006 paper) on this methodology of splicing out multiple (disjoint) representations from a single XML source and the Exhibit mode is yet another facet: specifically for quick, cross-browser, filtering of the aggregated feed.

Planet Atom Pipeline

The Planet Atom pipeline as a whole is pretty interesting. First, an XBEL bookmark document is used as the source for aggregation. RESTful caching minimizes load on the sources during aggregation. The aggregation groups the entries by calendar groups (months and days). The final aggregated feed is then sent through several XML pipelines to generate the JSON which drives the Exhibit view, an HTML version of the front page, an XHTML version of the front page (one of the prior two is served to the client depending on the kind of agent which requested the front page), and an RDF/XML serialization of the feed expressed in Atom OWL.

Note in the diagram that a Microformat approach could have been used instead to embed the Atom OWL statements. RDFa was used instead as it was much easier to encode the statements in a common language and not contend with adding profiles for each Microformat used. Elias's XTech 2007 presentation touched a bit on this contrast between the two approaches. In either case, GRDDL is used to extract the RDF.

These representations are stored statically at the server and served appropriately via a simple CherryPy module. As mentioned earlier, the XHTML front page now embeds the Atom OWL assertions about the feed (as well as assertions about the sources, their authors, and the Planetatom developers) in RDFa and includes hooks for a GRDDL-aware agent to extract a faithful rendition of the feed in RDF/XML. The same XML pipeline which extracts the RDF/XML from the aggregated feed is identified as a GRDDL transform. So, the RDF can either be fetched via content negotiation or by explicit GRDDL processing.
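A rough sketch of what such a CherryPy module might look like; the file names and the naive Accept-header matching are illustrative assumptions, not the actual Planet Atom code:

import cherrypy

# Pre-generated static renderings produced by the XML pipelines described above
REPRESENTATIONS = [
    ('application/rdf+xml', 'feed.rdf'),      # Atom OWL RDF/XML
    ('application/atom+xml', 'feed.atom'),    # aggregated Atom feed
    ('application/xhtml+xml', 'index.xhtml'), # XHTML+RDFa front page
    ('text/html', 'index.html'),              # plain HTML front page
]

class PlanetFront(object):
    @cherrypy.expose
    def index(self):
        accept = cherrypy.request.headers.get('Accept', 'text/html')
        for media_type, filename in REPRESENTATIONS:
            if media_type in accept:              # naive content negotiation
                cherrypy.response.headers['Content-Type'] = media_type
                return open(filename, 'rb').read()
        return open('index.html', 'rb').read()    # default representation

if __name__ == '__main__':
    cherrypy.quickstart(PlanetFront())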

Unfortunately, the RDFa is broken. RDFa can be extracted either by an explicit parser (which is how Elias Torrez's Python-based RDFa parser, his recent work on Operator, and Ben Adida's RDFa bookmarklets work) or via XSLT as part of a GRDDL mechanism. Due to a quirk in the way RDFa uses namespace declarations (which may or may not be a necessary evil), the various vocabularies used in the resulting RDF/XML are not properly expanded from the CURIEs to their full URI form. I mentioned this quirk to Steven Pemberton.

As it happens, the stylesheet which transforms the aggregated Atom feed into the XHTML host document defines the namespace declarations:

xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xmlns:foaf="http://xmlns.com/foaf/0.1/" 
  xmlns:aOwl="http://bblfish.net/work/atom-owl/2006-06-06/#" 
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

However, since there are not any elements which use QNames formed from these declarations, they are not included in the XSLT result! This trips up the RDFa -> RDF/XML transformation (written by Fabien Gandon, a fellow GRDDL WG member) and results in RDF statements where the URIs are simply the CURIEs as originally expressed in the RDFa. This seems to only be a problem for XSLT processors which explicitly strip out unused namespace declarations. They have a right to do this, as it has no effect on the underlying infoset. Something for the RDF-in-XHTML task group to consider, especially for scenarios such as this where the XHTML+RDFa is not hand-crafted but produced from an XML pipeline.

[Uche Ogbuji]

via Copia

Musings of a Semantic / Rich Web Architect: What's Next?

I'm writing this on my flight back from XTech 2007, Paris, France. This gives me a decent block of time to express some thoughts and recent developments. This is the only significant time I've had in a while to do any writing.
My family

Between raising a large family, software development / evangelism, and blogging I can only afford to do two of these. So, blogging loses out consistently.

My paper (XML-powered Exhibit: A Case Study of JSON and XML Coexistence) is now online. I'll be writing a follow-up blog on how http://planetatom.net demonstrates some of what was discussed in that paper. I ran into some technical difficulties with projecting from Ubuntu, but the paper covers everything in detail. The slides are here.

My blog todo list has become ridiculously long. I've been heads-down on a handful of open source projects (mostly semantic web related) when I'm not focusing on work-related software development.
Luckily there has been a very healthy intersection of the open source projects I work on and what I do at work (and have been doing non-stop for about 4 years). In a few cases, I've spun these 'mini-projects' off under an umbrella project I've been working on called python-dlp. It is meant (in the end) to be a toolkit for semantic web hackers (such as myself) who want to get their hands dirty and have an aptitude for Python. There is more information on the main python-dlp page (linked above).

sparql-p evaluation algorithm

Some of the other things I've been working on I'd prefer to submit to appropriate peer-reviewed outlets, considering the amount of time I've put into them. First, I really would like to do a 'proper' write-up on the map/reduce approach for evaluating SPARQL Algebra expressions and the inner mechanics of Ivan Herman's sparql-p evaluation algorithm. The latter is one of those hidden gems I've become closely familiar with over time that I would very much like to examine in a peer-reviewed paper, especially if Ivan is interested in doing so in tandem =).

Since joining the W3C DAWG, I've had much more time to get even more familiar with the formal semantics of the Algebra and how to efficiently implement it on top of sparql-p to overcome the original limitation on the kinds of patterns it can resolve.

I was hoping (also) to release and talk a bit about a SPARQL server implementation I wrote in CherryPy / 4Suite / RDFLib for those who may find it useful as a quick and dirty way to contribute to the growing number of SPARQL endpoints out there. A few folks in irc:///freenode.net/redfoot (where the RDFLib developers hang out) have expressed interest, but I just haven't found the time to 'shrink-wrap' what I have so far.

On a different (non-sem-web) note, I spoke some with Mark Birbeck (at XTech 2007) about my interest in working on a 4Suite / FormsPlayer demonstration. I've spent the better part of 3 years working on FormsPlayer as a client-side platform for XML-driven applications served from a 4Suite repository and I've found the combination quite powerful. FormsPlayer (and XForms 1.1 specifically) is really the icing on the cake which takes an XML / RDF Content Management System like the 4Suite repository and turns it into a complete platform for deploying next generation rich web applications.

The combination is a perfect realization of the Rich Web Application Backplane (a recurring theme in my last two presentations / papers) and it is very much worth noting that some of the challenges / requirements I've been able to address with this methodology simply cannot be reproduced in any other approach: neither vanilla DHTML, .NET, J2EE, Ruby on Rails, Django, nor Jackrabbit. The same is probably the case with Silverlight and Apollo.

In particular, when it comes to declarative generation of user interfaces, I have yet to find a more complete approach than via XForms.

Mark Birbeck's presentation on Skimming is a good read (slides / paper is not up yet) for those not quite familiar with the architectural merits of this larger methodology. However, in his presentation eXist was used as the XML store and it struck me that you could do much more with 4Suite instead. In particular, as a CMS with native support for RDF as well as XML it opens up additional avenues. Consider extending Skimming by leveraging the SPARQL protocol as an additional mode of expressive communication beyond 'vanilla' RESTful operations on XML documents.

These are very exciting times as the value proposition of rich web (I much prefer this term over the much beleaguered Web 2.0+) and semantic web applications has fully transitioned from vacuous / academic musings to concretely demonstrable in my estimation. This value proposition is still not being communicated as well as it could, but having bundled demos can bridge this gap significantly in my opinion; much more so than just literature alone.

This is one of the reasons why I've been more passionate about doing much less writing / blogging and more hands-on hacking (if you will). The original thought (early on this year) was that I would have plenty to write about towards the middle of this year and time spent discussing the ongoing work would be premature. As it happens, things turned out exactly this way.

There is a lesson to be learned from how the Joost project progressed to where it is. The approach of talking about deployed / tested / running code has worked perfectly for them. I don't recall much public dialog about that particular effort until very recently, and now they have running code doing unprecedented things and the opportunity (I'm guessing) to switch gears to do more evangelism with a much more effective 'wow' factor.

Speaking of wow, I must say of all the sessions at XTech 2007, the Joost session was the most impressive. The number of architectures they bridged, the list of demonstrable value propositions, the slick design, the incredibly agile and visionary use of the most appropriate technology in each case, etc., add up to an absolutely stunning achievement.

The fact that they did this all while remembering their roots: open standards, open source, open communities, leaves me with a deep sense of respect for all those involved in the project. I hope this becomes a much larger trend. Intellectual property paranoia and a cloak-and-dagger competitive edge are things of the past in today's software problem-solving landscape. It is a ridiculously outdated mindset and I hope those who can effect real change (those higher up in their respective ORG charts than the enthusiastic hackers) in this regard are paying close attention. Oh boy. I'm about to launch into a rant, so I think I'll leave it at that.

The short of it is that I'm hoping (very soon) to switch gears from heads-down design / development / testing to much more targeted write-ups, evangelism, and such. The starting point (for me) will be Semantic Technology Conference in San Jose. If the above topics are of interest to you, I strongly suggest you attend my colleague's (Dr. Chris Pierce) session on SemanticDB (the flagship XML & RDF CMS we've been working on at the Clinic as a basis for Computerized Patient Records) as well as my session on how we need to pave a path to a new generation of XML / RDF CMSes and a few suggestions on how to go about paving this path. They are complementary sessions.

Jackrabbit architecture

JSR 170 is a start in the right direction, but the work we've been doing with the 4Suite repository for some time leaves me with the strong, intuitive impression that CMSes that have a natural (and standardized) synthesis with XML processing are only half the step towards eradicating the stronghold that monolithic technology stacks have over those (such as myself) with 'enterprise' requirements that can truly only be met with the newly emerging sets of architectural patterns: Semantic / Rich Web Applications. This stronghold can only be eradicated by addressing the absence of a coherent landscape with peer-reviewed standards. Dr. Macro has an incredibly visionary series of 'write-ups' on XML CMSes that paints a comprehensive picture of some best practices in this regard:

However (as with JSR 170), there is no reason why there isn't a bridge or some form of synthesis with RDF processing within the confines of a CMS.

There is no good reason why I shouldn't be able to implement an application which is written against an abstract API for document and knowledge management irrespective of how this API is implemented (this is very much aligned with the larger goal of JSR 170). There is no reason why the 4Suite repository should be the only available infrastructure for supporting both XML and RDF processing in (standardized) synthesis.

I should be able to 'hot-swap' RDFLib with Jena or Redland, 4Suite XML with Saxon / Libxml / etc.., and the 4Suite repository with an implementation of a standard API for synchronized XML / RDF content management. The value of setting a foundation in this arena is applicable to virtually any domain in which a CMS is a necessary first component.

Until such a time, I will continue to start with 4Suite repository / RDFLib / formsPlayer as a platform for Semantic / Rich Web applications. However, I'm hoping (with my presentation at San Jose) to paint a picture of this vacuum with the intent of contributing towards enough of a critical mass to (perhaps) start putting together some standards towards this end.

Chimezie Ogbuji

via Copia

Where does the Semantic Web converge with the Computerized Patient Record?

I've been thinking a lot about the "Computer-based Patient Record: CPR", an acronym as unlikely as GRDDL but, once again, a methodology expressed as an engineering specification. In both cases, the methodology is a mouthful, but a coherent architectural "style" and requires a mouthful of words to describe. Other examples of this:

  • Representational State Transfer
  • Rich Web Application Backplane
  • Problem-oriented Medical Record
  • Gleaning Resource Descriptions from Dialects of Languages

The term itself was coined (I think) by the Institute of Medicine [1]. If you are in healthcare and are motivated by the notion of using technology to make healthcare as effective and inexpensive as possible, you should do the Institute a favor and buy the book:

Institute of Medicine, The Computer-Based Patient Record: An Essential Technology for Health Care - Revised Edition, 1998, ISBN: 0309055326.

I've written some recent slides that are on the W3C ESW 'wiki' which all have something to do with the idea in one way or another:

The nice thing about working in a W3C Interest Group is that the work you do is for the general public's benefit, so it is a manifestation of the W3C notion of the Semantic Web, which primarily involves a human social process.

Sorta like a technological manifestation of our natural Darwinian instinct.

That's how I think of the Semantic Web, anyways: as a very old, living thread of advancements in Knowledge Representation which intersected with an anthropological assessment of some recent web architecture engineering standards.

Technology is our greatest contribution, and so it should only make sense that where we use it to better our health it should not come at a cost to us. The slides reference and include a suggested OWL-sanctioned vocabulary for basically implementing the Problem-oriented Medical Record (a clinical methodology for problem solving).

I think the idea of a free (as in beer) vocabulary for people who need healthcare has an interesting intersection with the pragmatic parts of the Semantic Web (avoiding the double quotes) vision. I have exercise-induced asthma (or was "diagnosed" as such when I was younger). I still ran Track-and-Field in high school and was okay after an initial period where my lungs had to work overtime. I wouldn't mind hosting RDF content about such a "finding" if it was for my personal benefit that a piece of software could do something useful for me in an automated, deterministic way.

"HL7 CDA" seems to be a freely avaiable, well-organized vocabulary for describing messages dispatched between hospital systems. And I recently wrote a set of XSLT templates which extract predicate logic statemnts about a CDA document using the POMR ontology and the other freely available "foundational ontologies" it coordinates. The CDA document on xml.coverpages.org has a nice concise description of the technological merits of HL7 CDA:

The HL7 Clinical Document Architecture is an XML-based document markup standard that specifies the structure and semantics of clinical documents for the purpose of exchange. Known earlier as the Patient Record Architecture (PRA), CDA "provides an exchange model for clinical documents such as discharge summaries and progress notes, and brings the healthcare industry closer to the realization of an electronic medical record. By leveraging the use of XML, the HL7 Reference Information Model (RIM) and coded vocabularies, the CDA makes documents both machine-readable (so they are easily parsed and processed electronically) and human-readable so they can be easily retrieved and used by the people who need them. CDA documents can be displayed using XML-aware Web browsers or wireless applications such as cell phones..."

The HL7 CDA was designed to "give priority to delivery of patient care. It provides cost effective implementation across as wide a spectrum of systems as possible. It supports exchange of human-readable documents between users, including those with different levels of technical sophistication, and promotes longevity of all information encoded according to this architecture. CDA enables a wide range of post-exchange processing applications and is compatible with a wide range of document creation applications."

A CDA document is a defined and complete information object that can exist outside of a messaging context and/or can be a MIME-encoded payload within an HL7 message; thus, the CDA complements HL7 messaging specifications.

If I could put up a CDA document describing the aspects of my medical history that it is in my benefit to make freely available (at my discretion), I would do so in the event some piece of software could do some automated things for my benefit. Leveraging a vocabulary which essentially grounds an expressive variant of predicate logic in a transport protocol makes the chances that this happens very likely. The effect is as multiplicative as the human population.

The CPR specification is also very well engineered and much ahead of its time (it was written about 15 years ago). The only technological checkmark left is a uniform vocabulary. Consensus stands in the way of uniformity, so some group of people needs to be thinking about how the "pragmatic" and anthropological notions of the Semantic Web can be realized with a vocabulary about our personally controlled, public clinical content. Don't you think?

I was able to register the /cpr top level PURL domain and the URL http://purl.org/cpr/1.0/problem-oriented-medical-record.owl# resolves to the OWL ontology with commented imports to other very relevant OWL ontologies. Once I see a pragmatic demonstration of leaving owl:imports in a 'live' URL, I'll remove them. It would be a shame if any Semantic Web vocabulary terms came in conflict with a legal mandate which controlled the use of a vocabulary.

Chimezie Ogbuji

via Copia

Amara 1.2rc1

4Suite has been bumped to 1.0.2 with some important bug fixes. I also pushed Amara a step closer to 1.2 with a 1.2rc1 release. I'll make it 1.2 final some time this week, and then move on to some pretty big architectural changes for 2.0. All test reports are welcome, especially from Web server users. Jeremy might have figured out a workaround for the multiple-interpreter issue discussed in "multiple interpreters and extension modules". That should fix the remaining known problems with mod_python.

[Uche Ogbuji]

via Copia

Why JSON vs XML is a yawn

Strange spate of discussion recently about XML vs. JSON. On M. David Peterson's Weblog he states what I think is the obvious: there is no serious battle between XML and JSON. They're entirely complementary. Mike Champion responds:

The same quite rational response could be given about the "war" between WS-* and REST, but that has caused quintillions of electrons to change state in vain for the last 5 years or so. The fact remains that some people with a strong attachment to a given technology howl when it is declared to be less than universal. I completely agree that the metaphor of "keep a healthy tool chest and use the right one for the job at hand" is the appropriate response to all these "wars", but such boring pragmatism doesn't get Diggs or Pagerank.

If I may be so bold as to assume that "pragmatism" includes in some aspect of its definition "that which works", I see a bit of a "one of these things is not like the other" game (sing along, Sesame Street kids) in Mike's comparison.

  • XML - works
  • JSON - works
  • REST - works
  • WS-Kaleidoscope - are you kidding me?

Some people claim that the last entry works, but I've never seen any evidence beyond the "it happened to my sister's boyfriend's roommate's cousin" variety. On the other hand, by the time you click through any ten Web sites you probably have hard evidence that the other three work, and in the case of REST you have that evidence by the time you get to your first site.

For my part, I'm a big XML cheerleader, but JSON is great because it gives people a place to go when XML isn't really right. There are many such places, as I've often said ("Should Python and XML Coexist?", "Alternatives to XML", etc.). Folks tried from the beginning to make XML right for data as well as documents, and even though I think the effort made XML more useful than its predecessors, I think it's clear folks never entirely succeeded. XML is much better suited to documents and text than records and data. The effort to make it more suitable for data leads to the unfortunate likes of WXS (good thing there's RELAX NG) and RDF/XML (good thing there's Turtle). Just think about it: XQuery for JSON. Wouldn't it be so much simpler and cleaner than our XQuery for XML? Heck, wouldn't it be pretty much...SQL?

That having been said there is one area where I see some benefit to XQuery. Mixed-mode data/document storage is inevitable given XML's impressive penetration. XQuery could be a thin layer for extracting data feeds from these mixed-mode stores, which can then be processed using JSON. If the XQuery layer could be kept thin enough, and I think a good architect can ensure this, the result could be a very neat integration. If I had ab initio control over such a system my preference would be schema annotations and super-simple RDF for data/document integration. After all, that's a space I've been working in for years now, and it is what I expect to focus on at Kadomo. But I don't expect to always be so lucky. Then again, DITA is close enough to that vision that I can be hopeful people are starting to get it, just as I'm grateful that the development of GRDDL means that people in the Semantic Web community are also starting to get it.

On the running code front I've been working on practical ways of working nicely with XML and JSON in tandem. The topic has pervaded several aspects of my professional work all at once in the past few months, and I expect to have a lot of code examples and tools to discuss here on Copia soon.

[Uche Ogbuji]

via Copia

Validation on the rack

Mark Baker's "Validation considered harmful" touched off a fun series of responses.

[C]onsider the scenario of two parties on the Web which want to exchange a certain kind of document. Party A has an expensive support contract with BigDocCo that ensures that they’re always running the latest-and-greatest document processing software. But party B doesn’t, and so typically lags a few months behind. During one of those lags, a new version of the schema is released which relaxes an earlier stanza in the schema which constrained a certain field to the values “1”, “2”, or “3”; “4” is now a valid value. So, party B, with its new software, happily fires off a document to A as it often does, but this document includes the value “4” in that field. What happens? Of course A rejects it; it’s an invalid document, and an alert is raised with the human [administrator], dramatically increasing the cost of document exchange. All because evolvability wasn’t baked in, because a schema was used in its default mode of operation; to restrict rather than permit.

Upon reading this I had 2 immediate reactions:

  1. Yep. Walter Perry was going on about all this sort of thing a long time ago, and the industry would be in a much saner place without, for example, crazy ideas such as WS-Kaleidoscope and tight binding of documents to data records (read WXS and XQuery). For an example of how Perry absolutely skewered class-conscious XML using a scenario somewhat similar to Mark's, read this incisive post. To me the perils of bondage-and-discipline validation are as those of B&D datatyping. It's all one more example of the poor design that results when you follow twopenny Structured Programming too far and let early binding rule the cosmos.

  2. Yep. This is one of the reasons why once you use Schematron and actually deploy it in real-life scenarios where schema evolution is inevitable, you never feel sanguine about using a grammar-based schema language (not even RELAX NG) again.

Dare's response took me aback a bit.

The fact that you enforce that the XML documents you receive must follow a certain structure or must conform to certain constraints does not mean that your system cannot be flexible in the face of new versions. First of all, every system does some form of validation because it cannot process arbitrary documents. For example an RSS reader cannot do anything reasonable with an XBRL or ODF document, no matter how liberal it is in what it accepts. Now that we have accepted that there are certain levels of validation that are no-brainers the next question is to ask what happens if there are no constraints on the values of elements and attributes in an input document. Let's say we have a purchase order format which in v1 has an element which can have a value of "U.S. dollars" or "Canadian dollars" then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?

Dare is not incorrect, but I was surprised at his reading of Mark. When I considered it carefully, though, I realized that Mark did leave himself open to that interpretation by not being explicit enough. As he clarified in comment to Dare:

The problem with virtually all uses of validation that I've seen is that this document would be rejected long before it even got to the bit of software which cared about currency. I'm arguing against the use of validation as a "gatekeeper", not against the practice of checking values to see whether you can process them or not ... I thought it goes without saying that you need to do that! 8-O

I actually think this is a misunderstanding that other readers might easily have, so I think it's good that Dare called him on it, and teased out the needed clarification. I missed it because I know Mark too well to ever imagine he'd ever go so far off in the weeds.

Of course the father of Schematron would have a response to reckon with in such debate, but I was surprised to find Rick Jelliffe so demure about Schematron. His formula:

schema used for validating incoming data only validates traceable business requirements

Is flash-bam-alakazam spot on, but somewhat understated. Most forms of XML validation do us a disservice by making us nit-pick every detail of what we can live with, rather than letting us make brief declarations of what we cannot live without. Yes, Schematron's phases provide a powerful mechanism for elegant modularization of the expression of rules and requirements, but long before you go that deep Schematron sets you free by making validation an open rather than a closed operation. The gains in expressiveness thereby provided are near astonishing, and this is despite the fact that Schematron is a less terse schema language than DTD, WXS or RELAX NG.
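To make the contrast concrete, here is a tiny Schematron rule of the "declare only what you cannot live without" kind, checked with lxml's isoschematron support; the purchase-order example is mine, not from any of the posts quoted above:

from lxml import etree, isoschematron

schema = etree.XML("""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="purchaseOrder">
      <assert test="currency">A purchase order must state its currency.</assert>
    </rule>
  </pattern>
</schema>""")

# Unknown elements such as <gadget/> are simply ignored; only the stated
# business requirement is checked, so the document below validates.
doc = etree.XML("<purchaseOrder><currency>USD</currency><gadget/></purchaseOrder>")

print(isoschematron.Schematron(schema).validate(doc))   # True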

Of course XML put us on the road to unnecessary bondage and discipline on day one when it made it so easy, and even a matter of recommendation, to top each document off with a lordly DTD. Despite the fact that I think Microformats are a rickety foundation for almost anything useful at Web scale, I am hopeful they will act as powerful incentive for moving the industry away from knee-jerk validation.

[Uche Ogbuji]

via Copia