Triclops gets a facelift, new query management capabilities, and new APIs

by Chimezie Ogbuji

I recently had a need to manage a set of queries against an OWL2 EL biomedical ontology: the Foundational Model of Anatomy. I have an open source SPARQL service implementation that I had been thinking about extending with support for managing queries. It’s called Triclops and is part of a collection of RDF libraries and tools I have been accumulating. The name is a reference to an initial attempt to build an RDF querying and navigation interface as part of the 4Suite repository back in the day (circa 2002).

This later evolved into a very rudimentary web interface that sat in front of the Oracle 11g and MySQL/SPARQL patient dataset that Cyc’s SKSI interacted with. This was part of an interface tailored to the task of identifying patient cohorts, known as the Semantic Research Assistant (SRA). A user could dispatch handwritten SPARQL queries, browse clickable results, or return them as CSV files. This capability was only used by informaticians familiar with the structure of the RDF dataset; most investigators used the SRA.

It also implemented a RESTful protocol for ticket-based querying that was used for stopping long-running SPARQL/MySQL queries. This is not currently documented. Around the time this was committed as an Apache-licensed library on Google Code, layercake-python added core support for APIs that treated remote SPARQL services as local Graph objects, as well as general support for connecting to SPARQL services. This was based on Ivan Herman’s excellent SPARQL Endpoint interface to Python.

Triclops (as described in the wiki) can now be configured as a “Proxy SPARQL Endpoint”. It can be deployed as a lightweight query dispatch, management, and mediation kiosk for remote and local RDF datasets. The former capability (dispatching) was already in place; the latter (mediation) can be performed using FuXi’s recent capabilities in this regard.

Specifically, FuXi includes an rdflib Store that combines its sideways-information passing (sip) strategies with the in-memory SPARQL algebra implementation for use as a general-purpose framework for semweb SPARQL (OWL2-RL/RIF/N3) entailment regimes. Queries are mediated over the SPARQL protocol using global schemas captured as various kinds of semweb ontology artifacts (expressed in a simple Horn form) that describe and distinguish their predicates by those instantiated in a database (or factbase) and those derived via the semantic properties of these artifacts.

So the primary capability that remained was query management, and this recent itch was scratched over the holidays. I discovered that CodeMirror, a JavaScript library that can be used to create a relatively powerful editor interface for code, had excellent support for SPARQL. I integrated it into Triclops as an interface for managing SPARQL queries and their results. I have a running version of this at http://metacognition.info/sparql/queryMgr. Note, the service is liable to break at any point as Webfaction kills off processes that use up a lot of CPU and I have yet to figure out how to configure it to restart the service when it dies in such a fashion.

The dataset this interface manages queries for is a semantic web of content comprising three of the primary ancient Chinese classical texts (the Analects, Doctrine of the Mean, and the Tao Te Ching). I record the information in RDF because it is an intuitive knowledge representation to use in capturing provenance, exposition, and other editorial metadata. Below is a screen shot of the main page listing a handful of queries, their name, last date of modification, date of last run, and number of solutions in the most recent result.

Main SPARQL service page

Above the list is a syntax-highlighted text area for dispatching ad hoc SPARQL queries. This is where CodeMirror is integrated. If I click on the name of the query titled “Query for Analects and the Doctrine of the Mean english chapter text (Confucius)”, I go to a similar screen with another text area whose content corresponds to the text of the query (see the screen shot below).

Main SPARQL service page

From here queries can be updated (by submitting updated CodeMirror content) or cloned (using the name field for the new copy). Alternatively, the results of previous queries can be rendered. This sends back a result document with an XSLT processing instruction that causes the browser to trigger a request for a stylesheet and render an XHTML document from content in the result document on the client side.
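That client-side rendering step amounts to nothing more than injecting an xml-stylesheet processing instruction into the SPARQL XML results document before it reaches the browser. Below is a minimal sketch (the stylesheet path is hypothetical; Triclops generates its own XSLT on the server):

def attach_stylesheet(resultDoc, xsltUrl):
    # Insert an xml-stylesheet PI right after the XML declaration so the
    # browser fetches the stylesheet and renders XHTML on the client side
    declaration, rest = resultDoc.split('?>', 1)
    pi = '<?xml-stylesheet type="text/xsl" href="%s"?>' % xsltUrl
    return declaration + '?>' + pi + rest

print(attach_stylesheet('<?xml version="1.0"?><sparql/>',
                        '/sparql/xslt/solutions.xslt'))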

Finally, a query can be re-executed against a dataset, saving the results and causing the information in the first screen to show different values for the last execution run (date and number of solutions). Results can also be saved or viewed as CSV using a different stylesheet against the result document.

The last capability added is a rudimentary template system where any variable in the query or text string of the form ‘$ …. $’ is replaced with a provided string or a URI. So, I can change the pick list value on the second row of the form controls to $searchExpression$ and type “water”. This produces a SPARQL query (visible with syntax highlighting via CodeMirror) that can be used as a template to dispatch queries against the dataset.
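A rough sketch of that substitution step (not Triclops’ actual code; the template query and variable below are purely illustrative) could be as simple as a regular-expression pass over the query text:

import re

def expand_template(queryText, bindings):
    # Replace each $name$ token with a quoted string or an angle-bracketed URI
    def substitute(match):
        value = bindings[match.group(1)]
        return '<%s>' % value if value.startswith('http') else '"%s"' % value
    return re.sub(r'\$(\w+)\$', substitute, queryText)

template = 'SELECT ?chapter { ?chapter ?p ?text FILTER regex(?text, $searchExpression$) }'
print(expand_template(template, {'searchExpression': 'water'}))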

In addition, solutions for a particular variable can be used for links, providing a small framework for configurable navigation workflows. Suppose I enter “[Ww]ater” in the field next to $searchExpression$, select classic from the pick list at the top of the Result navigation template area, pick “Assertions in a (named) RDF graph” from the next pick list, and enter the graphIRI variable in the subsequent text input.

Triggering this form submission will produce the result screen pictured below. As specified in the form, clicking any of the dbpedia links for the Doctrine of the Mean will trigger the invocation of the query titled “Assertions in a (named) RDF graph”, shown below (with the graphIRI variable pre-populated with the corresponding URI):

SELECT DISTINCT ?s ?p ?o where {
    GRAPH ?graphIRI {
      ?s ?p ?o
    }
}

Main SPARQL service page

The result of such an action is shown in the screen shot. Alternatively, a different subsequent query can be used: “Statements about a resource”. The relationship between the schema of a dataset and the factbase can be navigated in a similar way: pick the query titled “Classes in dataset”, select “Instances of a class and graph that the statements are asserted in” from the middle pick list of the Result navigation template section, and enter ?class in the text field to the right of this. Selecting ‘Execute..’ then produces a clickable result set comprised of classes of resources, and clicking any such link shows the instances of that class.

Main SPARQL service page

This latter form of navigation seems well suited for exploring datasets for which either there is no schema information in the service or it is not well known by the investigator writing the queries.

In developing this interface, at least two architectural principles were re-used from my SemanticDB development days: the use of XSLT on the client side to build rich, offloaded (X)HTML applications and the use of the filesystem for managing XML documents rather than a relational database. The latter (use of a filesystem) is particularly relevant where querying across the documents is not a major requirement or even a requirement at all. The former is achieved by setting the processing instruction of a result document to refer to a dynamically generated XSLT document on the server.

The XSLT creates a row-distinguishing, tabular interface where the links in certain columns trigger queries via a web API that takes various inputs, including: the variable in the current query whose solutions are ‘streamed’, a (subsequent) query specified by some function of the MD5 hash of its title, a variable in that query that is pre-populated with the corresponding solution, etc.:

../query=...&action=update&innerAction=execute,templateValue=...,&valueType=uri&variable=..
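For illustration, a link of that shape might be assembled along the following lines (a sketch only: the parameter names follow the example above, the endpoint path mirrors the queryMgr URL, and the exact hash function Triclops uses is not reproduced here):

import hashlib
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

def navigation_link(queryTitle, variable, value):
    # Identify the subsequent query by (a function of) the MD5 hash of its title
    params = {'query': hashlib.md5(queryTitle.encode('utf-8')).hexdigest(),
              'action': 'update',
              'innerAction': 'execute',
              'templateValue': value,
              'valueType': 'uri',
              'variable': variable}
    return '/sparql/queryMgr?' + urlencode(params)

print(navigation_link('Assertions in a (named) RDF graph', 'graphIRI',
                      'http://dbpedia.org/resource/Doctrine_of_the_Mean'))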

Eventually, the API should probably be made more RESTful and target the query, possibly leveraging some caching mechanism in the process. Perhaps it can even work in concert with the SPARQL 1.1 Graph Store HTTP Protocol.

Using Amara's pushtree for heavyweight XML processing in GRDDL and SPARQL querying

I’ve been using Amara to address my high throughput needs for Extract Transform Load (ETL), querying, and processing of large amounts of RDF. In one particular part of the larger process, I needed to be able to stream very large XML documents in a particular dialect into RDF/XML. I sent an email to the Akara Google group describing the challenges and my thoughts behind wanting to use a streaming XML paradigm rather than XSLT.

I basically want to leverage Amara’s pushtree and its use of coroutines as a minimal-overhead pipeline for dispatching events triggered by elements in the source XML, where the source XML is a GRDDL source document and the pushtree coroutine is the transformation property. That task is still a work in progress; in the interest of expedience I went forward and used XSLT, but I still need to try out some of what Uche suggested.

The other part where I have made much more progress is in streaming the results of SPARQL queries (against a SPARQL service) into a CSV file via the command line and with minimal overhead (also using Amara, pushtree, and coroutines). A recent set of changes to layercake-python modified the sparqler command-line to add an --endpoint option, which takes a SPARQL service URL. Other changes were made to the remote SPARQL service store to support this.

Also added, was a new sparqlcsv script:

$ sparqlcsv --help
Usage: sparqlcsv [options] [SPARQLXMLFilePath]
Options:
 -h, --help            show this help message and exit
 -q QUOTECHAR, --quoteChar=QUOTECHAR
                       The quote character to use
 -c, --count           Just count the results, do not serialize to CSV
 -d DELIMITER, --delimiter=DELIMITER
                       The delimiter to use
 -o FILEPATH, --output=FILEPATH
                       The path where to write the resulting CSV file

This script takes a SPARQL XML file either from the file indicated as the first argument or from STDIN if none is specified and writes out a CSV file to STDOUT or to a file. The general architectural idea is to build a bash pipeline from the SPARQL service to a CSV file (and eventually into a relational database for more sophisticated analysis) or to STDOUT for subsequent processing along the pipeline.

So, now I can run a query against Virtuoso and stream the CSV results into a file (with minimal processing overhead):

$ sparqler --owl=..OWL file.. --ns=..prefix..=..URL.. \
           --endpoint=..SPARQL service URL.. \
"SELECT ... { ... }" | sparqlcsv | .. subsequent processong ..

Where the namespaces in the OWL/RDF file (provided by the --owl option) and those given explicitly via the --ns option are added as namespace prefix definitions at the top of the SPARQL query that is dispatched to the remote SPARQL service located via the URL provided to the --endpoint option. Alternatively, the -o option can be used to specify a filename that the CSV content is written to.
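The prefix handling amounts to something like the following sketch (illustrative only; the real sparqler command derives the bindings from the --owl and --ns options):

def with_prefixes(query, namespaces):
    # Prepend a PREFIX declaration for each namespace binding
    header = '\n'.join('PREFIX %s: <%s>' % (prefix, uri)
                       for prefix, uri in namespaces.items())
    return header + '\n' + query

print(with_prefixes('SELECT ?s { ?s a owl:Class }',
                    {'owl': 'http://www.w3.org/2002/07/owl#'}))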

The sparqlcsv script uses a pushtree coroutine to stream XML content into a CSV file in this way:

def produce_csv(doc, csvWriter, justCount):
    # doc: SPARQL XML results document (as a string)
    # csvWriter: a csv writer instance
    # justCount: if True, only count solutions instead of writing CSV rows
    # (U, coroutine, Counter, entity_base, and pushtree are imports/helpers
    # defined elsewhere in the sparqlcsv script)
    cnt = Counter()

    @coroutine
    def receive_nodes(cnt):
        while True:
            node = yield
            if justCount:
                cnt.counter += 1
            else:
                rt = []
                badChars = False
                for binding in node.binding:
                    try:
                        rt.append(U(binding).encode('ascii'))
                    except UnicodeEncodeError:
                        rt.append(U(binding).encode('ascii', 'ignore'))
                        badChars = True
                        print >> sys.stderr, "Skipping character", U(binding)
                if badChars:
                    cnt.skipCounter += 1
                csvWriter.writerow(rt)

    # Dispatch each <result> element in the solution sequence to the coroutine
    target = receive_nodes(cnt)
    pushtree(doc, u'result', target.send, entity_factory=entity_base)
    target.close()
    return cnt

Where doc is an XML document (as a string), csvWriter is an instance of the csv module’s writer object, and the last parameter indicates whether only the size of the solution sequence should be returned rather than the resulting CSV.
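Wiring it up inside the script would look roughly like this (a sketch: the Counter class and other helpers are sparqlcsv’s own):

import csv
import sys

# Feed a SPARQL XML results document from stdin through produce_csv,
# streaming CSV rows to stdout
writer = csv.writer(sys.stdout, delimiter=',', quotechar='"')
produce_csv(sys.stdin.read(), writer, justCount=False)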

My Thoughts after 7 Years of Being a Web-based Patient Registry Architect

Monday January 17th, 2011 is my last day at the Cleveland Clinic, where I was Lead Systems Analyst for 7 years working on a very exciting project whose goal was to replace a relational Cardiovascular Information Registry that supported research in cardiovascular medicine and surgery, and to consolidate data management for research purposes in the surgery department.  Eventually, our Clinical Investigation unit became part of the Heart and Vascular Institute.

The long-term goal was to create a framework for context-free data management systems in which expert-provided, domain-specific knowledge is used to control all aspects of data entry, storage, display, retrieval, communication, and formatting for external systems.  By ‘context-free’, I mean that the framework can be used for any domain (even outside of medicine) and nothing about the domain is assumed or hardcoded.  The use of metadata was envisioned as key to facilitating this capability and to this end the use of RDF was effective as a web and logic-based knowledge representation.

At the time, I was unemployed in the aftermath of the post-9/11 dot-com bubble, a period of great risk aversion to using the emerging technologies of the time: XML, RDF, XSLT, Python, REST, etc.  I was lucky that in the city where I was born, a couple of miles (literally) from where I was born and where my mother worked, there was a great job opportunity under a mandate from our director for a de novo, innovative architecture.  I went for the interview and was fortunate to get the job.

Four years later (in the fall of 2007), it was deployed for production use by nurses, coders, and researchers, built on top of the predecessor of Akara, an XML & RDF content repository:

The diagram above documents the application server and services stack that is used by the various components that interface with the repository.  The web services paradigm was never used and most of the service-oriented architecture was pure HTTP with XML (i.e., POX/HTTP).

Mozilla Firefox with the XForms extension was used for about 5 years for the complete collection of longitudinal patient record content for a little over 200,000 patients who had operations with cardiac(-thoracic) surgical component(s) in a registry.  The architectural methodology was such that the use and deployment of infrastructure (entire data collection screens, XML and RDF schemas, data transformations, etc.) was significantly automated.  W3C document management and semantic web representation standards (HTTP, RDF, XML, N3, OWL, and SPARQL) were used to ensure interoperability, extensibility, and automation, in particular as infrastructure for both (certified) quality measure reporting and a clinical research repository.

I wanted to take the time to coalesce my experience in working on this project and share some of the key desiderata, architectural constraints (that perhaps comprise an architectural style), and opportunities in using these emerging technologies to address the primary engineering challenges of clinical research repositories and patient registries.  The following functionalities stood out as very implementable for such an approach: 

  • patient record abstraction (data entry via knowledge-generated input screens)
  • workflow management
  • data quality management
  • identification of patient cohorts for study
  • data export for statistical analysis

Patient Record Abstraction

XForms brings a rich history of browser-based data entry to bear as a comprehensive, declarative syntax that is part of a new architectural paradigm and works well with the architectural style of the World Wide Web: REST.  It abstracts widgets, controls, data bindings, logic, remote data management, integration into a host language, and other related rich internet application requirements.

I had to check the Wayback Machine but found the abstract of the 2006 presentation: “The Essence of Declarative, XML-based Web Applications: XForms and XSLT”.  There I discussed best practices, common patterns, and pitfalls in using XSLT as a host language for generating web-based user interfaces expressed in XForms.  The XForms-generating infrastructure was quite robust, and everything from screen placement, behavior, range checking, drop-down lists, and data field dependencies was described in an XML document written using a microformat designed for templating user interfaces for patient registry data collection.

All in all (and I hope John writes about this at some point since he was the primary architect of the most advanced manifestation of this approach), the use of a microformat for documenting and generating an XForms framework for controlled, validated, form-based data collection proved more than adequate and allowed the (often unpredictable) requirements of national registry reporting and our clinical studies to dictate automatically deployed changes to a secure (access-controlled) patient record web portal and repository.

Regarding quality management, in December 2007, John and I were supposed to present at XML 2007 about how

XForms provides a direct interface to the power of XML validation tools for immediate and meaningful feedback to the user about any problems in the data. Further, the use of validation components enables and encourages the reuse of these components at other points of entry into the system, or at other system boundaries.

I also found the details of the session (from the online conference schedule) on the Wayback Machine from the XML 2007 conference site.  The presentation was titled: “Analysis of an architecture for data validation in end-to-end XML processing systems”.

Client-side XForms constraint mechanisms as well as server-side Schematron validation were the basis for quality management at the point of data entry (which is historically the most robust way to address errors in medical record content).  Altogether, the Mozilla browser platform presents quite an opportunity to offload sophisticated content management capabilities from the server to the client, and XML processing and pipelining played a major role in this regard.

Workflow Management

See: A Role for Semantic Web Technologies in Patient Record Data Collection, where I described my chapter in the Linking Enterprise Data book regarding the use of semantic web technologies to facilitate patient record data collection workflow.  In the Implementation section, I also describe how declarative AJAX frameworks such as Simile Exhibit were integrated for use in faceted browsing and visualization of patient record content.  RDF worked well as the state machine of a workflow engine.  The digital artifacts involved in the workflow application and the messages sent back and forth from browser to server are XML and JSON documents.  The content in the documents is mirrored into an RDF dataset describing the state of the workflow task, which can be queried when listing the workflow tasks associated with the currently logged in user (for instance).
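As a hypothetical illustration of that query path (the wf: vocabulary and task shape below are invented, not the SemanticDB ones), listing the open tasks assigned to the current user against the mirrored dataset could look like this with rdflib:

from rdflib import ConjunctiveGraph, Namespace

WF = Namespace('http://example.org/workflow#')

def open_tasks_for(dataset, user):
    # Query the mirrored RDF dataset that acts as the workflow state machine
    query = """
    SELECT ?task ?state WHERE {
        ?task wf:assignedTo ?user ;
              wf:state ?state .
        FILTER(?state != wf:Completed)
    }"""
    return dataset.query(query, initNs={'wf': WF},
                         initBindings={'user': user})

dataset = ConjunctiveGraph()
for task, state in open_tasks_for(dataset, WF['nurse01']):
    print(task, state)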

This is an archetype of an emerging pattern where XML is used as the document and messaging syntax and a semantics-preserving RDF rendering of the XML content (mirrored persistently in an RDF dataset) is used as the knowledge representation for inference and querying.  More on this archetype later.  

Identification of Patient Cohorts for Study

Eric Prud’hommeaux has been doing a lot of excellent infrastructure work (see: SWObjects) in the Semantic Web for Healthcare and Life Sciences Interest Group around the federated use of SPARQL for querying structured, discrete EHR data in a meaningful way and in a variety of contexts: translational research, clinical observations interoperability, etc.  His positive experience has mirrored ours in the use of SPARQL as the protocol for querying a patient outcome registry and clinical research database.

However, we had the advantage of already having the data natively available as a collection of RDF graphs (each of which describes a longitudinal patient record) from the registry.  A weekly ETL-like process recreates an RDF dataset from the XML collection and serves as the snapshot data warehouse for the operational registry, which relies primarily on XML payload for the documentation, data collection, and inter-system messaging needs.  Other efforts have been more focused on making extant relational EHR data available as SPARQL.  

This was the backdrop to the research we did with Case Western Reserve University Ph.D. students in the summer and fall of 2008 on the efficient use of relational algebra to evaluate the SPARQL language in its entirety.

GRDDL Mechanisms and Dual Representation

Probably the most substantive desideratum or opportunity in patient record systems (and registries) of the future is as hard to articulate as it is to (frankly) appreciate and understand.  However, I recently rediscovered the GRDDL use case that does a decent job of describing how GRDDL mechanisms can be used to address the dual syntactic and semantic interoperability challenges in patient registry and record systems.

XML is the ultimate structured document and messaging format, so it is no surprise that it is the preferred serialization syntax of the de facto messaging and clinical documentation standard in healthcare systems.  There are other light-weight alternatives (JSON), but it is still the case that XML is the most pervasive standard in this regard.  However, XML is challenged in its ability to specify the meaning of its content in a context-free way (i.e., semantic interoperability).  Viewing RDF content as a rendering of this meaning (a function of the structured content) and viewing an XML vocabulary as a special-purpose syntax for RDF content is a very useful way to support (all in the same framework):

  • structured data collection
  • standardized system-to-system messaging
  • automated deductive analysis and transformation
  • structural constraint validation.

In short, XML facilitates the separation of presentation from content and semantic-preserving RDF renderings of XML facilitate the separation of syntax from semantics.  

The diagram above demonstrates how this separation acts as the modern version of the traditional boundary between transactional systems and their data warehouse in relational database systems.  In this updated paradigm, however, XML processing is the framework for the former, RDF processing is the framework for the latter, and XSLT (or any similar transform algorithm) is the ETL framework.
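A minimal sketch of that ETL step (file names are placeholders) using lxml for the GRDDL-style XSLT transform and rdflib for the resulting RDF:

from lxml import etree
from rdflib import Graph

# Apply the GRDDL transform: XML patient record in, RDF/XML rendition out
transform = etree.XSLT(etree.parse('patient-record-to-rdf.xslt'))
rdfxml = transform(etree.parse('patient-record.xml'))

# Load the faithful rendition into the RDF side of the divide for querying
graph = Graph()
graph.parse(data=etree.tostring(rdfxml), format='xml')
print(len(graph))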

In the end, I think it is safe to say that RDF and XSLT are very useful for facilitating semantic and syntactic automation in information systems.  I wish I had presented or written more on this aspect, but I did find some older documents on the topic along with the abstract of the presentation I gave with William Stelhorn during the first Semantic Technologies Conference in 2005:


The diagram above illustrates how a domain model (described in an RDF vocabulary) can be used by an automated component to generate semantic and syntactic schemas (OWL and RELAX NG respectively) as well as GRDDL transforms (as XSLT) that faithfully render the structured content in a machine-understandable knowledge representation (RDF).  In this way, much of the data management tooling and infrastructure can be managed by tweaking declarative documentation (not programming code) and automatically generating the infrastructure compiled specifically for a particular domain.

The Challenges

I have mostly mentioned where things worked well.  There were certainly significant challenges to our project, but most of them were not technical in nature.  This seems to be a recurring theme in medical informatics.

The main technical challenge has to do with support in RDF databases or triplestores for an appreciable amount of write as well as read operations.  This shortcoming will be particularly emphasized given the emerging SPARQL 1.1 specifications having to do with write operations becoming W3C recommendations and their inevitable adoption.  Most contemporary relational schemas for RDF storage are highly-normalized star schemas where there isn't a single fact table but rather (in the most advanced cases) this space is partitioned into separate tables divided along the general, distinct kinds of statements you can have in an RDF graph: statements where the predicate is rdf:type, statements where the object is an RDF literal, and statements where the object is an RDF URI.  So, it is more like a conjoined star schema, for lack of a better term.

This minimizes self-joins since the data space is partitioned amongst multiple tables and a query such as this:

SELECT ?MRN { ?PATIENT hasMedicalRecordNumber ?MRN }  

Would only require (assuming we knew a priori that the range of the hasMedicalRecordNumber property is RDF literals) using the single table for RDF statements where the object is an RDF literal.  If an additional triple pattern is added that matches RDF statements where the object is a URI (or Blank Node), then this results in a join between the two tables rather than a self-join on a single massive table for all the RDF statements.

In such an architecture, there are separate lookup tables that map internal identifiers (integers perhaps) to their full lexical form as URIs or RDF literals, for instance.  This enforces the re-use of terms so that space is not wasted if two RDF statements refer to the same URI.  Internally, they will refer to the same row in the RDF term lookup table.  This is often called string interning.
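A toy illustration of the interning scheme (nothing vendor-specific): each distinct term is stored once, and statements refer to it by an integer identifier.

class TermTable(object):
    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def intern(self, term):
        # Re-use the existing row for a term that has been seen before
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

terms = TermTable()
statement = tuple(terms.intern(t) for t in
                  ('http://example.org/patient/1', 'hasMedicalRecordNumber', '"12345"'))
print(statement, terms.id_to_term)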

However, this kind of normalization, which is significantly optimized for use cases that primarily involve read access (i.e., data warehouse or OLAP scenarios), does not bode well when updates need to be made to part of an RDF graph in an RDF dataset or to the entire graph.  If an RDF statement is removed and it is the only one that references a particular RDF URI, that URI will need to be garbage collected or removed from the RDF term lookup table.  It is such optimizations for OLAP behavior that almost require that high-volume updates to RDF content happen as massive ETL jobs where the entire RDF collection of patient record content is replaced, rather than doing so one patient record graph at a time.

This fundamental challenge is the same reason why (some time back) the content repository underlying SemanticDB switched from using RDF to manage the state of the repository (file modification dates, internet media types associated with artifacts, etc.) to using cached, in-memory, pre-parsed XML.  It is also the same reason why we didn't use this capability to allow modifications to patient record documents (via XForms data collection screens) to be immediately reflected into the underlying RDF dataset.  In both cases, writing to the RDF dataset became the prominent bottleneck as the RDF database was not able to support Online Transactional Processing (OLTP).

There was an additional semi-technical challenge that had (and still has) to do with the lack of a small, broad, uniform, upper ontology of clinical medicine.  It needs to be much smaller than SNOMED-CT but still able to cover the same material, hence my description of it as an upper ontology.  There are at least 2 projects I know of regarding this:

The latter is my project and was motivated (from the beginning) by this exact problem.  The other problems are massive but not technical.  The two primary ones (in my opinion) are the lack of understanding of - on the one hand - how to conceive and develop a business strategy around open source, open communities, open standards, and open data that relies more on services, the competitive advantage of using emerging technologies where appropriate, and leverages the catalyzing power of web 2.0 / 3.0 technology and social media and - on the other hand - better communication of the merits of semantic web technologies in addressing the engineering and infrastructure challenges of healthcare information systems.  Most of the people who can benefit from better communication in this regard are significantly risk-averse to begin with (another problem in its own right).

An example of this is that, in my opinion, the only new innovation that semantic web technologies bring to the table is the symbiotic combination of the architecture of the World Wide Web with knowledge representation.  The use of logic and ontologies to address semantic interoperability challenges predates the semantic web (and even predates Description Logic).  By being precise in describing this difference you can also be precise about describing the value, with very little risk of overstating it.

Any problem domain where the meaning of data plays a major role will benefit from traditional deductive databases and expert systems (such as Prolog or business rule systems, respectively) as easily as it would from semantic web technologies.  However, a problem domain where linking data, identifying concepts and digital artifacts in a universal and re-usable way, and leveraging web-based infrastructure is a major factor will benefit from semantic web technologies in a way that it wouldn't from the traditional alternatives.  This simplification of the value proposition message (for consumption by risk-averse laypeople) also helps to sharpen the distinctions between the markets that a business strategy can target as well as the engineering problems these emerging technologies should (and should not) attempt to address.  A sharper, straightforward message is needed to break the political and generational barriers that retard the use of potentially transformational technologies in this field, which is a major contributor to the economic instability of this country.

Many of these technological opportunities transfer directly over for use in Patient Controlled Health Records (PCHR) systems.  I also think much of the risk aversion associated with the atmosphere I found myself in after leaving Fourthought (for instance) and generally in large institutions contributes to why the very evident opportunities in leveraging rich web application (“web 2.0”) and semantic web infrastructure (“web 3.0”) have not had as much penetration in healthcare information technology as one would expect.

My new job is as a Senior Research Associate with the Center for Clinical Investigation in the Case Western School of Medicine.  I will be managing Clinical and Translational Science Collaboration (CTSC) clinical, biomedical, and administrative informatics projects as well as designing and developing the informatics infrastructure that supports this.  Coupled with being a part-time Ph.D. student, I will still essentially be doing clinical research informatics (i.e., the use of informatics to facilitate biomedical and health research); however, the focus will be on translational research: translating the findings of basic research more quickly and efficiently into medical practice and meaningful health outcomes: physical, mental, or social.  So, I imagine the domain will be closer to the biology end of the spectrum and there will be more of an explicit emphasis on collaboration.

I have thoroughly enjoyed my time working on such a challenging project, in such a world-renowned institution, and under such a visionary mandate.  I was privileged to be able to represent the Cleveland Clinic in the W3C through the various working groups developing standards relevant to the way in which we were leveraging semantic web technologies:

  • Semantic Web for Healthcare and Life Sciences Interest Group
  • Data Access Working Group
  • GRDDL Working Group 

If not for the exposure at CCF to the great challenges in medical informatics and the equally great opportunities to address them, I would probably never have considered going back to seek a Ph.D. in this field.  Although I will be working with a different institution, it will still essentially be in the University Circle area and only about a 20 minute walk from where I was at the Cleveland Clinic.  I'm very proud of what we were able to do and I'm looking forward to the future.

A Role for Semantic Web Technologies in Patient Record Data Collection

I found out today that not only is the Linking Enterprise Data book now available, but it is also freely available online as well as through other avenues (Springer and pre-order on Amazon):

Linking Enterprise Data is the application of Semantic Web architecture principles to real-world information management issues faced by commercial, not-for-profit and government enterprises. This book aims to provide practical approaches to addressing common information management issues by the application of Semantic Web and Linked Data research to production environments.


I wrote a chapter ("A Role for Semantic Web Technologies in Patient Record Data Collection") discussing the debate around SOAP-based web services and Representational State Transfer (REST) that focuses on a specific, deployed use case that emphasizes the role of the Semantic Web, a simple Web application architecture that leverages the use of declarative XML processing, and the needs of a workflow system for patient record data collection.  It touches just a bit on the use of XForms to manage patient record content as special-purpose XML dialects for RDF graphs that I mentioned in my last post, but is mostly focused on how to use RDF to manage workflow state to orchestrate the collection of patient data.

Business Process Management Systems (BPMS) are a component of the stack of Web standards that comprise Service Oriented Architecture (SOA). Such systems are representative of the architectural framework of modern information systems built in an enterprise intranet and are in contrast to systems built for deployment on the larger World Wide Web. The REST architectural style is an emerging style for building loosely coupled systems based purely on the native HTTP protocol. It is a coordinated set of architectural constraints with a goal to minimize latency, maximize the independence and scalability of distributed components, and facilitate the use of intermediary processors. Within the development community for distributed, Web-based systems, there has been a debate regarding the merits of both approaches. In some cases, there are legitimate concerns about the differences in both architectural styles. In other cases, the contention seems to be based on concerns that are marginal at best.

In this chapter, we will attempt to contribute to this debate by focusing on a specific, deployed use case that emphasizes the role of the Semantic Web, a simple Web application architecture that leverages the use of declarative XML processing, and the needs of a workflow system. The use case involves orchestrating a work process associated with the data entry of structured patient record content into a research registry at the Cleveland Clinic’s Clinical Investigation department in the Heart and Vascular Institute.

Why XML-based web forms are an excellent platform for clinical data entry into RDF

Uche and I have written a bit on XForms on copia. I've recently been motivated to better articulate why I think the use of XForms, Plain Old XML (POX), and GRDDL (or faithful renditions of RDF graphs if you will) is a more robust web architecture for managing mutable RDF content for the purpose of research data management than other thin-client approaches, for instance.

Some time ago, I asked:

Are there examples of tools or architectures that demonstrate support for the Model View Controller (MVC) paradigm for data entry directly against RDF content? It seems to me that there is an inherent impedance mismatch between what is needed for an efficient, document-hosted, binding-oriented architecture for data entry and the amorphous nature of RDF, as well as the cost of using RDF querying as a mechanism for binding data to UI elements.

In my experience since 2006 as a software architect of web applications that use XForms to manage patient record documents as RDF graphs, I've come to appreciate that the 'CRUD problem' of RDF might have good protocol solutions being developed right now, but the question of whether there is anything more robust for forms-based data collection than declarative, auto-generated XForms that manage RDF datasets is a more difficult one, I think.

My personal opinion is that the nature of the abstract syntax of an RDF graph (as opposed to the tree underlying the XML infoset), its impact on binding RDF resources to widgets, and the ubiquitous use of warehouse relational schemas as infrastructure for RDF datasets in databases will always be an insurmountable performance impediment for alternative solutions at larger volumes that are more robust than using XForms to manage an XML collection on a filesystem as a faithful rendition of an RDF dataset.

RDF/SQL databases are normalized and optimized more for read than for write - with asymptotic consequences to write operations. An architecture that directly manages very large numbers (millions) of RDF triples will be faced with this challenge. The OLTP / OLAP divide in legacy relational architecture is analogous to the use of XML and RDF in those respective roles and is an intuitive architectural style for using knowledge representation in content management systems. GRDDL and its notion of faithful  renditions can be used to manage this divide as infrastructure for contemporary content management systems. 

For the purpose of read-only browsing, however, RDF lenses and facets are a useful alternative. But if you need support for controlled vocabularies, heavily-dependent constraint validation, declarative and auto-generated templating, and large amounts of concurrent data entry over large amounts of RDF data, the rich web architecture backplane is just very robust in my experience and in others' as well.

I had to dig into the Wayback Machine to find the XML technologies presentation John and I were supposed to give in December of 2007 (right before my life changed forever). I need to bug him to put copies of those slides on his weblog about using XForms with Schematron for real-time validation as a component of data entry.

Numerical type with units - via Python Cookbook

I implemented dimensions.py perhaps eight years ago as an exercise and have used it occasionally ever since.

It allows doing math with dimensioned values in order to automate unit conversions (you can add m/s to mile/hour) and dimensional checking (you can't add m/s to mile/lightyear). It specifically does not convert 212F to 100C but rather will convert 9F to 5C (valid when converting temperature differences).

It is similar to unums (http://home.scarlet.be/be052320/Unum.html) but with a significant difference:

I used a different syntax Q(25,'m/s') as opposed to 100*m/s (I recall not wanting to have all the base SI units directly in the namespace). I'm not entirely sure which approach is really better.

I also had a specific need to have fractional exponents on units, allowing the following:

>>> km=Q(10,'N*m/W^(1/2)')
>>> km
Q(10.0, 'kg**0.5*m/s**0.5')

Looking back I see a few design decisions I might do differently today, but I'll share it anyway.

Some examples are in the source below the line with if __name__ == "__main__":

Note that I've put two files into the code block below, dimensions.py and dimensions.data, so please cut them apart if you want to try it.

Very impressive library. I recently incorporated the use of the Measurement Unit Ontology into the Computer-based Patient Record (CPR) ontology and (on the surface) it seems like a library like this can provide the unit conversion machinery for RDF instances that use such a framework.

Semantic Web Hubris

I've been reading The Icarus Syndrome, so the word Hubris has been on my mind lately. I woke up with what I hope is a more concise way to express something I've been going on and on about:

RDF is a web-based approach to knowledge representation (KR). Linked Data is an idea that an important component of serendipity in this KR is the addressability of its names over HTTP. The idea that this is the most important component of its serendipity is hubris.

True Knowledge - A logic-based web question-answering platform

The world's first AI question-answering platform.

We are using our unique semantic technology to build the first internet-scale platform for directly answering the world's questions. As knowledge is added to the platform we understand and answer more and more.

The True Knowledge session at the Semantic Technologies Conference 2010 was where I first heard of this; I tried to use their web-based interface during their presentation and was very impressed by it. It includes a justification trace of how answers are reached and handles things such as temporal reasoning as well.  There is also a Google Chrome extension that enhances Google search results with answers to the same questions.

-- Chimezie

SNOMED-CT Management via Semantic Web Open Source Tools Committed to Google Code

[by Chimezie Ogbuji]

I just committed my working copy of the set of tools I use to manipulate and serialize SNOMED-CT (the Systematized Nomenclature of Medicine) and the Foundational Model of Anatomy (FMA) as OWL/RDF for use in the clinical terminology research I’ve been doing lately. It is still in a very rough form and probably not usable by anyone other than a Python / Semantic Web hacker such as myself. However, I’m hoping to get it to a shape where it can be used by others. I had hesitated to release it mostly because of my concerns around the SNOMED-CT license, but I’ve been assured that as long as the hosting web site is based in the United States and (most importantly) the software is not released with the SNOMED distribution, it should be okay.

I have a (mostly empty) Wiki describing the command-line invocation. It leverages InfixOWL and rdflib to manipulate the OWL/RDF. Basically, once you have loaded the delimited distribution into MySQL (the library also requires MySQLdb and an instance of MySQL to work with), you can run the command-line, giving it one or more lists of SNOMED-CT terms (by their identifiers), and it will return an OWL/RDF representation of an extract from SNOMED-CT around those terms.
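For a sense of the kind of OWL/RDF the extraction produces, here is a sketch using plain rdflib (rather than InfixOWL’s operator syntax) that builds an existential restriction like the one rendered below; the class and label names are illustrative:

from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS

SNO = Namespace('tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#')
g = Graph()
g.bind('sno', SNO)

# sno:findingSite some sno:SystemicArterialStructure
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, SNO.findingSite))
g.add((restriction, OWL.someValuesFrom, SNO.SystemicArterialStructure))

g.add((SNO.HypertensiveDisorderSystemicArterial, RDF.type, OWL.Class))
g.add((SNO.HypertensiveDisorderSystemicArterial, RDFS.subClassOf, restriction))
g.add((SNO.HypertensiveDisorderSystemicArterial, RDFS.label,
       Literal('Hypertensive disorder, systemic arterial')))

print(g.serialize(format='xml'))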

So, below is an example of running the command-line to extract a section around the term Diastolic Hypertension and piping the result to the FuXi command line in order to select a single class (sno:HypertensiveDisorderSystemicArterial) and render it using my preferred syntax for OWL (the Manchester OWL syntax):

$python ManageSNOMED-CT.py -e 48146000 -n short -s localhost -u ..mysql username.. --password=..mysql password.. -d snomed-ct | FuXi --ns=sno=tag:info@ihtsdo.org,2007-07-31:SNOMED-CT# --output=man-owl --class=sno:HypertensiveDisorderSystemicArterial --stdin
Class: sno:HypertensiveDisorderSystemicArterial
    ## Primitive Type (Hypertensive disorder) ##
    SNOMED-CT Code: 38341003 (a primitive concept)
    SubClassOf:
              Clinical finding
              Disease
              ( sno:findingSite some Systemic arterial structure )

Which renders an expression that can be paraphrased as

‘Hypertensive Disorder Systemic Arterial’ is a clinical finding and disease whose finding site is some structure of the systemic artery.

I can also take the Burn of skin example from the Wikipedia page on SNOMED and demonstrate the same thing, rendering it in its full (verbose) OWL/RDF/XML form:

<owl:Class rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#BurnOfSkin">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#findingSite"/>
      </owl:onProperty>
      <owl:someValuesFrom rdf:resource="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#SkinStructure"/>
    </owl:Restriction>
    <rdf:Description rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#ClinicalFinding"/>
    <owl:Restriction>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#associatedMorphology"/>
      </owl:onProperty>
      <owl:someValuesFrom rdf:resource="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#BurnInjury"/>
    </owl:Restriction>
    <rdf:Description rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#Disease"/>
  </owl:intersectionOf>
  <rdfs:label>Burn of skin</rdfs:label>
  <skos:scopeNote>disorder</skos:scopeNote>
  <skos:prefSymbol>284196006</skos:prefSymbol>
</owl:Class>

And then in its more palatable Manchester OWL form:

$ python ManageSNOMED-CT.py -e 284196006 -n short -s localhost -u ..username.. --password= -d snomed-ct | FuXi --ns=sno=tag:info@ihtsdo.org,2007-07-31:SNOMED-CT# --output=man-owl --class=sno:BurnOfSkin --stdin
Class: sno:BurnOfSkin
    ## A Defined Class (Burn of skin) ##
    SNOMED-CT Code: 284196006
    EquivalentTo:
      ( sno:ClinicalFinding and sno:Disease ) that
      ( sno:findingSite some Skin structure ) and (sno:associatedMorphology some Burn injury )

Which can be paraphrased as:

A clinical finding and disease whose finding site is some skin structure and whose associated morphology is injury via burn

The examples above use the ‘-n short’ option, which renders extracts in OWL via the short normal form, a procedure described in the SNOMED-CT manuals that produces a more canonical representation by eliminating redundancy in the process. It currently only works with the 2007-07-31 distribution of SNOMED-CT, but I’m in the process of updating it to use the latest distribution. The latest distribution comes with its own OWL representation and I’m still trying to wrap my head around some quirks in it involving role groups and whether or not this library would need to change so it works directly off this OWL representation instead of the primary relational distribution format. Enjoy,

Coarser Grain Linked CRUD Data

I've written before about my thoughts on the Linked Data movement and my concerns with its mandate regarding the kinds of URIs you use in your RDF.  I think, though well intentioned, that particular constraint is a bit too tight and doesn't consider the (hidden) expense and distraction of forcing a producer of RDF content to ensure that all his or her URIs have a web presence.

However, I've been thinking about what a RESTful protocol for interacting with an RDF dataset would look like. Maybe requiring HTTP URIs works better at a different level of granularity.

The Semantic Web is meant to be an extension of the Web.  The architecture of the web is abstracted in the REST style. 

The REST interface is designed to be efficient for large-grain hypermedia data. 

So, if the Semantic Web is an extension of the Web, then a RESTful focus of protocol interactions with RDF content should also occur at a large grain.  An abundance of dereferenceable URIs as the terms of the RDF statements in a graph, and the random outbound traversal along them, can lead to an unnecessary (and very redundant) load on the origin server if the consumer of the graph assumes all RDF vocabulary tokens denote things with a web presence.  The cache-ability of the REST style (due to its statelessness) does offset some of this burden, but consider the hypertext analogy of (fine grained) Linked Data.  Imagine if a browser were to automatically load all the outbound a/@href links in an (X)HTML document independent of which ones the user chooses to click.  Even with the ability to cache responses to those requests, it still is an unnecessary burden on the browser (and origin server).

Now consider applying the same principle at a coarser, larger grain: the URIs of the named graphs in an RDF dataset.  The relationship between an RDF graph in an RDF dataset, the RDF document that serializes that RDF graph, and the referent of the graph URI (the thing it identifies) is confusing.  However, it can be better understood if you think of the relationship between the RDF graph (a knowledge representation) and its mathematical interpretations.

The RDF document, the concrete syntax representation of an RDF graph, is what is passed around over HTTP.  The graph denotes some semantic web knowledge (interpreted via mathematical logic) and we might want to manipulate that knowledge through the request for and dispatching of RDF documents in various formats: N3, Turtle, NT, Trix, Trig, RDF/XML over HTTP.

If I wanted to implement a Facebook social network as a semantic web, I would store all the knowledge / information about a person in an RDF graph so it can be managed over the hypermedia protocol of the web.  Higher-order RDF vocabularies such as OWL2-RL, RDFS, OWL2-DL, etc. can be used to describe Facebook content as a complex domain.  The domain can describe what a Facebook account holder would store about the things he has asserted knows and likes relationships against, for instance.  So, I'd definitely want to be able to get useful information from requests and responses over HTTP against Facebook account identifiers.  The transitive (Web) closure of such a Facebook graph along a fb:knows predicate, up to a certain recursion depth, would be useful to have.  This is an example of a graph link that is useful to a web agent capable of interpreting RDF content: a semantic web agent.
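A hypothetical sketch of that closure computation (the fb: vocabulary, graph document, and account URI are all invented) over an account's RDF graph:

from rdflib import Graph, Namespace

FB = Namespace('http://example.org/facebook#')

def knows_closure(graph, start, depth):
    # Breadth-first expansion along fb:knows, up to a fixed recursion depth
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {friend
                    for person in frontier
                    for friend in graph.objects(person, FB.knows)
                    if friend not in seen}
        seen |= frontier
    return seen

g = Graph().parse('account-graph.rdf')   # placeholder account graph document
print(knows_closure(g, FB['chimezie'], 2))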

The hypertext analogy of coarse-grained Linked Data would basically be your current experience browsing Facebook content in your browser: you (for the most part) see updates regarding only the people you know, not everything that was said by anybody in the complete Facebook dataset.  However, the RDF people known by those you know, and the RDF things described by friends of yours and their friends, are useful and worth an attempt to 'interpret' in order to determine (for instance) how to display them in the browser, or other such entailments from descriptions of content.