My Thoughts after 7 Years of Being a Web-based Patient Registry Architect

Monday January 17th, 2011 is my last day at the Cleveland Clinic where I was Lead Systems Analyst for 7 years working on a very exciting project with a goal to replace a relational Cardiovascular Information Registry that supported research in cardiovascular medicine and surgery and consolidate data management for research purposes in the surgery department.  Eventually, our Clinical Investigation unit became part of the Heart and Vascular Institute.  

The long-term goal was to create a framework for context-free data management systems in which expert-provided, domain-specific knowledge is used to control all aspects of data entry, storage, display, retrieval, communication, and formatting for external systems.  By ‘context-free’, I mean that the framework can be used for any domain (even outside of medicine) and nothing about the domain is assumed or hardcoded.  The use of metadata was envisioned as key to facilitating this capability and to this end the use of RDF was effective as a web and logic-based knowledge representation.

At the time, I was unemployed soon after the post 9-11 dot.com economic and innovation bubble wherein there was great risk aversion to using the emerging technologies of the time: XML, RDF, XSLT, Python, REST, etc.  I was lucky that in the city whereI was born, a couple of miles (literally) from where I was born and my mother worked there was a great job opportunity under a mandate from our director for a de novo, innovative architecture.  I went for the interview and was fortunate to get the job.

Four years later (in the fall of 2007), It was deployed for production use by nurses, coders, and researchers and built on top of the predecessor of Akara, an XML & RDF content repository:  

The diagram above documents the application server and services stack that is used by the various components that interface with the repository.  The web services paradigm was never used and most of the service-oriented architecture was pure HTTP with XML (i.e., POX/HTTP)

Mozilla’s Firefox with XForms extension was used for about 5 years for the complete collection of longitudinal patient record content for a little over 200,000 patients who had operations with cardiac(-thoracic) surgical component(s) in a registry.  The architectural methodology was such that the use and deployment of infrastructure (entire data collection screens, XML and RDF schemas, data transformations, etc.) was significantly automated.  W3C document management and semantic web representation standards (HTTP, RDF, XML, N3, OWL, and SPARQL) were used to ensure interoperability, extensibility, and automation, in particular as infrastructure for both (certified) quality measure reporting and a clinical research repository.

I wanted to take the time to coalesce my experience in working on this project and share some of the key desiderata, architectural constraints (that perhaps comprise an architectural style), and opportunities in using these emerging technologies to address the primary engineering challenges of clinical research repositories and patient registries.  The following functionalities stood out as very implementable for such an approach: 

  • patient record abstraction (data entry via knowledge-generated input screens)
  • workflow management
  • data quality management
  • identification of patients cohorts for study
  • data export for statistical analysis

Patient Record Abstraction

XForms brings a rich history of browser-based data entry to bear as comprehensive, declarative syntax that is part of a new architectural paradigm and works well with the architectural style of the World Wide Web: REST.  It abstracts widgets, controls, data bindings, logic, remote data management, integration into a host language, and other related rich internet application requirements.  

I had to check the way back machine but found the abstract of the 2006 presentation: “The Essence of Declarative, XML-based Web Applications: XForms and XSLT”.  There I discussed best practices, common patterns, and pitfalls in using XSLT as a host language for generating web-based user interfaces expressed in XForms.  The XForms-generating infrastructure was quite robust and everything from screen placement, behavior, range checking, drop-down lists, and data field dependencies were described in an XML document written using a microformat designed for templating user interfaces for patient registry data collection.

All in all (and I hope John writes about this at some point since he was the primary architect of the most advanced manifestation of this approach), the use of a microformat for documenting and generating an XForms framework for controlled, validated, form-based data collection was very adequate and sufficed to allow the (often unpredictable) requirements of national registries reporting and our clinical studies to dictate automatically deployed changes to a secure (access controlled) patient record web portal and repository.

Regarding quality management, on December 2007, John and I were supposed to present at XML 2007 about how

XForms provides a direct interface to the power of XML validation tools for immediate and meaningful feedback to the user about any problems in the data. Further, the use of validation components enables and encourages the reuse of these components at other points of entry into the system, or at other system boundaries.

I also found the details of the session (from the online conference schedule) on the wayback machine from the XML 2007 conference site.  The presentation was titled: “Analysis of an architecture for data validation in end-to-end XML processing systems.

Client-side XForms constraint mechanisms as well as server-side schematron validation was the basis for quality management at the point of data entry (which is historically the most robust way to address errors in medical record content).  All together, the Mozilla browser platform presents quite an opportunity to offload sophisticated content management capabilities from the server to the client and XML processing and pipelining played a major role in this regard.  

Workflow Management

See: A Role for Semantic Web Technologies in Patient Record Data Collection where I described my chapter in the LInked Enterprise Data book regarding the use of semantic web technologies to facilitate patient record data collection workflow.  In the Implementation section, I also describe how declarative AJAX frameworks such as Simile Exhibit were integrated for use in faceted browsing and vizualization of patient record content.  RDF worked well as the state machine of a workflow engine.  The digital artifacts involved in the workflow application and the messages sent back and forth from browser to server are XML and JSON documents.  The content in the documents are mirrored into an RDF dataset describing the state of the workflow task which can be queried when listing the workflow tasks associated with the currently logged in user (for instance).  

This is an archetype of an emerging pattern where XML is used as the document and messaging syntax and a semantics-preserving RDF rendering of the XML content (mirrored persistently in an RDF dataset) is used as the knowledge representation for inference and querying.  More on this archetype later.  

Identification of Patient Cohorts for Study

Eric Prud’hommeaux has been doing alot of excellent infrastructure work (see: SWObjects) in the Semantic Web for Healthcare and Life Sciences Interest Group around the federated use of SPARQL for querying structured, discrete EHR data in a meaningful way and in a variety of contexts: translational research, clinical observations interoperability, etc.  His positive experience has mirrored ours in the use of SPARQL as the protocol for querying a patient outcome registry and clinical research database.

However, we had the advantage of already having the data natively available as a collection of RDF graphs (each of which describes a longitudinal patient record) from the registry.  A weekly ETL-like process recreates an RDF dataset from the XML collection and serves as the snapshot data warehouse for the operational registry, which relies primarily on XML payload for the documentation, data collection, and inter-system messaging needs.  Other efforts have been more focused on making extant relational EHR data available as SPARQL.  

This was the backdrop to the research we did with Case Western Reserve University P.h.D students in the summer and fall of 2008 on the efficient use of relational algebra to evaluate the SPARQL language in its entirety.  

GRDDL Mechanisms and Dual Representation

Probably the most substantive desiderata or opportunity in patient record systems (and registries) of the future is as hard to articulate as it is to (frankly) appreciate and understand.  However, I recently rediscovered the GRDDL usecase that does a decent job of describing how GRDDL mechanisms can be used to address the dual syntactic and semantic interoperability challenges in patient registry and record systems.

XML is the ultimate structured document and messaging format so it is no surprise it is the preferred serialization syntax of the de facto messaging and clinical documentation standard in healthcare systems.  There are other light-weight alternatives (JSON), but it is still the case that XML is the most pervasive standard in this regard.  However, XML is challenged in its ability to specify the meaning of its content in a context-free way (i.e., semantic interoperability).  Viewing RDF content as a rendering of this meaning ( a function of the structured content ) and viewing an XML vocabulary as a special purpose syntax for RDF content is a very useful way to support (all in the same framework): 

  • structured data collection
  • standardized system-to-system messaging
  • automated deductive analysis and transformation
  • structural constraint validation.

In short, XML facilitates the separation of presentation from content and semantic-preserving RDF renderings of XML facilitate the separation of syntax from semantics.  

The diagram above demonstrates how this separation acts as the modern version of the tranditional boundary between transactional systems and their datawarehouse in relational database systems.  In this updated paradigm, however, XML processing is the framework for the former, RDF processing is the framework for the latter, and XSLT (or any similar transform algorithm) is the ETL framework.

In the end, I think it is safe to say that RDF and XSLT are very useful for facilitating semantic and syntactic automation in information systems.  I wish I had presented or written more on this aspect, but I did find some older documents on the topic along with the abstract of the presentation I gave with William Stelhorn during the first Semantic Technologies Conference in 2005:

p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica} span.s1 {letter-spacing: 0.0px}

The diagram above illustrates how a domain model (described in an RDF vocabulary) can be used by an automated component to generate semantic and syntactic schemas (OWL and RELAX-NG respectively) as well as GRDDL transforms (as XSLT) that faithfully render the structured content in a machine-understandable knowledge representation (RDF).  In this way, many of the data management tooling and infrastructure can be managed by tweaking declarative documentation (not programming code) and automatically generating the infrastructure compiled specifically for a particular domain.

The Challenges

I have mostly mentioned where things worked well.  There were certainly significant challenges to our project, but most of them were not technical in nature.  This seems to be a recurring theme in medical informaticis.  

The main technical challenge has to do with support in RDF databases or triplestores for an appreciable amount of write as well as read operations.  This shortcoming will be particularly emphasized given the emerging SPARQL 1.1 specifications having to do with write operations becoming W3C recommendations and their inevitable adoption.  Most contemporary relational schemas for RDF storage are highly-normalized star schemas where the there isn't a single fact table but rather (in the most advanced cases) this space is partitioned into separate tables and divided along the general, distinct kinds of statements you can have in an RDF graph: RDF statements where the predicate is rdf:type, statements where the object is an RDF literal, and statements where the object is an RDF URI.  So, it is more like a conjoined star schema, for a lack of a better term.  

This minimizes self-joins since the data space is partitioned amongst multiple tables and a query such as this:

SELECT ?MRN { ?PATIENT hasMedicalRecordNumber ?MRN }  

Would ony require (assuming we knew a priori that the range of the hasMedicalRecordNumber property is RDF literals) using the single table for RDF statements where the object is an RDF literal.  If an additional triple pattern is added that matches RDF statements where the object is a URI (or Blank Node), then this results in a join between the two tables rather than a self-join on a single massive table for all the RDF statements.

In such an architecture, there are separate lookup tables that map internal identifiers (integers perhaps) to their full lexical form as URIs or RDF literals for instance.  This enforces the re-use of terms so that space is not wasted if two RDF statemens refer to the same URI.  Internally, they will refer to the same row in the RDF term lookup table.  This is often called string interning.  

However, the downside to this kind of normalization, which is significantly optimized for usecases that primarily involve read access (i.e., datawarehouse or OLAP scenarios), does not bode well when updates are needed to be made to part of an RDF graph in a RDF dataset or the entire graph.  If an RDF statement is removed and it is the only one that references a particular RDF URI, that URI will need to be garbage collected or removed from the RDF term lookup table.  It is such optimizations for OLAP behavior that almost require that high volume updates to RDF content happen as massive ETL jobs where the entire RDF collection of patient record content is replaced rather than doing so one patient record graph at a time.  

This fundamental challenge is the same reason why (some time back) the content repository underlying SemanticDB switched from using RDF to manage the state of the repository (file modification dates, internet media types associated with artifacts, etc.) to using cached, in-memory, pre-parsed XML.  It is also the same reason why we didn't use this capability to allow modifications to patient record documents (via XForms data collection screens) to be immediately reflected into the underlying RDF dataset.  In both cases, writing to the RDF dataset became the prominent bottleneck as the RDF database was not able to support Online Transactional Processing (OLTP).

There was an additional semi-technical challenge that had (and still has) to do with a lack of a small, broad, uniform, upper ontology of clinical medicine.  It needs to be certainly much smaller than SNOMED-CT but certainly able to cover the same material, hence my description of it as an upper ontology.  There are at least 2 projects I know of regarding this:

The latter is my project and was motivated (from the beginning) by this exact problem.  The other problems are massive but not technical.  The two primary ones (in my opinion) are the lack of understanding of - on the one hand - how to concieve and develop a business strategy around open source, open communities, open standards, and open data that relies more on services, the competitive advantage of using emerging technologies where appropriate, and leverages the catalyzing power of web 2.0 / 3.0 technology and social media and - on the other hand - better communication of the merits of semantic web technologies in addressing the engineering and infrastructure challenges of healthcare information systems.  Most of the people who can benefit from better communication in this regard are significantly risk-averse to begin with (another problem in its own right).  

An example of this is that, in my opinion, the only new innovation that semantic web technologies brings to the table is the symbiotic combination of the architecture of the world wide web with knowledge representation.  The use of logic and ontologies to address semantic interoperability challenges predates the semantic web (and even predates Description Logic).  By being precise in describing this difference you can also be precise about decribing the value with very little risk of over stating it.  

Any problem domain where the meaning of data plays a major role will benefit from tranditional deductive database and expert systems (such as prolog or business rule systems respectively) as easily as it would from semantic web technologies.  However, a problem domain where linking data, identifying concepts and digital artifacts in a universal and re-usable way, and leveraging web-based infrastructure is a major factor will benefit from semantic web technologies in a way that it wouldn't from the tradtional alternatives.  This simplification of the value proposition message (for consumption by risk-averse laypeople) also helps to sharpen the distinctions between the markets that a business strategy can target as well as target the engineering problems these emergning technologies should (and should not) attempt to address.  A sharper, straightforward message is needed to break the political and generational barriers that retard the use of potentionally transformational technologies in this field that is a major contribution to the economic instability of this country.

Many of these technological opportunities transfer directly over for use in Patient Controlled Health Records (PCHR) systems.  I also think much of the risk aversion associated with the atmosphere I found myself in after leaving Fourthought (for instance) and generally in large institutions contributes to why the very evident opportunities in leveraging rich web application (“web 2.0”) and semantic web infrastructure (“web 3.0”) have not had as much penetration in healthcare information technology as one would expect.

My new job is as a Senior Research Associate with the Center for Clinical Investigation in the Case Western School of Medicine.  I will be managing Clinical and Translational Science Collaboration (CTSC) clinical, biomedical, and administrative informatics projects as well as designing and developing the informatics infrastructure that supports this.  Coupled with being a part-time P.h.D. student, I will still essentially be doing clinical research informatics (i.e., the use of informatics to facilitate biomedical and health research), however the focus will be on translational research: translating the findings in basic research more quickly and efficiently into medical practice and meaningful health outcomes: physical, mental, or social.  So, I imagine the domain will be closer to the biology end of the spectrum and there will be more of an explicit emphasis on collaboration.

I have thoroughly enjoyed my time working on such a challenging project, in such a world-renown intitution, and working under such a visionary mandate.  I was priviledged to be able to represent the Cleveland Clnic in the W3C through the various working groups developing standards relevant to the way in which we were leveraging semantic web technologies:

  • Semantic Web for Healthcare and Life Sciences Interest Group
  • Data Access Working Group
  • GRDDL Working Group 

If not for the exposure at CCF to the great challenges in medical informatics and equally great opportunities to address them, I would probably have never considered going back to seek a P.h.D in this field.  Although I will be working with a different instutition, it will still essentially be in the Univiersity Circle area and only about a 20 minute walk from where I was at the Cleveland Clinic.  I'm very proud of what we were able to do and I'm looking forward to the future.

SemanticDB: A CMS Methodology for the Enterprise Semantic Web

This is a heads up that this weekend I will be writing a full length article giving an architectural overview of SemanticDB, a Content Management System methodology (and implementation) we have been developing at the Cleveland Clinic Foundation over the last 4 years (since I've been there). I hope this may trigger more public dialog (which unfortunately has not been happening) about how we've been leveraging Semantic Web and document management (XML-related technologies) W3C standards to build a robust platform to facilitate all aspects of clinical research.

I believe most of our success with SemanticDB can be attributed to the strenghts of the standards being leveraged, the opensource tools (specifically: Python, 4Suite, RDFLib, FuXi, Paste, and FormsPlayer) which serve as primary infrastructure, and the enthusiasm of the relevant communities behind these open standards and opensource software tools. As such, it is only fair that these communities become aware of examples which demonstrate how this arena is slowly transitioning from the realm of pure research to practical problem solving. In addition, I hope it may contribute to the growing body of literature demonstrating concrete problems being solved with these technologies in a way that simply would not have been possible via legacy means.

Below is the abstract from a technological whitepaper that has been in clandestine distribution as we have sought to ramp up our efforts:

SemanticDB represents a methodology for building and maintaining a highly flexible content management system built around a centralized vocabulary.
This vocabulary incorporates a formal semantics (via a combination of an OWL ontology and a N3 ruleset) for heirarchical document composition and an abstract framework for modelling terms in a domain in a way that facilitates semi-automated use of native XML and RDF representations.

It relies on a very recent set of technologies (Semantic Web technologies) to automate certain aspects of data entry, structure, storage, display, and retrieval with minimal intervention by traditional database administrators and computer programmers. A SemanticDB instance is built around a domain model expressed using a Data Node Model. From the domain model various XML/RDF management components are generated. The Data Mason takes a domain model and generates XML schemas, formal ontologies, XSLT transforms, stored queries, XML templates, document definitions, etc.

The ScreenCompiler takes an abstract representation of a data entry form (with references to terms in a domain model) and generates a user interface in a particular host language (currently XForms)

Below is a core diagram of the methodology which facilitates semi-automated data management of XML & RDF content:

The Data Mason and automated XML/RDF Data Management

Watch this space..

Chimezie Ogbuji

via Copia