Triclops gets a facelift, new query management capabilities, and new APIs

by Chimezie Ogbuji

I recently had a need to manage a set of queries against an OWL2 EL biomedical ontology: the Foundational Model of Anatomy. I have an open source SPARQL service implementation that I had some thoughts about extending with support for managing queries. It’s called Triclops and is part of a collection of RDF libraries and tools I have been accumulating. The name is a reference to an initial attempt to build an RDF querying and navigation interface as part of the 4Suite repository back in the day (circa 2002).

This later evolved into a very rudimentary web interface that sat in front of the Oracle 11g and MySQL/SPARQL patient dataset that Cyc’s SKSI interacted with. This was part of an interface tailored to the task of identifying patient cohorts, known as the Semantic Research Assistant (SRA). A user could dispatch handwritten SPARQL queries, browse clickable results, or return them as CSV files. This capability was used only by informaticians familiar with the structure of the RDF dataset; most investigators used the SRA.

It also implemented a RESTful protocol for ticket-based querying that was used for stopping long-running SPARQL/MySQL queries; this is not currently documented. Around the time this was committed as an Apache-licensed Google Code library, layercake-python added core support for APIs that treat remote SPARQL services as local Graph objects, as well as general support for connecting to SPARQL services. This was based on Ivan Herman’s excellent SPARQL Endpoint interface to Python.
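For flavor, here is a minimal sketch of the kind of client-side usage this enables, following the pattern of Ivan Herman's library (now SPARQLWrapper); the endpoint URL and query are placeholders, and this shows the underlying library rather than layercake-python's exact Graph-oriented API:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and query, for illustration only.
endpoint = SPARQLWrapper("http://example.org/sparql")
endpoint.setQuery("""
    SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"])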

Triclops (as described in the wiki) can now be configured as a “Proxy SPARQL Endpoint”. It can be deployed as a lightweight query dispatch, management, and mediation kiosk for remote and local RDF datasets. The former capability (dispatching) was already in place; the latter (mediation) can be performed using FuXi’s recent capabilities in this regard.

Specifically, FuXi includes an rdflib Store that uses its sideways information passing (sip) strategies and its in-memory SPARQL algebra implementation as a general-purpose framework for semweb SPARQL (OWL2-RL/RIF/N3) entailment regimes. Queries are mediated over the SPARQL protocol using global schemas, captured as various kinds of semweb ontology artifacts (expressed in a simple Horn form), that describe and distinguish their predicates: those instantiated in a database (or factbase) and those derived via the semantic properties of these artifacts.
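Roughly, and reconstructed from memory of FuXi's documented usage (so the module paths, class name, and keyword arguments below should be treated as assumptions rather than gospel), the entailing store wraps a factbase graph like this:

# A rough sketch of FuXi's backward-chaining entailment store; names and
# signatures are from memory of FuXi's documentation and may not be exact.
from rdflib import Graph
from FuXi.Horn.HornRules import HornFromN3
from FuXi.SPARQL.BackwardChainingStore import TopDownSPARQLEntailingStore

factGraph = Graph().parse('facts.n3', format='n3')  # the factbase (placeholder file)
rules = HornFromN3('ontology-rules.n3')             # the global schema in Horn form

# Queries against this Graph are answered top-down, using sip strategies
# to derive solutions for predicates defined by the rules.
entailingStore = TopDownSPARQLEntailingStore(factGraph.store, factGraph, idb=rules)
targetGraph = Graph(entailingStore)
for row in targetGraph.query('SELECT ?s WHERE { ?s a <http://example.org/DerivedClass> }'):
    print(row)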

So the primary capability that remained was query management, and this recent itch was scratched over the holidays. I discovered that CodeMirror, a JavaScript library that can be used to create a relatively powerful editor interface for code, had excellent support for SPARQL. I integrated it into Triclops as an interface for managing SPARQL queries and their results. I have a running version of this at http://metacognition.info/sparql/queryMgr. Note that the service is liable to break at any point, as Webfaction kills off processes that use up a lot of CPU and I have yet to figure out how to configure it to restart the service when it dies in such a fashion.

The dataset this interface manages queries for is a semantic web of content comprising three of the primary ancient Chinese classical texts (the Analects, the Doctrine of the Mean, and the Tao Te Ching). I record the information in RDF because it is an intuitive knowledge representation to use in capturing provenance, exposition, and other editorial metadata. Below is a screen shot of the main page listing a handful of queries: their names, dates of last modification and last run, and the number of solutions in the most recent result.

Main SPARQL service page

Above the list is a syntax-highlighted text area for dispatching ad hoc SPARQL queries; this is where CodeMirror is integrated. If I click on the name of the query titled “Query for Analects and the Doctrine of the Mean english chapter text (Confucius)”, I go to a similar screen with another text area whose content corresponds to the text of the query (see the screen shot below).

Main SPARQL service page

From here, queries can be updated (by submitting updated CodeMirror content) or cloned (using the name field for the new copy). Alternatively, the results of previous queries can be rendered. This sends back a result document with an XSLT processing instruction that causes the browser to fetch a stylesheet and render the result document as XHTML on the client side.
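Concretely, the mechanism amounts to prepending an xml-stylesheet processing instruction to the SPARQL/XML result document. A minimal sketch (the stylesheet path and function are hypothetical, not Triclops' actual code):

# Hypothetical stylesheet location; the browser fetches it and renders XHTML.
PI = '<?xml-stylesheet type="text/xsl" href="/sparql/xslt/result-to-xhtml.xslt"?>'

def render_result(result_xml):
    # Insert the processing instruction after the XML declaration (if any).
    if result_xml.startswith('<?xml'):
        decl, _, body = result_xml.partition('?>')
        return decl + '?>\n' + PI + body
    return PI + '\n' + result_xml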

Finally, a query can be re-executed against a dataset, saving the results and updating the last-run information shown on the first screen (date and number of solutions). Results can also be saved or viewed as CSV using a different stylesheet against the result document.

The last capability added is a rudimentary template system in which any variable or text string in the query of the form ‘$ …. $’ is replaced with a provided string or URI. So, I can change the pick list value on the second row of the form controls to $searchExpression$ and type “water”. This produces a SPARQL query (visible with syntax highlighting via CodeMirror) that can be used as a template to dispatch queries against the dataset.
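A minimal sketch of this substitution mechanism (the query text and helper function are illustrative, not Triclops' actual implementation):

import re

# An illustrative template; $searchExpression$ is the placeholder variable.
TEMPLATE = """
SELECT ?chapter ?text WHERE {
  ?chapter <http://www.w3.org/2000/01/rdf-schema#comment> ?text
  FILTER(regex(?text, "$searchExpression$"))
}"""

def expand(template, bindings):
    # Replace each $name$ placeholder with the supplied string or URI.
    return re.sub(r'\$(\w+)\$', lambda m: bindings[m.group(1)], template)

print(expand(TEMPLATE, {'searchExpression': '[Ww]ater'}))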

In addition, solutions for a particular variable can be used for links, providing a small framework for configurable navigation workflows. For example, I can enter “[Ww]ater” in the field next to $searchExpression$, select classic from the pick list at the top of the Result navigation template area, pick “Assertions in a (named) RDF graph” from the next pick list, and enter the graphIRI variable in the subsequent text input.

Triggering this form submission will produce the result screen pictured below. As specified in the form, clicking any of the DBpedia links for the Doctrine of the Mean will invoke the query titled “Assertions in a (named) RDF graph”, shown below (with the graphIRI variable pre-populated with the corresponding URI):

SELECT DISTINCT ?s ?p ?o where {
    GRAPH ?graphIRI {
      ?s ?p ?o
    }
}

Main SPARQL service page

The result of such an action is shown in the screen shot. Alternatively, a different subsequent query can be used: “Statements about a resource”. The relationship between the schema of a dataset and the factbase can be navigated in a similar way, by picking the query titled “Classes in dataset” and making the following modifications: select “Instances of a class and graph that the statements are asserted in” from the middle pick list of the Result navigation template section, and enter ?class in the text field to the right of it. Selecting ‘Execute..’ and executing this query produces a clickable result set comprised of classes of resources; clicking any such link shows the instances of that class. Plausible shapes for these stored queries are sketched after the screen shot below.

Main SPARQL service page
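The stored queries named above might plausibly look like the following (these are reconstructions of their general shape, not the actual saved Triclops queries):

# Plausible shape of "Classes in dataset":
CLASSES_IN_DATASET = """
SELECT DISTINCT ?class WHERE {
  GRAPH ?g { [] a ?class }
}"""

# Plausible shape of "Instances of a class and graph that the statements
# are asserted in" (?class is the variable pre-populated via the template):
INSTANCES_OF_CLASS = """
SELECT DISTINCT ?instance ?graphIRI WHERE {
  GRAPH ?graphIRI { ?instance a ?class }
}"""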

This latter form of navigation seems well suited for exploring datasets for which there is either no schema information in the service or the schema is not well known to the investigator writing the queries.

In developing this interface, at least two architectural principles were reused from my SemanticDB development days: the use of XSLT on the client side to build rich, offloaded (X)HTML applications, and the use of the filesystem (rather than a relational database) for managing XML documents. The latter is particularly relevant where querying across the documents is a minor requirement or not a requirement at all. The former works by setting the processing instruction of a result document to refer to a dynamically generated XSLT document on the server.
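In the same spirit, a minimal sketch of the filesystem-as-document-store idea (the layout and names are assumptions for illustration, not Triclops' actual code):

from pathlib import Path

QUERY_DIR = Path('queries')  # hypothetical storage location

def save_query(name, text):
    # One file per query; no relational database involved.
    QUERY_DIR.mkdir(exist_ok=True)
    (QUERY_DIR / (name + '.rq')).write_text(text, encoding='utf-8')

def load_query(name):
    return (QUERY_DIR / (name + '.rq')).read_text(encoding='utf-8')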

The XSLT creates a row-distinguishing, tabular interface in which the links in certain columns trigger queries via a web API that takes various inputs, including: the variable in the current query whose solutions are ‘streamed’, a (subsequent) query specified by a function of the MD5 hash of its title, a variable in that query that is pre-populated with the corresponding solution, etc.:

../query=...&action=update&innerAction=execute&templateValue=...&valueType=uri&variable=..
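A sketch of how such a link might be assembled (the endpoint path is hypothetical; the parameter names are taken from the example above, and the query is identified by an MD5 hash of its title, as described):

import hashlib
from urllib.parse import urlencode

def navigation_link(query_title, variable, value):
    # Identify the subsequent query by the MD5 hash of its title.
    params = {
        'query': hashlib.md5(query_title.encode('utf-8')).hexdigest(),
        'action': 'update',
        'innerAction': 'execute',
        'templateValue': value,
        'valueType': 'uri',
        'variable': variable,
    }
    return '/sparql/queryMgr?' + urlencode(params)

print(navigation_link('Assertions in a (named) RDF graph',
                      'graphIRI', 'http://example.org/graph'))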

Eventually, the API should probably be made more RESTful and target the query, possibly leveraging some caching mechanism in the process. Perhaps it can even work in concert with the SPARQL 1.1 Graph Store HTTP Protocol.

IEEE IC Special Issue is Out


by Chimezie Ogbuji

Ogbuji, Chimezie; Gomadam, Karthik; Petrie, Charles
Case Western Reserve University

This paper appears in: IEEE Internet Computing
Issue Date: July-Aug. 2011
Volume: 15, Issue: 4
Pages: 10-13
ISSN: 1089-7801
Digital Object Identifier: 10.1109/MIC.2011.99
Date of Current Version: 30 June 2011
Sponsored by: IEEE Computer Society

Abstract

Contemporary Web-based architectures can help address the technological and architectural challenges inherent to modern personal health record (PHR) systems. Current research in the area of healthcare informatics has focused on incorporating Web-based technology for PHR systems' primary functions. This special issue presents work in this area of research.

 

I received my complimentary copy of the IEEE IC special issue on Personal Health Records that I was guest editor for. It turned out well in the end.

A Role for Semantic Web Technologies in Patient Record Data Collection

I found out today that the Linking Enterprise Data book is not only now available but also freely available online, as well as through other avenues (Springer and pre-order on Amazon):

Linking Enterprise Data is the application of Semantic Web architecture principles to real-world information management issues faced by commercial, not-for-profit and government enterprises. This book aims to provide practical approaches to addressing common information management issues by the application of Semantic Web and Linked Data research to production environments.

 

I wrote a chapter ("A Role for Semantic Web Technologies in Patient Record Data Collection") discussing the debate around SOAP-based web services and Representational State Transfer (REST), focusing on a specific, deployed use case that emphasizes the role of the Semantic Web, a simple Web application architecture that leverages declarative XML processing, and the needs of a workflow system for patient record data collection. It touches just a bit on the use of XForms to manage patient record content as special-purpose XML dialects for RDF graphs, which I mentioned in my last post, but it is mostly focused on how to use RDF to manage workflow state to orchestrate the collection of patient data.

Business Process Management Systems (BPMS) are a component of the stack of Web standards that comprise Service Oriented Architecture (SOA). Such systems are representative of the architectural framework of modern information systems built in an enterprise intranet and are in contrast to systems built for deployment on the larger World Wide Web. The REST architectural style is an emerging style for building loosely coupled systems based purely on the native HTTP protocol. It is a coordinated set of architectural constraints with a goal to minimize latency, maximize the independence and scalability of distributed components, and facilitate the use of intermediary processors. Within the development community for distributed, Web-based systems, there has been a debate regarding the merits of both approaches. In some cases, there are legitimate concerns about the differences in both architectural styles. In other cases, the contention seems to be based on concerns that are marginal at best.

In this chapter, we will attempt to contribute to this debate by focusing on a specific, deployed use case that emphasizes the role of the Semantic Web, a simple Web application architecture that leverages the use of declarative XML processing, and the needs of a workflow system. The use case involves orchestrating a work process associated with the data entry of structured patient record content into a research registry at the Cleveland Clinic’s Clinical Investigation department in the Heart and Vascular Institute.

Why XML-based web forms are an excellent platform for clinical data entry into RDF

Uche and I have written a bit about XForms on Copia. I've recently been motivated to better articulate why I think the use of XForms, Plain Old XML (POX), and GRDDL (or faithful renditions of RDF graphs, if you will) is a more robust web architecture for managing mutable RDF content for the purpose of research data management than thin-client and other alternative approaches.

Some time ago, I asked:

Are there examples of tools or architectures that demonstrate support for the Model View Controller (MVC) paradigm for data entry directly against RDF content? It seems to me that there is an inherent impedance mismatch between what is needed for an efficient, document-hosted, binding-oriented architecture for data entry and the amorphous nature of RDF, as well as the cost of using RDF querying as a mechanism for binding data to UI elements.

In my experience since 2006 as a software architect of web applications that use XForms to manage patient record documents as RDF graphs, I've come to appreciate that the 'CRUD problem' of RDF might have good protocol solutions being developed right now, but the question of whether there is anything more robust for forms-based data collection than declarative, auto-generated XForms that manage RDF datasets is a more difficult one, I think.

My personal opinion is that the nature of the abstract syntax of an RDF graph (as opposed to the tree underlying the XML infoset), its impact on binding RDF resources to widgets, and the ubiquitous use of warehouse relational schemas as infrastructure for RDF datasets in databases will always be, at larger volumes, an insurmountable performance impediment for alternative solutions that aim to be more robust than using XForms to manage an XML collection on a filesystem as a faithful rendition of an RDF dataset.

RDF/SQL databases are normalized and optimized more for read than for write - with asymptotic consequences for write operations. An architecture that directly manages very large numbers (millions) of RDF triples will be faced with this challenge. The OLTP/OLAP divide in legacy relational architecture is analogous to the use of XML and RDF in those respective roles, and is an intuitive architectural style for using knowledge representation in content management systems. GRDDL and its notion of faithful renditions can be used to manage this divide as infrastructure for contemporary content management systems.

For the purpose of read-only browsing, however, RDF lenses and facets are a useful alternative. But if you need support for controlled vocabularies, heavily-dependent constraint validation, declarative and auto-generated templating, and large amounts of concurrent data entry over large amounts of RDF data, the rich web architecture backplane is very robust, in my experience and that of others.

I had to dig into the way-back machine to find the XML technologies presentation John and I were supposed to give in December of 2007 (right before my life changed forever). I need to bug him to put copies of those slides, about using XForms with Schematron for real-time validation as a component of data entry, on his weblog.

IEEE Internet Computing Special Issue: Web Technology and Architecture for Personal Health Records

IEEE Internet Computing is soliciting original articles describing the development of, relevant trends in, and challenges of incorporating contemporary Web-based technology for the primary functions of Personal Health Records (PHRs). Of particular interest are PHR systems that capture healthcare data entered by patients themselves: Personally Controlled Health Records (PCHRs). If you are interested, please email any of the guest editors: me (chimezie@gmail.com / cut@case.edu), Karthik Gomadam (karthik@knoesis.org), or Charles Petrie (petrie@stanford.edu).

Please email the guest editors a brief description of the article you plan to submit by 15 October 2010. Final submissions are due on 1 November 2010.

The main functional categories of interest are information collection, sharing, exchange, and management.

Appropriate topics of interest include:

  • Web-based, structured data collection in PHR systems
  • implementations of access-control policies and healthcare data sharing
  • distributed, identity-based authentication methods
  • digital signature and encryption techniques
  • Web portal architecture’s general components and capabilities as the basis for a PHR system
  • architectural paradigms regarding connectivity to other healthcare information producers and consumers
  • data models for PHR systems
  • distributed data subscription and publishing protocols
  • successful Web-based applications for chronic disease and medication management
  • health applications for PHR systems on mobile devices
  • privacy and security issues
  • HIPAA and its implications for adopting cloud computing for PHR applications
  • semantics for PHR interoperability and applications

All submissions must be original manuscripts of fewer than 5,000 words, focused on Internet technologies and implementations. All manuscripts are subject to peer review on both technical merit and relevance to IC’s international readership — primarily system and software design engineers. We do not accept white papers, and we discourage strictly theoretical or mathematical papers.

To submit a manuscript, please log on to Manuscript Central to create or access an account, which you can use to log on to IC's Author Center and upload your submission.

Ontological Definitions for an Information Resource

I've somehow found myself wrapped up in this dialog about information resources, their representations, and their relation to RDF. Perhaps it's the budding philosopher in me who finds the problem interesting. There seems to be some controversy about what an appropriate definition for an information resource is. I'm a big fan of not reinventing wheels if they have already been built, tested, and deployed.

The Architecture of the World-Wide Web says:

The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as "information resources."

I know of at least four very well-organized upper ontologies with readily-available OWL representations: SUMO, Cyc, Basic Formal Ontology, and DOLCE. These are the cream of the crop in my opinion (and in the opinion of many others who are more informed about this sort of thing). So, let us spend some time investigating where the poorly-defined Web Architecture term fits in these ontologies. This exercise is mostly meant for the purpose of reference. Every well-organized upper ontology will typically have a singular, topmost term which covers everything. This would be (for the most part) the equivalent of owl:Thing and rdf:Resource.

Suggested Upper Merged Ontology (SUMO)

SUMO has a term called "FactualText" which seems appropriate. Its definition states:

The class of Texts that purport to reveal facts about the world. Such texts are often known as information or as non-fiction. Note that something can be an instance of FactualText, even if it is wholly inaccurate. Whether something is a FactualText is determined by the beliefs of the agent creating the text.

The term FactualText has the following URI (at least in the OWL export I downloaded):

http://reliant.teknowledge.com/DAML/SUMO.owl#FactualText

Climbing up the subsumption tree we have the following ancestral path:

  • Text: "A LinguisticExpression or set of LinguisticExpressions that perform a specific function related to Communication, e.g. express a discourse about a particular topic, and that are inscribed in a CorpuscularObject by Humans."

The term Text has multiple parents (LinguisticExpression and Artifact). Following the path upwards from the first parent we have:

  • LinguisticExpression: "This is the subclass of ContentBearingPhysical which are language-related. Note that this Class encompasses both Language and the elements of Languages, e.g. Words."
  • ContentBearingPhysical: "Any Object or Process that expresses content. This covers Objects that contain a Proposition, such as a book, as well as ManualSignLanguage, which may similarly contain a Proposition."
  • Physical "An entity that has a location in space-time. Note that locations are themselves understood to have a location in space-time."
  • Entity "The universal class of individuals. This is the root node of the ontology."

Following the path upwards from the second parent we have:

  • Artifact: "A CorpuscularObject that is the product of a Making."
  • CorpuscularObject: "A SelfConnectedObject whose parts have properties that are not shared by the whole."
  • SelfConnectedObject: "A SelfConnectedObject is any Object that does not consist of two or more disconnected parts."
  • Object: "Corresponds roughly to the class of ordinary objects. Examples include normal physical objects, geographical regions, and locations of Processes"

Objects are a specialization of Physical, so from here we come to the common Entity ancestor.

Cyc

Cyc has a term called InformationBearingThing:

A collection of spatially-localized individuals, including various actions and events as well as physical objects. Each instance of information-bearing thing (or IBT ) is an item that contains information (for an agent who knows how to interpret it). Examples: a copy of the novel Moby Dick; a signal buoy; a photograph; an elevator sign in Braille; a map ...

The Cyc URI for this term is:

http://sw.cyc.com/2006/07/27/cyc/InformationBearingThing

This term has three ancestors: Container-Underspecified, SpatialThing-Localized, and InformationStore. The latter seems most relevant, so we'll traverse its ancestry first:

  • InformationStore : "A specialization of partially intangible individual. Each instance of store of information is a tangible or intangible, concrete or abstract repository of information. The information stored in an information store is stored there as a consequence of the actions of one or more agents."
  • PartiallyIntangibleIndividual : "A specialization of both individual and partially intangible thing. Each instance of partially intangible individual is an individual that has at least some intangible (i.e. immaterial) component. The instance might be partly tangible (e.g. a copy of a book) and thus be a composite tangible and intangible thing, or it might be fully intangible (e.g. a number or an agreement) and thus be an instance of intangible individual object. "

From here, there are two ancestral paths, so we'll leave it at that (we already have the essence of the definition).

Going back to InformationBearingThing, below is the ancestral path starting from Container-Underspecified:

  • Container-Underspecified : "The collection of objects, tangible or otherwise, which are typically conceptualized by human beings for purposes of common-sense reasoning as containers. Thus, container underspecified includes not only the set of all physical containers, like boxes and suitcases, but metaphoric containers as well"
  • Area: "The collection of regions/areas, tangible or otherwise, which are typically conceptualized by human beings for purposes of common-sense reasoning as spatial regions."
  • Location-Underspecified: a definition similar to Area's
  • Thing: "thing is the universal collection : the collection which, by definition, contains everything there is. Every thing in the Cyc ontology -- every individual (of any kind), every set, and every type of thing -- is an instance of (see Isa) thing"

Basic Formal Ontology (BFO)

BFO is (as the name suggests) very basic and is meant to be an axiomatic implementation of the philosophy of realism. As such, its closest term for an information resource is very broad: Continuant.

Definition: An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.

However, I happen to be quite familiar with an extension of BFO called the Ontology for Biomedical Investigations (OBI), which has an appropriate term (derived from Continuant): information_content_entity.

The URI for this term is:

http://obi.sourceforge.net/ontology/OBI.owl#OBI_342

Traversing the (short) ancestral path, we have the following definitions:

  • OBI_295 : "An information entity is a dependent_continuant which conveys meaning and can be documented and communicated."
  • OBI_321 : "generically_dependent_continuant"
  • Continuant : "An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts."
  • Entity

The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)

DOLCE's closest term for an information resource is information-object:

Information objects are social objects. They are realized by some entity. They are ordered (expressed according to) by some system for information encoding. Consequently, they are dependent from an encoding as well as from a concrete realization. They can express a description (the ontological equivalent of a meaning/conceptualization), can be about any entity, and can be interpreted by an agent. From a communication perspective, an information object can play the role of "message". From a semiotic perspective, it plays the role of "expression".

The URI for this term is:

http://www.loa-cnr.it/ontologies/ExtendedDnS.owl#information-object

Traversing the ancestral path we have:

  • non-agentive-social-object: "A social object that is not agentive in the sense of adopting a plan or being acted by some physical agent. See 'agentive-social-object' for more detail."
  • social-object: "A catch-all class for entities from the social world. It includes agentive and non-agentive socially-constructed objects: descriptions, concepts, figures, collections, information objects. It could be equivalent to 'non-physical object', but we leave the possibility open of 'private' non-physical objects."
  • non-physical-object : "Formerly known as description. A unitary endurant with no mass (non-physical), generically constantly depending on some agent, on some communication act, and indirectly on some agent participating in that act. Both descriptions (in the now current sense) and concepts are non-physical objects."
  • non-physical-endurant: "An endurant with no mass, generically constantly depending on some agent. Non-physical endurants can have physical constituents (e.g. in the case of members of a collection)."
  • endurant : "The main characteristic of endurants is that all of them are independent essential wholes. This does not mean that the corresponding property (being an endurant) carries proper unity, since there is no common unity criterion for endurants. Endurants can 'genuinely' change in time, in the sense that the very same endurant as a whole can have incompatible properties at different times."
  • particular: "AKA 'entity'. Any individual in the DOLCE domain of discourse. The extensional coverage of DOLCE is as large as possible, since it ranges on 'possibilia', i.e. all possible individuals that can be postulated by means of DOLCE axioms. Possibilia include physical objects, substances, processes, qualities, conceptual regions, non-physical objects, collections and even arbitrary sums of objects."

Discussion

The definitions are (in true philosophical form) quite long-winded. However, the point I'm trying to make is:

  • A lot of pain has gone into defining these terms
  • Each of these ontologies is very richly-axiomatized (for supporting inference)
  • Each of these ontologies is available in OWL/RDF

Furthermore, these ontologies were specifically designed to be domain-independent and thus support inference across domains. So, it makes sense to start here for a decent (axiomatized) definition. What is interesting is that SUMO and BFO are the only upper ontologies which treat information resources (or their equivalent term) as strictly 'physical' things. Cyc's definition includes both tangible and intangible things, while DOLCE's definition is strictly intangible (a non-physical-endurant).

Some food for thought

Chimezie Ogbuji

via Copia

Why Web Architecture Shouldn't Dictate Meaning

This is a very brief demonstration motivated by some principled arguments I've been making over the last week or so regarding Web Architecture dictates which are ill-conceived and may do more damage to the Semantic Web than good. A more fully articulated argument is sketched out in "HTTP URIs are not Without Expense" and "Semiotics of RDF Signs"; in particular, the argument that most of the httpRange-14 dialog confuses dereference with denotation. I've touched on some of this before.

Anywho, the URI I've minted for myself is

http://metacognition.info/profile/webwho.xrdf#chime

When you 'dereference' it, the server responds with:

chimezie@otherland:~/workspace/Ontologies$ curl -I http://metacognition.info/profile/webwho.xrdf#chime
HTTP/1.1 200 OK
Date: Thu, 30 Aug 2007 06:29:12 GMT
Server: Apache/2.2.3 (Debian) DAV/2 SVN/1.4.2 mod_python/3.2.10 Python/2.4.4 PHP/4.4.4-8+etch1 proxy_html/2.5 mod_ssl/2.2.3 OpenSSL/0.9.8c mod_perl/2.0.2 Perl/v5.8.8
Last-Modified: Mon, 23 Apr 2007 03:09:22 GMT
Content-Length: 6342
Via: 1.1 www.metacognition.info
Expires: Thu, 30 Aug 2007 07:28:26 GMT
Age: 47
Content-Type: application/rdf+xml

According to the TAG dictate, a 'web' agent can assume it refers to a document (yes, apparently an RDF document composed the blog you are reading).

Update: Bijan points out that my example is mistaken. The TAG dictate only allows the assumption to be made about the URI which goes across the wire (the URI with the fragment stripped off). The RDF (FOAF) doesn't make any assertions about this (stripped) URI being a foaf:Person. This is technically correct; however, the concern I was highlighting still holds (albeit it is more likely to confuse folks who are already confused about dereference and denotation). The assumption still gets in the way of 'proper' interpretation. Consider if I had used the FOAF graph URL as the URL for me. Under which mantra would this be taboo? Furthermore, if I wanted to avoid confusing unintelligent agents such as the one above, which URI scheme would I be likely to use? Hmmm...

Okay, a more sophisticated semantic web agent parses the RDF and understands (via the referential mechanics of model theory) that the URI denotes a foaf:Person (much more reasonable). This agent is also much better equipped to glean 'meaning' from the model-theoretic statements made about me instead of jumping to binary conclusions.

So I ask you, which agent is hampered by a dictate that has everything to do with misplaced pragmatics and nothing to do with semantics? Until we understand that the 'Semantic Web' is not 'Web-based Semantics', Jim Hendler's question ("Where are all the agents?") will continue to go unanswered, and Tim Bray's challenge will never be fulfilled.

A little tongue-in-cheek, but I hope you get the point.

Chimezie Ogbuji

via Copia