Why XML-based web forms are an excellent platform for clinical data entry into RDF

Uche and I have written a bit about XForms here on Copia. I've recently been motivated to better articulate why I think the combination of XForms, Plain Old XML (POX), and GRDDL (or faithful renditions of RDF graphs, if you will) is a more robust web architecture for managing mutable RDF content for research data management than other approaches, thin clients for instance.

Some time ago, I asked:

Are there examples of tools or architectures that demonstrate support for the Model View Controller (MVC) paradigm for data entry directly against RDF content? It seems to me that there is an inherent impedance mismatch between what is needed for an efficient, document-hosted, binding-oriented architecture for data entry and the amorphous nature of RDF, as well as the cost of using RDF querying as a mechanism for binding data to UI elements.

In my experience since 2006 as a software architect of web applications that use XForms to manage patient record documents as RDF graphs, I've come to appreciate that the 'CRUD problem' of RDF may have good protocol solutions in development right now, but whether there is anything more robust for forms-based data collection than declarative, auto-generated XForms that manage RDF datasets is, I think, a more difficult question.

My personal opinion is that the nature of the abstract syntax of an RDF graph (as opposed to the tree underlying the XML infoset), its impact on binding RDF resources to widgets, and the ubiquitous use of warehouse relational schemas as infrastructure for RDF datasets in databases will always be a performance impediment at larger volumes. I doubt any alternative will prove more robust than using XForms to manage an XML collection on a filesystem as a faithful rendition of an RDF dataset.

RDF/SQL databases are normalized and optimized more for read than for write, with asymptotic consequences for write operations. An architecture that directly manages very large numbers (millions) of RDF triples will face this challenge. The OLTP / OLAP divide in legacy relational architecture is analogous to the use of XML and RDF in those respective roles, and it is an intuitive architectural style for using knowledge representation in content management systems. GRDDL and its notion of faithful renditions can be used to manage this divide as infrastructure for contemporary content management systems.
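To make the divide concrete, here is a minimal sketch of the XML-write / RDF-read split, under stated assumptions: the file names ('patient-record.xml' for the write-side document, 'record-to-rdf.xsl' for the GRDDL transform) are hypothetical, and it uses lxml and RDFLib rather than any particular stack discussed here:

import lxml.etree as etree
from rdflib import Graph

# Write side: the record is authored and updated as a plain XML document.
doc = etree.parse('patient-record.xml')                    # hypothetical document

# GRDDL side: an XSLT transform renders the document as RDF/XML --
# a "faithful rendition" of the same content as a graph.
transform = etree.XSLT(etree.parse('record-to-rdf.xsl'))   # hypothetical transform
rdf_xml = etree.tostring(transform(doc))

# Read side: the rendition is loaded into an RDF store for querying.
g = Graph()
g.parse(data=rdf_xml, format='xml')
for row in g.query('SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5'):
    print(row)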

For read-only browsing, however, RDF lenses and facets are a useful alternative. But if you need support for controlled vocabularies, heavily dependent constraint validation, declarative and auto-generated templating, and large amounts of concurrent data entry over large amounts of RDF data, the rich web architecture backplane has proven very robust, in my experience and that of others.

I had to dig into the Wayback Machine to find the XML technologies presentation John and I were supposed to give in December of 2007 (right before my life changed forever). I need to bug him to put copies of those slides, on using XForms with Schematron for real-time validation as a component of data entry, on his weblog.

A Content Repository API for Rich, Semantic Web Applications?

[by Chimezie Ogbuji]

I've been working with roll-your-own content repositories long enough to know that open standards are long overdue.

The slides for my Semantic Technology Conference 2007 session are up: "Tools for the Next Generation of CMS: XML, RDF, & GRDDL" (OpenOffice) and (PowerPoint).

This afternoon, I merged documentation of the 4Suite repository from old bits (and some new) into a Wiki that I hope to contribute to (every now and then).
I think there is plenty of mature supporting material upon which a canon of best practices for XML/RDF CMSes can be written, with normative dependencies on:

  • GRDDL
  • XProc
  • Architecture of the World Wide Web, Volume One
  • URI RFCs
  • Rich Web Application Backplane
  • XML / XML Base / XML Infoset
  • RDDL
  • XHTML 1.0
  • SPARQL / Versa (RDF querying)
  • XPath 2.0 (JSR 283 restriction) with 1.0 'fallback'
  • HTTP 1.0/1.1, ACL-based HTTP Basic / Digest authentication, and a convention for Web-based XSLT invocation
  • Document/graph-level ACL granularity

The things that are missing:

  • RDF equivalent of DOM Level 3 (transactional, named graphs, connection management, triple patterns, ... ) with events.
  • A mature RIF (there is always SWRL, Notation 3, and DLP) as a framework for SW expert systems (and sentient resource management)
  • A RESTful service description to complement the current WSDL/SOAP one

For a RESTful service description, RDF Forms can be employed to describe transport semantics (to help with agent autonomy), or a mapping to the Atom Publishing Protocol (and thus a subset of GData) can be written.

In my session, I emphasized how closely JSR 283 overlaps with the 4Suite Repository API.

The delta between them mostly has to do with RDF, additional XML processing specifications (XUpdate, XInclude, etc.), ACL-based HTTP authentication (Basic, and sessions), HTTP/XSLT bindings, and other miscellaneous bells and whistles.

Chimezie Ogbuji

via Copia

Musings of a Semantic / Rich Web Architect: What's Next?

I'm writing this on my flight back from XTech 2007 in Paris, France, which gives me a decent block of time to set down some thoughts and recent developments. It's the only significant time I've had in a while to do any writing.
My family

Between raising a large family, software development / evangelism, and blogging, I can only afford to do two of these. So, blogging loses out consistently.

My paper (XML-powered Exhibit: A Case Study of JSON and XML Coexistence) is now online. I'll be writing a follow-up blog post on how http://planetatom.net demonstrates some of what is discussed in that paper. I ran into some technical difficulties with projecting from Ubuntu, but the paper covers everything in detail. The slides are here.

My blog todo list has become ridiculously long. I've been heads-down on a handful of open source projects (mostly semantic web related) when I'm not focusing on work-related software development.
Luckily, there has been a very healthy intersection between the open source projects I work on and what I do at work (and have been doing non-stop for about four years). In a few cases, I've spun these 'mini-projects' off under an umbrella project I've been working on called python-dlp. It is meant (in the end) to be a toolkit for semantic web hackers (such as myself) who want to get their hands dirty and have an aptitude for Python. There is more information on the main python-dlp page (linked above).

Some of the other things I've been working on I'd prefer to submit to appropriate peer-reviewed outlets, considering the amount of time I've put into them. First, I really would like to do a proper write-up on the map/reduce approach for evaluating SPARQL algebra expressions and on the inner mechanics of Ivan Herman's sparql-p evaluation algorithm. The latter is one of those hidden gems I've been closely familiar with for some time, and I would very much like to examine it in a peer-reviewed paper, especially if Ivan is interested in doing so in tandem =).

Since joining the W3C DAWG, I've had much more time to get even more familiar with the formal semantics of the algebra and with how to implement it efficiently on top of sparql-p, overcoming the original limitation on the kinds of patterns it can resolve.
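For a flavor of what evaluating patterns in a map/reduce style can look like, here is a toy of my own for illustration; it is in no way sparql-p itself or the algebra's formal semantics, just two triple patterns matched over a tiny in-memory triple list and their solutions joined:

# Toy illustration only: map each triple pattern over the triples,
# then reduce by joining compatible variable bindings.

def match(pattern, triple):
    """Map step: one triple against one pattern; variables start with '?'."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith('?'):
            bindings[p] = t
        elif p != t:
            return None
    return bindings

def join(left, right):
    """Reduce step: merge two solutions when shared variables agree."""
    merged = dict(left)
    for var, val in right.items():
        if merged.get(var, val) != val:
            return None
        merged[var] = val
    return merged

triples = [('bob', 'knows', 'alice'), ('alice', 'age', '34'),
           ('bob', 'age', '41')]
pat1, pat2 = ('?x', 'knows', '?y'), ('?y', 'age', '?age')

sols1 = [b for b in (match(pat1, t) for t in triples) if b is not None]
sols2 = [b for b in (match(pat2, t) for t in triples) if b is not None]

solutions = []
for a in sols1:
    for b in sols2:
        merged = join(a, b)
        if merged is not None:
            solutions.append(merged)
print(solutions)   # [{'?x': 'bob', '?y': 'alice', '?age': '34'}]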

I was also hoping to release, and talk a bit about, a SPARQL server implementation I wrote with CherryPy / 4Suite / RDFLib, for those who may find it useful as a quick and dirty way to contribute to the growing number of SPARQL endpoints out there. A few folks in irc:///freenode.net/redfoot (where the RDFLib developers hang out) have expressed interest, but I just haven't found the time to 'shrink-wrap' what I have so far.
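For anyone curious what such an endpoint involves, here is a minimal sketch in the same spirit; it is not the shelved code itself, it uses present-day CherryPy and RDFLib APIs, and 'data.rdf' is a placeholder seed file:

import cherrypy
from rdflib import Graph

class SparqlEndpoint(object):
    def __init__(self, graph):
        self.graph = graph

    @cherrypy.expose
    def sparql(self, query=None):
        # The SPARQL protocol passes the query as a 'query' parameter.
        if query is None:
            return 'Supply a SPARQL query in the "query" parameter.'
        results = self.graph.query(query)
        cherrypy.response.headers['Content-Type'] = 'application/sparql-results+xml'
        return results.serialize(format='xml')

if __name__ == '__main__':
    g = Graph()
    g.parse('data.rdf')                          # placeholder seed data
    cherrypy.quickstart(SparqlEndpoint(g), '/')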

On a different (non-sem-web) note, I spoke some with Mark Birbeck at XTech 2007 about my interest in working on a 4Suite / FormsPlayer demonstration. I've spent the better part of three years working with FormsPlayer as a client-side platform for XML-driven applications served from a 4Suite repository, and I've found the combination quite powerful. FormsPlayer (and XForms 1.1 specifically) is really the icing on the cake that takes an XML / RDF content management system like the 4Suite repository and turns it into a complete platform for deploying next-generation rich web applications.

The combination is a perfect realization of the Rich Web Application Backplane (a recurring theme in my last two presentations / papers), and it is very much worth noting that some of the challenges / requirements I've been able to address with this methodology simply cannot be reproduced in any other approach: neither vanilla DHTML, .NET, J2EE, Ruby on Rails, Django, nor Jackrabbit. The same is probably the case with Silverlight and Apollo.

In particular, when it comes to declarative generation of user interfaces, I have yet to find a more complete approach than via XForms.

Mark Birbeck's presentation on Skimming is a good read (the slides / paper are not up yet) for those not quite familiar with the architectural merits of this larger methodology. In his presentation, however, eXist was used as the XML store, and it struck me that you could do much more with 4Suite instead. In particular, as a CMS with native support for RDF as well as XML, it opens up additional avenues. Consider extending Skimming by leveraging the SPARQL protocol as an additional mode of expressive communication beyond 'vanilla' RESTful operations on XML documents.
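As a sketch of what that extra mode looks like from the client's point of view: the same store that answers GET and PUT for XML documents can also answer SPARQL protocol queries over HTTP. The endpoint URL and the class URI below are hypothetical:

import urllib.parse
import urllib.request

endpoint = 'http://localhost:8080/sparql'      # hypothetical endpoint
query = ('SELECT ?doc WHERE '
         '{ ?doc a <http://example.org/ClinicalDocument> } LIMIT 10')

# A SPARQL protocol query is just an HTTP GET with a 'query' parameter.
url = endpoint + '?' + urllib.parse.urlencode({'query': query})
req = urllib.request.Request(
    url, headers={'Accept': 'application/sparql-results+xml'})
print(urllib.request.urlopen(req).read().decode('utf-8'))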

These are very exciting times, as the value proposition of rich web (I much prefer this term over the much-beleaguered 'Web 2.0') and semantic web applications has, in my estimation, fully transitioned from vacuous academic musing to something concretely demonstrable. This value proposition is still not being communicated as well as it could be, but bundled demos can bridge this gap significantly, in my opinion; much more so than literature alone.

This is one of the reasons why I've been more passionate about doing much less writing / blogging and more hands-on hacking (if you will). The original thought (early this year) was that I would have plenty to write about toward the middle of the year, and that time spent discussing the ongoing work would be premature. As it happens, things turned out exactly that way.

There is a lesson to be learned from how the Joost project progressed to where it is. The approach of talking about deployed / tested / running code has worked perfectly for them. I don't recall much public dialog about that particular effort until very recently, and now they have running code doing unprecedented things and the opportunity (I'm guessing) to switch gears to do more evangelism with a much more effective 'wow' factor.

Speaking of wow, I must say that of all the sessions at XTech 2007, the Joost session was the most impressive. The number of architectures they bridged, the list of demonstrable value propositions, the slick design, the incredibly agile and visionary use of the most appropriate technology in each case: an absolutely stunning achievement.

The fact that they did this all while remembering their roots (open standards, open source, open communities) leaves me with a deep sense of respect for all those involved in the project. I hope this becomes a much larger trend. Intellectual-property paranoia and cloak-and-dagger competitive edge are things of the past in today's software problem-solving landscape. It is a ridiculously outdated mindset, and I hope those who can effect real change in this regard (those higher up in their respective org charts than the enthusiastic hackers) are paying close attention. Oh boy. I'm about to launch into a rant, so I think I'll leave it at that.

The short of it is that I'm hoping (very soon) to switch gears from heads-down design / development / testing to much more targeted write-ups, evangelism, and such. The starting point for me will be the Semantic Technology Conference in San Jose. If the above topics are of interest to you, I strongly suggest you attend my colleague Dr. Chris Pierce's session on SemanticDB (the flagship XML & RDF CMS we've been working on at the Clinic as a basis for computerized patient records) as well as my session on why we need to pave a path to a new generation of XML / RDF CMSes, with a few suggestions on how to go about paving it. They are complementary sessions.

Jackrabbit architecture

JSR 170 is a start in the right direction, but the work we've been doing with the 4Suite repository for some time leaves me with the strong, intuitive impression that CMSes with a natural (and standardized) synthesis with XML processing are only half the step toward eradicating the stronghold that monolithic technology stacks have over those of us with 'enterprise' requirements; requirements that can truly only be met by the newly emerging architectural patterns of semantic / rich web applications. This stronghold can only be eradicated by addressing the absence of a coherent landscape with peer-reviewed standards. Dr. Macro has an incredibly visionary series of write-ups on XML CMSes that paints a comprehensive picture of some best practices in this regard.

However (as with JSR 170), there is no reason why there couldn't be a bridge to, or some form of synthesis with, RDF processing within the confines of a CMS.

There is no good reason why I shouldn't be able to implement an application against an abstract API for document and knowledge management, irrespective of how that API is implemented (this is very much aligned with the larger goal of JSR 170). There is no reason why the 4Suite repository should be the only available infrastructure supporting both XML and RDF processing in (standardized) synthesis.

I should be able to 'hot-swap' RDFLib with Jena or Redland, 4Suite XML with Saxon / libxml / etc., and the 4Suite repository with any implementation of a standard API for synchronized XML / RDF content management. The value of setting a foundation in this arena extends to virtually any domain in which a CMS is a necessary first component.
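A sketch of the kind of abstract contract I mean, with hypothetical names (any real standard would of course define this far more carefully):

from abc import ABC, abstractmethod

class ContentRepository(ABC):
    """Synchronized XML + RDF content management, backend-agnostic."""

    @abstractmethod
    def get_document(self, path):
        """Fetch an XML document (4Suite XML, Saxon, libxml, ... underneath)."""

    @abstractmethod
    def put_document(self, path, xml_bytes):
        """Store a document; the repository keeps its RDF rendition in sync."""

    @abstractmethod
    def query(self, sparql):
        """Query the synchronized graph (RDFLib, Jena, Redland, ... underneath)."""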

Until such a time, I will continue to start with the 4Suite repository / RDFLib / FormsPlayer as a platform for semantic / rich web applications. However, I'm hoping (with my presentation in San Jose) to paint a picture of this vacuum, with the intent of contributing toward enough of a critical mass to (perhaps) start putting together some standards to that end.

Chimezie Ogbuji

via Copia

Amara 1.2rc1

4Suite has been bumped to 1.0.2 with some important bug fixes. I also pushed Amara a step closer to 1.2 with a 1.2rc1 release. I'll make it 1.2 final some time this week, and then it's on to some pretty big architectural changes for 2.0. All test reports are welcome, especially from Web server users. Jeremy might have figured out a workaround for the multiple-interpreter issue discussed in "multiple interpreters and extension modules". That should fix the remaining known problems with mod_python.

[Uche Ogbuji]

via Copia

Amara 1.2 goes alpha, and other developments

First of all, 4Suite went 1.0 rather quietly, because the day-job schedule has left room for very little besides quiet releases. It's probably just as well, because by common standards 4Suite has been 1.0 grade for years. Under any less conservative version-numbering scheme it would be 4Suite 3.0 by now.

I'm pushing Amara to 1.2 (a more typical progression of version numbers in that case), and after a developers-only alpha, we've released alpha 1.2a2 publicly, but quietly. As I've hinted before, I have a lot of ideas for Amara post-1.2. The next major branch will be a full rewrite, probably to be released as Amara 2.0. Anyway, see the draft of the 1.2 full release announcement.

I also put together a quick start recipe for Amara on Ubuntu, and Luis Miguel Morillas has one for Windows users in Spanish. He says he'll be translating it to English soon, and when he does, I'm sure he'll link it from his "Amara Installers for Windows Users" page.

[Uche Ogbuji]

via Copia

Quick grab of XHTML metadata

I recently needed to quickly scrape the metadata from XHTML Web pages, so I kicked up the following code:

import amara

XHTML1_NS = u'http://www.w3.org/1999/xhtml'
PREFIXES = { u'xh': XHTML1_NS }

def get_xhtml_metadata(source):
    md = {}
    # Process each child of the XHTML head element as it is parsed
    for node in amara.pushbind(source, u'/xh:html/xh:head/*', prefixes=PREFIXES):
        if node.localName == u'title':
            md[u'title'] = unicode(node)
        elif node.localName == u'link':
            # Gather all attributes (href, rel, type, ...) of the link
            linkinfo = dict([ (attr.name, unicode(attr))
                              for attr in node.xml_xpath(u'@*') ])
            md.setdefault(u'links', []).append(linkinfo)
        elif node.xml_xpath(u'self::xh:meta[@name]'):
            # Record each named meta element as a name -> content entry
            md[node.name] = unicode(node.content)
    return md

if __name__ == "__main__":
    import sys, pprint
    source = sys.argv[1]
    pprint.pprint(get_xhtml_metadata(source))

So, for example, scraping Planet XMLhack:

$ python xhtml-metadata.py http://planet.xmlhack.com/
{u'links': [{u'href': u'planet.css',
             u'media': u'screen',
             u'rel': u'stylesheet',
             u'title': u'Default',
             u'type': u'text/css'},
            {u'href': u'/index.rdf',
             u'rel': u'alternate',
             u'title': u'RSS',
             u'type': u'application/rss+xml'}],
 u'title': u'Planet XMLhack: Aggregated weblogs from XML hackers and commentators'}

[Uche Ogbuji]

via Copia

Amara trimxml: an XML reporting tool

For the past few months in my day job (consulting for Sun Microsystems) I've been working on what you could call a really big (and hairy) enterprise mashup. I'm in charge of the kit that actually does the mashing-up: an XML pipeline that drives merging, processing, and correction of data streams. There are a lot of very intricately intersecting business rules, and without the ability to make very quick ad-hoc reports from arbitrary data streams, there is no way we could get it all sorted out given our aggressive deadlines.

This project benefits greatly from a side task I had sitting on my hard drive, which I've since polished and worked into the Amara 1.1.9 release. It's a command-line tool called trimxml, which is basically a reporting tool for XML. You just point it at some XML data source and give it an XSLT pattern for the bits of interest, and optionally some XPath to tune the report and the display. It's designed to read only as much of the file as needed, which helps with performance; in the project discussed above, the XML files of interest range from 3-100MB.

Just to provide a taste, using Ovidiu Predescu's old DocBook example, you could get the title as follows:

trimxml http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title

Since you know there's just one title you care about, you can make sure trimxml stops looking after it finds it:

trimxml -c 1 http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title

-c is a count of results, and you can of course set it to values other than 1.

You can get all titles in the document, regardless of location:

trimxml http://xslt-process.sourceforge.net/docbook-example.xml title

Or just the titles that contain the string "DocBook":

trimxml http://xslt-process.sourceforge.net/docbook-example.xml title "contains(., 'DocBook')"

The extra argument is a filtering XPath expression. Only nodes that satisfy that condition are reported.

By default each entire matching node is reported, so you get output such as the full <title>...</title> element, tags and all. You can specify something different to display for each match using the -d flag. For example, to print just the first 10 characters of each title, and not the title tags themselves, use:

trimxml -d "substring(., 0, 10)" http://xslt-process.sourceforge.net/docbook-example.xml title

There are other options and features, and of course you can use the tool on local files as well as Web-based files.

In another useful development in the 4Suite/Amara world, we now have a Wiki.

With 4Suite, Amara, WSGI.xml, Bright Content and the day job, I have no idea when I'll be able to get back to working on Akara, so I finally set up some Wikis for 4Suite.org. The main starting point is:

http://notes.4suite.org/

Some other useful starting points are:

http://notes.4suite.org/AmaraXmlToolkit
http://notes.4suite.org/WsgiXml

As a bit of an extra anti-vandalism measure, I have set the above three entry pages to be editable only by 4Suite developers. [...] Of course, you can edit and add other pages in the usual Wiki fashion. You might want to start with http://notes.4suite.org/4SuiteFaq, which is a collaborative addendum to the official FAQ.

[Uche Ogbuji]

via Copia

Amara en Español

My open-source development time has been focused on pushing 4Suite XML to 1.0 (and we're on the very final leg of that journey). I'm still putting a bit of time into Amara, but I should have even more time for it soon, and I have many ideas for what to do with that time.

Others have been up to fun stuff with Amara as well, and none more so, it seems, than Spanish speakers. Luis Miguel Morillas has been putting Amara through its paces in his LivingPyXML project. César Cárdenas Desales has contributed a nice intro, "Procesamiento fácil de XML con Python y Amara":

Although the Python standard library includes tools and modules for processing XML with SAX and DOM, many programmers have felt there could be simpler ways to work with XML. Amara is a set of tools that make processing XML with Python easier. This manual gives a brief introduction to using Amara for those tasks.

Yep. That was pretty much the entire idea.

Original link (not as up-to-date): "Procesamiento fácil de XML con Python y Amara"

[Uche Ogbuji]

via Copia

Tagging meets hierarchies: XBELicious

The indefatigable John L. Clark recently announced another very useful effort: the start of a system for managing your del.icio.us bookmarks as XBEL files. Of course, not everyone may be as keen on XBEL as I am, but even if you aren't, there is reason for more general interest in the project. It uses a very sensible set of heuristics for mapping tagged metadata to hierarchical metadata. del.icio.us is all Web 2.0-ish and thus uses tagging for organization; XBEL is all XML-ish and thus uses hierarchy for the same. I've long wanted to document simple, common-sense rules for mapping one scheme to the other, and John's approach is very similar to sketches I had in my mind. Read section 5 ("Templates") of the XBELicious Installation and User's Guide for an overview. Here is a key snippet:

For example, if your XBEL template has a hierarchy of folders like "Computers → linux → news" and you have a bookmark tagged with all three of these tags, then it will be placed under the "news" folder because it has tags corresponding to each level in this hierarchy. Note, however, that this bookmark will not be placed in either of the two higher directories, because it fits best in the news category. A bookmark tagged with "Computers" and "news" would only be placed under "Computers" because it doesn't have the "linux" tag, and a bookmark tagged with "linux" and "news" would not be stored in any of these three folders.
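Distilled into code, the placement rule reads roughly as below. This is my own illustrative rendering of the quoted heuristic, not XBELicious source; the folder names are the ones from the example:

def deepest_folder(tags, hierarchy):
    """Walk one folder path from root to leaf and return the deepest
    folder whose entire ancestry is covered by the bookmark's tags."""
    placed = None
    path = []
    for folder in hierarchy:
        path.append(folder)
        if all(tag in tags for tag in path):
            placed = folder
        else:
            break
    return placed

hierarchy = ['Computers', 'linux', 'news']
print(deepest_folder({'Computers', 'linux', 'news'}, hierarchy))  # news
print(deepest_folder({'Computers', 'news'}, hierarchy))           # Computers
print(deepest_folder({'linux', 'news'}, hierarchy))               # None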

XBELicious is a work in progress, but worthy work for a variety of reasons. I hope I have some time to lend a hand soon.

[Uche Ogbuji]

via Copia