Versa: Pattern Matching (article)

My Versa article (Versa: Path-Based RDF Query Language) is up but I've recently been tinkering with Emeka and haven't been able to post about it. I wanted Emeka functional so people could familiarize themselves with the language by example instead of specification deciphering. Simple saying ".help" in a channel where he is located (#swig,#swhack,#4suite,#foaf) should be sufficient. Please, if his commands interfere with an existing bot's, please let me know.

The article is based (in part) on an earlier paper I wrote on Versa. I reworked it to focus more on the use patterns in common with other existing query languages (SPARQL primarily) to make the point that RDF querying is truely not in it's infancy any more. I also wanted to use it as a spring board to suggest some possible enhancements to an already (IMHO) expressive syntax (mostly burrowed from N3)

My hope is to spark some conversation across the opposing ends as well as get people familiar with the language for the betterment of RDF and RDF querying.

See an exchange between Dave Beckett and myself on the #swig scratchpad.

[Uche Ogbuji]

via Copia

Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I wanted to do some Web scraping to do recently while doing the day job thingie. As with most people who do what I do these days it's a common task, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best known Python wrapper uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In the most recent task I, however, had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) to use the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara bindery objects.

I just checked this code in as an Amara demo, As an example of its usage, here is some Python script I wrote to list all the mp3s links from a given Web page, (for easy download with wget):

from tidy import tidy_bind_url #needs Amara demo
url = ""
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara even in this simple example is how I was am to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See for more sample code.

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia

Python community: Transolution, py lib, encutils, pyxsldoc, PDIS and Picket

In a comment on "We need more solid guidelines for i18n in OSS projects", Fredrik Corneliusson mentioned Transolution, "[a translation] suite project I'm working on [with] a XLIFF editor and a Translation Memory. It's written in Python and we need all the help and testers we can get." I browsed the project site, and it seems to me quite comprehensive and well thought-out. It's heavy on XLIFF, which is pretty heavy stuff in itself, but it does links to projects that allow exchange between .po and XLIFF files. It's certainly great to see Python at the vanguard of XML-based i18n.

I found a couple of new tools from the py lib via Grig Gheorghiu's Weblog entry "py lib gems: greenlets and py.xml", which is good reading for those interested in Python and XML. The py lib is just a bundle of useful Python library add-ons. Greg mentioned sample for one of the modules, Armin Rigo's greenlets. Greenlets are a "spin-off" from Stackless Python, and thus provide some very interesting code that manipulates Python flow of control in order to support lightweight concurrency (microthreads), and also proper coroutines. I've already been pecking about the edges of what's possible with semi-coroutines, and it has always been clear to me that what Python needs in order to really bring streaming XML processing to life is full coroutine support (which seems to be on the way for Python 2.5. While we wait for full coroutines in Python, Armin gets us started with a greenlets demo that turns PyExpat's callback model into a generator that returns elements as they're parsed. Grig posts this snippet as "" in his entry.

Grig also touches on Holger Krekel's py.xml, a module for generating XML (and HTML). py.xml is not unlike JAXML, which I covered in "Three More For XML Output". These all seem to project back to Greg Stein's early proposal for an XML generation tool that certainly worth a look for as ingrown as possible into Python's syntax.

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update accommodates changes in what has recently been announced as CherryPy 2.1 beta. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, which is a library for dealing with the encodings of files obtained over HTTP, including XML files. He does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for this older Python cookbook recipe by Paul Prescod and this newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.

I recently discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones, which offer python support. This includes an XML parser based on PyExpat and an XPath implementation based on elementtree.

[Uche Ogbuji]

via Copia

RDF IRC Agent - Emeka

I've recently been working on an IRC bot (based on Sean Palmer's phenny) called Emeka which is meant as a tool for demonstrating Versa and other related RDF facilities. Currently, it supports two functions:

  • .query <abritrary URI> " .. Versa Query .. "
  • .query-deep <arbitrary URI> steps " .. Versa Query .. "

The first, causes Emeka to parse the RDF document (associated with the given URI) as RDF/XML and then as N3 (if the first attempt fails). He then evaluates the submitted Versa Query against the Graph and responds with a result. The second function does the same with the exception that it recursively loads RDF graphs (following rdfs:seeAlso statements) N times, where N is the second argument. This is useful for extracting FOAF communities from a starting document (which was my original motivation for this).

By default Emeka has the following namespace prefixes bound:

daml,rdf,rdfs,owl,xsd,log (n3's log), dc,rss,foaf

Emeka is a work in progress and is currently idling in #swhack and #4suite (as we speak and #foaf,#swig eventually). Some ideas for other services available from this bot:

  • augmenting it's default namespace mapping
  • stored queries (for example: query for retrieving the latest rss:item in a feed)
  • Rule invokation (through FuXi's prospective-query function)
  • Interactive question and example demonstration of Versa function(s)
  • More sophisticated interaction with RSS feeds (for web page cataloging)

Other suggestions are welcome

see #swhack logs for an example

Chimezie Ogbuji

via Copia

FuXi: FOAFed and DOAPed

I just upgraded and repackaged FuXi (v0.7): added some extra prints in the Versa extension function, added a 'test' directory to the source tree with an appropriate example of how FuXi could be used to make a prospective query against OWL rules and a source graph, created a DOAP instance for FuXi, a FOAF instance for myself, created a permanent home for FuXi, and added FuXi to the SemanticWebDOAPBulletinBoard WiKi. This was primarily motivated by Libby's post on Semantic Web Applications and Demos. I thought it would be perfect forum for FuXi. I probably need to move it into a CVS repository when I can find time.

Below is the output of running the test case:

loaded model with owl/rdfs minimal rules and initial test facts
executing Versa query:
                               'distribute(list(test:Lion,test:Don_Giovanni),\'. - rdf:type -> *\')',
extracted 35 rules from urn:uuid:MinimalRules
extracted 16 facts from source model (scoped by urn:uuid:InitialFacts) into interpreter. Executing...
inferred 15 statements in 0.0526609420776 seconds
time to add inferred statements (to scope: urn:uuid:InitialFacts): 0.000159025192261
compiled prospective query, executing (over scope: urn:uuid:InitialFacts)
time to execute prospective query:  0.0024938583374
time to remove inferred statements:  0.0132219791412

This particular test extracts inferred statements from an initial graph using a functional subset of the original owl-rules.n3 and executes the Versa query (which essentially asks: what classes do the resources test:Lion and test:Don_Giovanni belong to?) to demonstrate OWL reasoning (class membership extension by owl:oneOf and owl:unionOf, specifically).

see previous posts on FuXi/N3/4Suite:

Chimezie Ogbuji

via Copia

OPML, XOXO, RDF and more on outlining for note-taking

There has been a lot of good comment on my earlier entry on outline formats in XML. I've been pretty busy the past week or so, but I'd better get my thoughts down before they deliquesce.

Bob DuCharme pointed me at Micah's article which includes mention of XOXO. Henri Sivonen asked what I think of it.

Taking the name "outlining format" literally, it's actually just fine. As Micah says:

Some people might feel warmer and fuzzier with elements named outline, topic, item, and so on, or with elements in a freshly minted namespace, but microformats can still claim the semantic high ground, even when reusing XHTML. In the above, the parts of an outline are ordered lists and list items, exactly as the XHTML element names say.

The problem is that what made me start looking into outlining formats was the fact that I'd heard from others that these make such a great format for personal information space organization, and XOXO is just about useless in that regard.

Along that vector, I wonder what a pure outline format is useful for, anyway? I can't remember having ever needed a stand-alone outline document separate from what I'm outlining. If I'm writing a presentation or a long article, I'd prefer to have the table of contents or presentation outline section generated from the titles and structure of the full work. Sure, XOXO might be suitable for such a generated outline, but my exploration is really about hand editing.

In short I think XOXO is just fine for outlining, and yet I can't imagine when I'd ever use it. As others have mentioned, and as I suspected, the entire idea of outlining formats for general note-taking is a big stretch. Danny Ayers mentioned in a comment on the earlier format that for some attraction to OPML is a matter of neat outlining UIs. I've always been conservative in adopting UIs. I use emacs plus the command line for most of my coding, and after trying out a half dozen blog posting tool for posting to Copia, I ended up writing an e-mail-to-post gateway so that I can enter text (markdown) into a UI I'm already familiar with, Evolution's e-mail composition window.

As I said in the earlier entry, full-blown XHTML 2.0 makes more sense than an outlining format for managing a personal information space, and yet it seems too weak to me for this purpose. The weakness, as Danny points out, is semantic. If everything in my personal information space is just a para or an anchor or a list, I'll quickly get lost. As followers of Copia know, my brain is a rat trap of wandering thoughts, and I'm a poster child for the need for clearly expressed semantics.

As an RDF pioneer, I'm happy to use ideas from RDF, but I do not want to type RDF/XML by hand. I've always argued, as Danny Ayers hinted, that RDF should strive hard to by syntax agnostic, especially because RDF/XML is awful syntax. I agree with him that GRDDL is a good way to help rescue XHTML microformats from their semantic soup, and I think this is a better approach than trying to shovel all the metadata into the XHTML header (Dan Brickley mentions this possibility, but I wonder whether he tends to prefer it to GRDDL). GRDDL has a natural draw for me since I've been working with and writing tools for the XML+XSLT=RDF approach for about four years. But when I'm using markup for markup (e.g. in a personal information space) I'd rather have semantic transparency fitting comfortably within the markup, rather than dangling off it as an afterthought. In a nutshell, I want to use the better markup design of:


rather than the kludge of:

<ul class="to-do">

I think there's little excuse for why we don't have the best of both worlds. People should be able to enjoy the relative semantic cleanliness of RDF, but within the simplest constructs of markup, without having to endure the added layer of abstraction of RDF. That added layer of abstraction should only be for those aggregating models. The fact that people would have to pay the "RDF tax" every time they start to scribble in some markup explains why so many markup types dislike RDF. I'm not sure I've found as clear a case for this point than this discussion of extended uses for outlining formats.

Microformats are generally a semantic mess, from what I've seen of them. They do best when they just borrow the semantics of existing formats, as XOXO does, but I think they're not the solution to lightweight-syntax +clean-semantics that the GRDDL pioneers hope. GRDDL has too much work to do in bringing the rigor of RDF to microformats, and this work should be part of the format itself, not something like GRDDL. I think the missed opportunity here is that XML schema systems cling so stubbornly to syntax-only-syntax. As I've been exploring in articles such as "Use data dictionary links for XML and Web services schemata" (I have a more in-depth look at this in the upcoming Thinking XML article), one can make almost all the gains of RDF by putting the work into the XML schema, rather than heaping the abstraction directly into the XML format. And the schema is where such sophistication belongs.

But back to outlining and personal information spaces, I've tried the personal Wiki approach, and it doesn't work for me. Again Danny nails it: Wiki nodes and links are untyped. This is actually similar to the problem that I have with XHTML, but Wikis are even more of a semantic shambles. In XHTML there is at least a bit of an escape with class="foo". The difficulty of navigating and managing Wikis increases at a much greater rate than their information content, in my experience. My Akara project was in effect an attempt at a more semantically transparent Wiki, but since I wrote that paper I've had almost no time for Akara, unfortunately. I do plan to make it the showcase application for my vision of 4Suite 2.0, and in doing so I have an ally Luis Miguel Morillas, so there is still hope for Akara, perhaps even more so if I am able to build on Rhizome, which might help eliminate some wheel reinvention.

[Uche Ogbuji]

via Copia

Identifying BNodes via RDF Query

Sorry couldn't help but commence round 3 (I believe it is) of Versa vs SPARQL. In all honesty, however, this is has to do more with RDF itself than it with either query language. It is primarily motivated by a very informative and insightful take (by Benjamin Nowack) on the problems regarding identifying BNodes uniquely in a query. His final conclusion (as I understood it) is that although the idea of identifying BNodes directly by URI seems counter-inituitive to the very nature of BNodes (anonymous resources) it is a practical necessity (one that I have had to use more often than not with Versa and caused him to have to venture outside the boundaries of the SPARQL specification for a solution). This is especially the case when you don't have much identifying metadata associated with the BNode in question (where if you did you could rely on inferencing - explicit or otherwise).

Well, ironically, the reason why this issue never occured to me is that in Versa, you refer to resources (for identification purposes) by URI regardless of whether they are blank nodes or not. I guess I would interpet this functionality as leaving it up to the author of the query to understand the exact nature of BNode URI's (that they are transient,possibly inconsistent, etc.)

Chimezie Ogbuji

via Copia

Lifting XSLT into application domain with extension functions?

The conversation continues about the boundary between traditional presentation languages such as XSLT and XML toolkits in traditional application languages such as Python. See earlier installments "Sane template-like output for Amara" and "Why allow template-like output for Amara?". M. David Peterson, the mind behind XSLT blog responded to the latter post, and his response deserves a post of its own:

This is an excellent post that brings out some important points. I do wonder however if the solution can be solved utilizing a base of underlying functions that can then be implemented via 2 mechanisms:

  • The XSLT (potentially 2.0 given 1.0 support already exists) engine which will properly invoke the necessary sequence of functions to process and transform the input XML.
  • Amara, which will invoke a similar combination of functions, however in a way more in line with the Python architecture and programmers mentality.

Being a novice Python programmer, at best!, makes it difficult to suggest this solution with a whole lot of confidence... as such, I won't and simply leave it as something that, if possible, might be worth consideration as it will give leave the door open for the reverse situation, e.g. someone like myself who sees the value Amara and Python offer but given my background would prefer to work with XML via XSLT (2.0 preferably of course :D) except in cases where its obvious the platform offers a much simpler and elegant solution. In cases like this (which is sounds as if this particular situation qualifies) I am definitely more interested in the fastest way in and out of a process than I am in firing up the transformation object to perform something that is rediculously easy using the platform API.

This is an interesting good point, especially because it is something that we already tried to do in 4Suite. When we first started writing the 4Suite repository, we built XSLT scripting into its very DNA. You can do just about anything to the repository through a set of specialized XSLT extensions. You can write entire Web sites (including CRUD function) in XSLT, and indeed, there are quite a few examples of such sites, including:

These are all 100% XSLT sites, implemented using 4Suite's repository as the HTTP server, request processing engine (via XSLT script), database back end, etc. You don't even need to know any Python to write such sites (Wendell Piez proved this when he created The Sonneteer). You just need to know XSLT and the 4Suite repository API as XSLT functions. These APIs mirror the Python APIs (and to some extent the command line and HTTP based API). Mike Olson insisted that this be a core architectural goal, and we got most of the way there. It served the originally narrow needs quite well.

We all saw the limitations of this approach early, but to some extent succumbed to the it-all-looks-like-a-nail fallacy. It was so easy for me, in particular, to churn out 4Suite repo based sites that I used the approach even where it wasn't really the best one. I'm definitely through with all that, though. I've been looking to do a better job of picking the best tool for each little task. This is why I've been so interested in CherryPy lately. I'd rather use a specialized tool for HTTP protocol handling than the just-good-enough version in 4Suite that is specialized for handing off requests to XSLT scripts. Lately, I've found myself advising people against building end-to-end sites in 4Suite. This is not to strand those who currently take advantage of this approach, but to minimize the legacy load as I try to steer the project towards a framework with better separation of concerns.

When I look at how such a framework would feel, I think of my own emerging choices in putting together such sites:

  • 4Suite repository for storing and metadata processing of XML documents
  • Amara for general purpose XML processing in Python code
  • XSLT for presentation logic
  • CherryPy for protocol serving

My own thoughts for a 4Suite 2.0 are pretty much grounded in this thinking. I want to separate the core libraries (basically XML and RDF) from the repository, so that people can use these separately. I also want to de-emphasize the protocol server of the repository, except as a low-level REST API. That having been said, I'm far from the only 4Suite developer and all I can do is throw in my own vote on the matter.

But back to the point raised, the idea of parallel functions in XSLT and another language, say Python, is reasonable, as long as you don't fall into the trap of trying to wrench XSLT too far off its moorings. I think that XSLT is a presentation language, and that it should always be a presentation language. With all the big additions in XSLT 2.0, the new version is still really just a presentation language. Based on my own experience I'd warn people against trying to strain too hard against this basic fact.

And remember that if you make XSLT a full model logic language, that you have not solved the problem of portability. After all, if you depend on extensions, you're pretty much by definition undermining portability. And if XSLT becomes just another competitor to Java, ECMAscript, Python, etc., then you haven't gained any portability benefit, since some people will prefer to use different languages from XSLT in their model logic, and XSLT code is obviously not portable to these other languages. Sure you have platform portability, but you already have this with these other languages as well.

[Uche Ogbuji]

via Copia

No religious conversion to XML

Sylvain Hellegouarch's comments always seem to require another full blog entry for further discussion (that's a good thing: he asks good questions). In response to "Why support template-like output in Amara?", he said:

Regarding the point of bringing developers who dislike XML into the X-technology world, I think it's useful but I hope you won't try too hard. Whatever tools you could bring to them and how hard you may try, if they have a bad feeling about XML & co., you won't be able to change their mind.

That's not really what I meant. I don't go for religious conversions. The issue is not that there are people out there who will never have anything to do with XML. That's fine. The issue is that some people hate XML but at the same time have no choice but to use XML. You hear a lot of comments such as "I hate that stupid XML, but my job requires me to use it". XML is everywhere (I certainly agree it's overused) and most developers cannot avoid XML even if they dislike it. The idea is to give them sound XML tools that feel right in Python, so that they don't shoot themselves in the foot with kludgery such as parsing with regex, or even the infamous:

print "<foo>", spam, "</foo>"

Aside: if anyone who has to deal with XML is not aware of all the myriad ways that the above will bite you in the arse, they should really read "Proper XML Output in Python". The idea is that tools like Amara don't all of a sudden make people like XML, but rather it makes XML safer and easier for people who hate it. Of course it also makes things easier for people who like it, like me.

I "categorise" people who don't like XML into three sections :

  • Those who never tried and simply judge-before-you-taste.
  • Those who tried XML but didn't use it for the right purpose. Some people only see XML as a language used by some dark J2SE application servers for their configuration file. They don't realise that XML is also a meta language that has brought some other fantastic tools to store, describe, transform, validate, query data.
  • Those who simply react to the hype XML had had in the last 5 years. A bit like when you here during months that a movie you haven't seen at the cinema is fantastic and that you should really watch it. You get so tired of hearing it that you don't want to watch it.

Nice classification. I think the good and the bad of XML is that it has brought so many areas of interest together. As I say in this Intel developer journal article:

XML was a development of the document management community: a way to put all their hard-won architectures on the wide, enticing Web, but when it burst on to the scene, the headlines proclaimed a new king of general-purpose data formats. Programmers pounced on XML to replace their frequent and expensive invention of one-off data formats, and the specialized parsers that handled these formats. Web architects seized XML as a suitable way to define content so that presentation changes could be made cleanly and easily. Database management experts adopted XML as a way to exchange data sets between systems as part of integration tasks. Supply-chain and business interchange professionals embraced XML as an inexpensive and flexible way to encode electronic transactions. When so many disparate disciplines find themselves around the same table, something special is bound to happen.

XML itself is not very special. It represents some refinement, some insight, and many important tradeoffs, but precious little innovation. It has, however, become one of the most important new developments in information systems in part because of the fact that so many groups have come to work with XML, but also because it has focused people's attention to important new ways of thinking about application development.

The reason XML is overhyped is because we live in the age of hype. People don't know how to say "X is useful" any more. They're rather say "it's the gods' solution to every plague released by Pandora" or they say "It's the plaything of the guardians of every circle of Hell". XML is neither, of course. It's useful because it happens to be one data format that is respectable in a wide variety of applications. But like any compromise solution, it is bound to have some weaknesses in each specific area.

[Uche Ogbuji]

via Copia

Why support template-like output in Amara?

When I posted "Sane template-like output for Amara", Sylvain Hellegouarch asked:

I feel like you are on about to re-write XSLT using Python only and I wonder why.

I mean for instance, one of the main reason I'm using XSLT in the first place is that whatever programming language I am using, I don't need to learn a new templating language over and over again. I can simply extend my knowledge of only one : XSLT, and then become better in that specific one.

It also really helps me making a difference between the presentation and the logic upon data since my the logic resides within the programming language itself, not the templating language.

Therefore, although your code looks great, I don't feel confident using it since it would go against what I just said above.

This is a question well worth further discussion.

The first thing is that I have always liked XSLT 1.0, and I still do. Nothing I'm doing in Amara is intended to replace XSLT 1.0 for me. However, there are several things to consider. Firstly, Python programmers seem to have a deep (and generally inexplicable) antipathy towards XML technology. I often hear Pythoneers saying things like "I hate XML because it's over-hyped and used all the time, even when it's not the best choice". Well, this applies to every popular software technology from SQL to Python itself. Nothing new in the case of XML. But never mind all that: Pythoneers don't like XML, and it very often drives them to abuse XML (perhaps if something you dislike is ubiquitous, using it poorly seems a small measure of revenge?) Anyway, someone who doesn't like XML is never going to even look slant-wise at XSLT. One reason for making it easy to do in Amara the sorts of things Sylvain and I may prefer to do in XSLT is to accommodate other preferences.

The second thing to consider is that even if you do like XSLT (1.0 or 2.0), there are places where it's best, and places where it's not. I think the cleanest flow for Web output pipelines can be described by the following diagram. I call this the rendering pipeline pattern (no not "pattern" as in big-hype-ooh-this-is-crazy-deep, but rather "pattern" as in I-do-this-sorta-thing-often-and-here's-what-it-looks-like).

Aside: I've been trying out 2.0 beta, and I'm not sure what's up with the wrong-spelling squiggly on the p word. I also wish I didn't have to use screen capture as a poor man's export to PNG.

Separation of model from view should be in addition to separation of content form presentation, and this flow covers both. For my purposes the source data can be a DBMS, flat files, or whatever. The output content is usually an XML format specially designed to describe the information to be presented (the view) in the idiom of the problem domain. The rendered output is usually HTML, RSS, PDF, etc.

In this case, I would use some such as the proposed output API for Amara in the first arrow. Python code would handle the model logic processing and Amara would provide for convenient generation of the output XML. If some of the source data is in XML, then Amara would help further by making that XML a cinch to parse and process. I would use XSLT for the second arrow, whether on the server or, when it is feasible, through the browser.

The summary is that XSLT is not suitable for all uses that result in XML output. In particular it is not generally suitable for model logic processing. Therefore, it is useful for the tools you do use for model logic processing to have good XML APIs, and one can borrow the best bits of XSLT for this purpose without seeking to completely replace XSLT. That's all I'm looking to do.

[Uche Ogbuji]

via Copia