Amara equivalents of Mike Kay's XSLT 2.0, XQuery examples

Since seeing Mike Kay's presentation at XTech 2005 I've been meaning to write up some Amara equivalents to the examples in the paper, "Comparing XSLT and XQuery". Here they are.

This is not meant to be an advocacy piece, but rather a set of useful examples. I think the Amara examples tend to be easier to follow for typical programmers (although they also expose some things I'd like to improve), but with XSLT and XQuery you get cleaner declarative semantics, and cross-language support.

It is by no means always true that an XSLT stylesheet (whether 1.0 or 2.0) is longer than the equivalent in XQuery. Consider the simple task: create a copy of a document that is identical to the original except that all NOTE attributes are omitted. Here is an XSLT stylesheet that does the job. It's a simple variation on the standard identity template that forms part of every XSLT developer's repertoire:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@* except @NOTE"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

In XQuery, lacking an apply-templates instruction and built-in template rules, the recursive descent has to be programmed by hand:

declare function local:copy($node as element()) {
  element {node-name($node)} {
    (@* except @NOTE,
    for $c in child::node
    return typeswitch($c) 
      case $e as element() return local:copy($a)
      case $t as text() return $t
      case $c as comment() return $c
      case $p as processing-instruction return $p
  }
};

local:copy(/*)

Here is Amara code to do the same thing:

def ident_except_note(doc):
    for elem in doc.xml_xpath(u'//*[@NOTE]'):
        del elem.NOTE
    print doc.xml()

Later on in the paper:

...nearly every FLWOR expression has a direct equivalent in XSLT. For example, to take a query from the XMark benchmark:

for    $b in doc("auction.xml")/site/regions//item
let    $k := $b/name
order by $k
return <item name="{$k}">{ $b/location } </item>

is equivalent to the XSLT code:

<xsl:for-each select="doc('auction.xml')/site/regions//item">
  <xsl:sort select="name"/>
  <item name="{name}"
     <xsl:value-of select="location"/>
  </item>
</xsl:for-each>

In Amara:

def sort_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    items = doc.xml_xpath(u'/site/regions//item')
    items.sort()
    for item in items:
        newdoc.xml_append(
            newdoc.xml_element(u'item', content=item)
        )
    newdoc.xml()

This is the first of a couple of examples from XMark. To understand the examples more fully you might want to browse the paper, "The XML Benchmark Project". This was the first I'd heard of XMark, and it seems a pretty useful benchmarking test case, except that it's very heavy on records-like XML (not much on prosy, narrative documents with mixed content, significant element order, and the like). As, such I think it could only ever be a sliver of one half of any comprehensive benchmarking framework.

I think the main thing this makes me wonder about Amara is whether there is any way to make the element creation API a bit simpler, but that's not a new point for me to ponder, and if I can think of anything nicer, I'll work on it post 1.0.

Kay's paper next takes on more complex example from XMark: "Q9: List the names of persons an the names of items they bought in Europe". In database terms this is a joins across person, closed_auction and item element sets. In XQuery:

for $p in doc("auction.xml")/site/people/person
let $a := 
   for $t in doc("auction.xml")/site/closed_auctions/closed_auction
   let $n := for $t2 in doc("auction.xml")/site/regions/europe/item
                       where  $t/itemref/@item = $t2/@id
                       return $t2
       where $p/@id = $t/buyer/@person
       return <item> {$n/name} </item>
return <person name="{$p/name}">{ $a }</person>

Mike Kay's XSLT 2.0 equivalent.

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each 
        select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n" 
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()

I think the central loop in this case is much clearer as a Python list comprehension than in either the XQuery or XSLT 2.0 case, but I think Amara suffers a bit from the less literal element creation syntax, and for the need to "cast" to Unicode. I would like to lay out cases where casts from bound XML structures to Unicode make sense, so I can get user feedback and implement accordingly. Kay's final example is as follows.

The following code, for example, replaces the text see [Kay, 93] with see Kay93.

<xsl:analyze-string select="$input" regex="\[(.*),(.*)\]">
<xsl:matching-substring>
  <citation>
    <author><xsl:value-of select="regex-group(1)"/></author>
    <year><xsl:value-of select="regex-group(2)"/></year>
  </citation>
</xsl:matching-substring>
<xsl:non-matching-substring>
  <xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>

The only way of achieving this transformation using XQuery 1.0 is to write some fairly convoluted recursive functions.

Here is the Amara version:

import re
PATTERN = re.compile(r'[(.*),(.*)]')
def repl_func(m):
    citation = doc.xml_element(u'item')
    citation.xml_append(doc.xml_element(u'author', content=m.group (1)))
    citation.xml_append(doc.xml_element(u'year', content=m.group (2)))
    return citation.xml(omitXmlDeclaration=u'yes')
text = u'see [Kay, 93]'
print PATTERN.subn(repl_func, text)

I think this is very smooth, with the only possible rough patch again being the output generation syntax.

I should mention that Amara's output syntax isn't really bad. It's just verbose because of its Python idiom. XQuery and XSLT have the advantage that you can pretty much write XML in-line into the code (the templating approach), whereas Python's syntax doesn't allow for this. There has been a lot of discussion of more literal XML template syntax for Python and other languages, but I tend to think it's not worth it, even considering that it would simplify the XML generation syntax of tools such as Amara. Maybe it would be nice to have a lightweight templating system that allows you to write XSLT template chunks in-line with Amara code for data processing, but then, as with most such templating systems, you run into issues of poor model/presentation separation. Clearly this is a matter for much more pondering.

[Uche Ogbuji]

via Copia

Good times in Amsterdam

I already posted my notes about the XTech conference itself, but there was more to the trip than that. I had a wonderful time in a wonderful city. Edd is not much for nocturnes and other night-time events, and I think this is the way it should be. After attending sessions from 9 through 5, who has the brain cells left to think more about XML? Time to hit the streets. (All pictures have titles: hold your mouse pointer over them).

I stayed at the Hotel Barbizon. I am not recommending it (it does beat the goo out of the Novotel at the RAI, though). Next time I go, if they have a reasonable rate available, I'm staying where Damien Steer and Libby Miller did, at the Amsterdam American Hotel. Great location (right on the Leidesplein). It turned out to be a common meeting place

As always at these conferences, I hung out a good deal with Libby and Damian (They also hung out with Lori, the kids, DanBri, Max and me in Bath in 2003). They're a lot of fun to explore and chill with. The XML crowd is, in general. I've always been impressed by the backgrounds of people in the XML circle. To a much greater extent than most expert developer scenes they come from backgrounds from literature, visual art and graphic design, music and theater, relational and object-oriented methodologies, engineering, and more.

I also met a lot of new folks, which is always cool. Ian Forrester and and his wife Sarah were the highlight, I think. Sarah is from Racine, Wisconsin, and reminds me a lot of my clever, hilarious, strong- willed, energetic wife Lori. I've found there's a definite class of women from Wisconsin with those impressive characteristics. Kristen, Fourthought's current manager at Sun, is another. Next time Lori and I head to the UK (next year, prolly), we're hunting down Ian and Sarah to hang out with (whether they like it or not). I also met and had a good time with Ralph Meijer, who was presenting in place of my friend Peter Saint-Andre, Ruud Steltenpool who was pimping SVG Open 2005 and who was kind enough to bring a basketball (I'd asked for ideas on exercise opportunities), Dominique Hazaël- Massieux, Håkon Lie, Michael Day, yet another Dan and yet another Matt (both Brits whose surnames I forget). I met in person for the first time electronic acquaintances Mark Baker and Sebastian Schnitzenbaumer. There were also usual compadres Edd, Eric, Leigh, Micah (The XMLHACK gang), DanBri, MattB, Dahobe, DanCon, Liam, Ricko, Liz, etc.

Amsterdam doesn't seem to have as much fly graffiti as, say Paris or London, but then again I haven't explored it as widely as London. Norm, who clearly has a good photographer's eye, did find a good deal of street art (a good deal indeed), mostly in the form of stickers. Last year Libby captured a beaut. I did find a stenciled cut at Bush and a few other bits, but nothing very exciting.

Thursday night the afters consisted of a party at Steve Pemberton's spacious apartment. The views from his roof are spectacular (see top image, for example). On Friday, after some tapas a large group of XTechers lolled about at Het Vondelpark. I was still craving exercise, so I wandered around until I found a bunch of (I think) Surinamese playing keepie-uppie (a.k.a. juggling) with a soccer ball. They were pretty damned good, but I'm decent enough, so I asked to join them, and did. I then returned to the group of XTechers.

After an Italian dinner we all split up and I went to check out the clubs near Leidseplein and the Reguilerswaarstraat (or something like that). It was fun, except that I was lacking a bit of steam. I noticed that a very weird combination of Samba and Euro electro seemed to be all the rage. Saturday, after a gratifying sleep-in I joined Damien, Libby and DanBri at the Leidseplein. We did some shopping, made an abortive attempt to go on a cruise (line was too long), and, joined by Liam and Mark Baker strolled around until we got to a BarBQ to which Liz and Dave had kindly invited us. We were joined by some of a very fun collective called, apparently, "Hippies from Hell", ate, chatted, joked, and recited poetry in English and Dutch (Liz treated us to Wendy Cope, for example), and that closed off my trip very pleasantly.

For more pictures check out the Flickr XTech tag.

[Uche Ogbuji]

via Copia

XTech: Mike Kay on XQuery and XSLT 2.0

As I mentioned in my more complete report Mike Kay's presentation was worth a further entry (besides, my note-taking discipline went to hell right after his talk, so I don't have as much to work with on the rest).

The title was Comparing XSLT and XQuery. Much of what Mike discussed applies to XSLT 1.0 as well as 2.0. He did spend some time talking about the role of XPath 2.0 as the basis of both XSLT 2.0 and XQuery. As he puts it XSLT 2.0 is a 2-language system. You call XPath from specific constructs within XSLT. XQuery on the other hand has XPath incorporated into basic language. The way I think of it, XSLT is a host language for XPath, while XQuery is a (greatly) extended version of XPath.

I think the most important contribution Mike made in this paper was a very sober appraisal of the barriers to learning XSLT and XQuery. The difficulties developers have with XSLT are well known: we've had some 6 years to discuss them them. Mike summarizes them as follows:

  1. XML fundamentals: encoding, entities, white space, namespaces, etc.
  2. Declarative programming: variables, recursion, paths, grouping
  3. Data model: the mental shift from the angle brackets they see to the abstraction of nodes
    • confusion between what devs see in the XML versus what their program sees
    • confusion over proper output, e.g. subtlety that creating an element in the output tree is not the same thing as creating text containing angle brackets
  4. Rule-based programming:
    • template dispatch, which forces a non-linear way of thinking about transforms. Mike Kay mentioned the parallels with GUI programming. (I tend to think this common comparison is generally right, but is just stretched enough to be unhelpful in determining how to get developers in the right mind-set).

Mike Kay had Ken Holman in the audience so he did the sensible thing in asking the foremost expert on XML-related training. Ken agreed: "Yep. That hits the high points of the first day of getting people to know what's going on in XSLT."

In my opinion, there is one more category of difficulty, which is capability limitations in XSLT 1.0 (most of which are addressed in EXSLT or XSLT 2.0). This includes frustrations such as the result tree fragment/node set split, the poor facilities for string manipulation, node set operations, date/time processing, etc.

Mike feels that XQuery only eliminates the 4th barrier (it has no templates). Reading between the lines, this is a powerful indictment of the idea of a separate XQuery. I think it's hard to argue that we need such a complex separate language purely from the pedagogical viewpoint (no, I'm not saying "andragogical").

Mike pointed out that people coming to XQuery from SQL tend to write everything in FLWOR expressions (rather than, say XPath with predicates). FLOWR is comfy and SQLey, but this just annoys me. I've pointed out in my bemoaning of SPARQL how unfortunate I think it is that SQL people insist on turning all other languages into some nasty mutation of SQL. I was suitably entertained by seeing Mike demonstrate how easy it is to get caught up in the subtle differences between SQL and FLOWR. Again I'm reading between the lines, but I got the sense that Mike was himself not unamused by the task of pointing out such trip-wires.

Mike finished up with a benchmark which he prefaced with an armload of caveats (a healthy practice, as I've learned from experiences in benchmarking). Saxon running XSLT trounced all processors except for MSXML on a certain task involving the XSLT analogue of a relational join, and with document sizes of 1MB, 4MB and 10MB. In a surprise result, Saxon running XSLT even beat Saxon running XQuery (As Mike said, "in the XQuery world implementors look to optimize joins"). All the XSLT processors suffered N^2 performance degradation with doc size. But strangely enough some of the XQuery tools did as well, including Galax. Qizx did show linear characteristics.

Kay then proved that there is no reason one cannot optimize joins for XSLT by writing a join optimizer for Saxon/XSLT. When he updated the benchmark result slide to show the fruit of this join optimization, we were all astonished to see how thoroughly Saxon ended up trouncing everything in the field at all three doc sizes. Now that, my friends, is the work of a superstar developer.

I'll be tinkering with how Amara handles some of Mike's XSLT and XQuery examples in a coming entry.

See also:

[Uche Ogbuji]

via Copia

Scattered notes from XTech

XTech 2005. Amsterdam. Lovely time. But first of all, I went for a conference. Edd Dumbill outdid himself this time. The first coup de maître was sculpting the tracks to increase the interdisciplinary energy of the meet. The browser track brought out a lot of new faces and provided a jolt of energy. There did seem to be a bit of a divide between the browser types and the XML types, but only as much as one would expect from the fact that XML types tend to know each other, and ditto browser types. There was plenty of crosstalk between the disciplines as well.

Second touch: focus on open data, and all the excitement in that area (Creative Commons, remixing/mash-ups, picture sharing, multimedia sharing, microformats, Weblogging, content syndication, Semantic technology, podcasting, screencasting, personal information spaces, corporate info spaces, public info spaces, etc.) and watch the BBC take over (with they bad selves). And don't fret: "damn, maybe we should lighten up on the BBC bias it he speakers". No, just go with it. Recognize that they are putting forth great topics, and that everyone is amped about how the BBC is leading the way on so many information technology and policy fronts.

Third touch: foster collaboration. Put up a Wiki, encourage folks to an IRC channel, aggregate people's Weblog postings and snapshots into one place, Planet XTech, and cook up a fun little challenge to go with the theme of open data. For that last bit Edd put out an XML representation of the conference schedule and asked folks to do something cool with it. I didn't do as much with it as I'd hoped. When I finally got my presentation going I used the posted grid.xml as a demo file for playing with Amara, but I wished it had more content, especially mixed content (it's very attribute heavy). I've suggested on the XTech Wiki that if Edd does the same thing next time, that he work in paper abstracts, or something like that, to get in more text content.

I said "When I finally got my presentation going", which hints at the story of my RAI (venue for XTech) jinx. Last year in Amsterdam I couldn't get my Dell 8600 running Fedora Core 3 to agree with the projectors at the RAI. As Elliotte Rusty Harold understates in his notes from the 2004 conference:

After some technical glitches, Uche Ogbuji is talking about XML good practices and antipatterns in a talk entitled "XML Design Principles for Form and Function"

In fact I ended up having to switch to OpenOffice on Windows, and the attendees endured a font only a hippie could love (Apparently Luxi Sans is not installed by default on Windows and OO/Win has a very strange way of finding a substitute). I'm vain enough not to miss quoting another bit about my talk from Elliotte:

A very good talk. I look forward to reading the paper. FYI, he's a wonderful speaker; probably the best I've heard here yet.

Gratifying to know I managed a good talk under pressure. I hope I did so again this time, because the RAI projectors were no more friendly. The topic was "Matching Python idioms to XML idioms". Remembering the last year's headache I asked about a projector to use to test out my presentation (I was on the first day, Weds). Usually conference speakers' rooms have a spare projector for this purpose, but it looks as if the RAI couldn't supply one. I crossed my fingers and arrived for my talk the dutiful 15 minutes early. Eric van der Vlist was up before me in the block. The AV guy came along and they spent quite a while struggling with Eric's laptop (Several speakers had trouble with the RAI projectors). They finally worked out a 640x480 arrangement that caused him to have to pan around his screen. This took a while, and the AV guy bolted right afterward and was not there to help me set up my own laptop. Naturally, neither I nor the very helpful Michel Biezunski (our session chair) were able to get it to work, and we had to turn things over to Eric to start his talk.

We then both went in search of the AV guy, and it took forever to find him. No, they didn't have a spare projector that we could use to set up my laptop in time for my talk. We'd just have to wait for Eric to finish and hope for the best (insert choice sailor's vocabulary here). My time slot came and we spent 20 minutes trying every setting on my laptop and every setting on their projector. The AV guys (yeah, when it was crisis time, they actually found they had more than one) muttered taunts about Linux, and it's a lucky thing I was bent on staying calm. I present quite often, and I do usually have to try out a few settings to get things to work, but in my encounters it's only the RAI projectors that seem completely incapable to project from my Linux laptop. In all, I witnessed 4 speakers (3 on Linux and surprisingly one on Mac OS X) who had big problems with the RAI projectors, including one of the keynote speakers. I suspect others had problems as well.

I couldn't take the obvious escape route of borrowing someone else's laptop because the crux of my talk was a demo of Amara and I'd have to install all that as well (Several kind volunteers including Michel had 4Suite installed, but not Amara). After 20 minutes, we agreed that I'd go on with my talk on Michel's computer (Thinkpad running Red Hat 9 and it worked with the projector immediately!), skip the demo, and we'd find another time slot for me to give the entire talk the next day. Quite a few people stuck around through this mess and I'm grateful to them.

The next day we installed Amara on Michel's computer and I gave the presentation in its proper form right after lunch. There was great attendance for this reprise, considering everything. The Amara demo went fine, except that the grid.xml I was using as a sample gave too few opportunities to show off text manipulation. I'll post a bit later on thoughts relating to Amara, stemming from the conference. Norm Walsh was especially kind and encouraging about my presentation woes, and he has also been kind in his notes on XTech 2005:

The presentation [deities] did not smile on Uche Ogbuji. He struggled mightily to get his presentation of Matching Python Idioms to XML Idioms off the ground. In vain, as it turned out (AV problems were all too common for a modern conference center), but he was generous enough try again the next day and it was worth it (thanks Uche!). I'm slowly becoming a Python convert and some of the excellent work that he's done at Fourthought to provide Python access to standard XML in ways that feel natural in Python is part of the appeal.

That's the precise idea. A tool for processing XML that works for both Python heads and XML heads. The whole point of my presentation was how hard this is to accomplish, and how just about every Python tool (including earlier releases of 4Suite) accommodates one side and not the other. The response to Amara from both the Python heads and XML heads makes me feel I've finally struck the right balance.

I got a lot out of the other XTech talks. Read Norm on the keynotes: he pretty much had the same impressions as I did. Props to Michael Kay for his great presentation comparing XSLT 2.0 and XML Query. I took enough notes at that one for a separate entry, which will follow this one. I missed a lot of the talks between Kay's and my own while I was trying (unsuccessfully) to head off the AV gremlins.

Other talks to highlight: Jon Trowbridge's on Beagle (who, you guessed it, had AV problems that ate up a chunk of his time slot). From the project Wiki:

Beagle is a search tool that ransacks your personal information space to find whatever you're looking for. Beagle can search in many different domains: documents, emails, web history, IM/IRC conversation, source code, images, music files, applications and [much more]

Edd had already introduced me to Beagle, but it was really cool to see it in action. I'll have to check it out. Jon also pointed out TomBoy, "a desktop note-taking application for Linux and Unix. Simple and easy to use, but with potential to help you organize the ideas and information you deal with every day." Two projects I'll have to give a spin. Props to Jon for shrugging off the AV woes and giving a fun and relaxed talk.

Robert O'Callahan's talk on the new canvas tag for Mozilla and Safari was memorable if for nothing else than the experience of surfing Google at a 45° angle, with no apparent loss in snappiness. This canvas thingie looks wicked cool, and it's good to see them working to incorporate SVG. I've heard a lot of grumbling from W3C types about canvas, and all we poor browser users in the middle can hope for is some rapid conversion of cool technologies such as XAML, XUL, canvas, SVG, etc. Others have blogged about the opportunities and anxieties opened up by the WHATWG, which one commentator said should have been the "WHAT Task Force" because "WTF" would have been a better acronym. I'm a neutral in these matters, except that I really do with browser folks would do what they can to push people along to XHTML 2.0 rather than cooking up HTML 5.0 and such.

Matt Biddulph was one of the BBC Massive on hand, and his talk "The Application of Weblike Design to Data - Designing Data for Reuse" offered a lot of practical tips on how to usefully open up a large body of data from a large organization.

Dominique Hazaël-Massieux gave a talk on GRDDL (O most unfortunate project name), which was my first hearing of the technology. My brief characterization of GRDDL is as an attempt to pull the Wild West ethos of microformats into the rather more controlled sphere of RDF. It touches on topics in which I've been active for years, including tools for mapping XML to RDF. I've argued all these years that RDF folks will have to embrace general XML in place of the RDF/XML vocabulary if they are to make much progress. They will have to foster tools that make extracting RDF model data from XML a no-brainer. It's great to see the W3C finally stirring in this direction. Dom's presented very well. I asked about the use of other systems, such as schema annotation, for the XML to RDF mapping. It seemed to me that GRDDL was mostly geared towards XSLT. Dom said it is meant to be independent of the mapping mechanism, but in my post-conference reading I'm not so sure. I'll have to ponder this matter more and perhaps post my thoughts. Dom also mentioned PiggyBank, "the Semantic Web extension for Firefox". Kingsley Idehen has a nice blurb on this software. I do hesitate to use it because someone mentioned to me how PiggyBank had gone into crazy thrash mode at one point. I don't muck with my FireFox set-up lightly.

Rick Jelliffe showed off Topologi's lightweight browser TreeWorld, which is XML-oriented and suitable for embedding into other applications.

Others have blogged Jean Paoli's closing keynote (http://glazman.org/weblog/dotclear/index.php?2005/05/29/1059-adam-3">Leigh, etc.). Seems I'm not the only one who was put off by the straight-up product pitch. At least he did a bit of a service by clearly saying "Binary XML: No please". Check out more quotes from XTech.

The conference was superb. Do be sure not to miss it next year. It's looking like Amsterdam will be the venue again. And what of Amsterdam? Besides the conference I had a great time with friends. I'll post on that later.

For the most comprehensive report I've seen to date, see Micah Dubinko's article.

[Uche Ogbuji]

via Copia