Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I recently needed to do some Web scraping while doing the day job thingie. It's a common task for people who do what I do these days, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best-known Python wrapper, uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In the most recent task, however, I had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) to use the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara bindery objects.

I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is a Python script I wrote to list all the MP3 links from a given Web page (for easy download with wget):

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara, even in this simple example, is how I was able to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but, for some reason, no ends-with). See tidy.py for more sample code.
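
If you'd rather keep the whole test in XPath, ends-with can be faked in XPath 1.0 with substring() and string-length(). A small sketch of my own (not part of the tidy.py demo), reusing the doc object from above:

#Pure XPath 1.0 emulation of ends-with(): compare the tail of @href to '.mp3'
for link in doc.xml_xpath(
        u"//html:a[substring(@href, string-length(@href) - 3) = '.mp3']"):
    print link.href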

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia

Python community: Transolution, py lib, encutils, pyxsldoc, PDIS and Picket

In a comment on "We need more solid guidelines for i18n in OSS projects", Fredrik Corneliusson mentioned Transolution, "[a translation] suite project I'm working on [with] a XLIFF editor and a Translation Memory. It's written in Python and we need all the help and testers we can get." I browsed the project site, and it seems to me quite comprehensive and well thought-out. It's heavy on XLIFF, which is pretty heavy stuff in itself, but it does links to projects that allow exchange between .po and XLIFF files. It's certainly great to see Python at the vanguard of XML-based i18n.

I found a couple of new tools from the py lib via Grig Gheorghiu's Weblog entry "py lib gems: greenlets and py.xml", which is good reading for those interested in Python and XML. The py lib is just a bundle of useful Python library add-ons. Grig mentioned a sample for one of the modules, Armin Rigo's greenlets. Greenlets are a "spin-off" from Stackless Python, and thus provide some very interesting code that manipulates Python flow of control in order to support lightweight concurrency (microthreads), and also proper coroutines. I've already been pecking about the edges of what's possible with semi-coroutines, and it has always been clear to me that what Python needs in order to really bring streaming XML processing to life is full coroutine support (which seems to be on the way for Python 2.5). While we wait for full coroutines in Python, Armin gets us started with a greenlets demo that turns PyExpat's callback model into a generator that returns elements as they're parsed. Grig posts this snippet as "iterxml.py" in his entry.
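
For the curious, the trick is roughly this (a minimal sketch of my own, not Grig's iterxml.py; it assumes the standalone greenlet module and yields start tags only, where the real demo returns full elements):

import xml.parsers.expat
from greenlet import greenlet, getcurrent

def iterstarttags(xml_text):
    #Yield (element name, attributes dict) pairs as expat reports them
    def run():
        parser = xml.parsers.expat.ParserCreate()
        #Each callback hands its data back to the consuming greenlet
        parser.StartElementHandler = (
            lambda name, attrs: getcurrent().parent.switch((name, attrs)))
        parser.Parse(xml_text, True)
        return None  #Signals exhaustion to the consumer

    producer = greenlet(run)
    while True:
        item = producer.switch()
        if item is None:
            break
        yield item

for name, attrs in iterstarttags("<doc><a x='1'/><b/></doc>"):
    print name, attrs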

Grig also touches on Holger Krekel's py.xml, a module for generating XML (and HTML). py.xml is not unlike JAXML, which I covered in "Three More For XML Output". These all seem to project back to Greg Stein's early proposal for an XML generation tool that is as ingrown as possible into Python's syntax. Certainly worth a look for others.

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update accommodates changes in what has recently been announced as CherryPy 2.1 beta. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, which is a library for dealing with the encodings of files obtained over HTTP, including XML files. He does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for this older Python cookbook recipe by Paul Prescod and this newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.
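
For the simplest case, an ASCII-compatible byte stream, the declaration sniff is just a pattern match. Here is a bare-bones sketch of my own; it is nowhere near as thorough as Cowan's algorithm or the recipes mentioned, which also handle byte order marks and the UTF-16/UTF-32 families:

import re

XML_DECL = re.compile(r'^<\?xml[^>]*?encoding=[\'"]([A-Za-z][A-Za-z0-9._-]*)[\'"]')

def declared_encoding(xml_bytes):
    #Return the encoding named in the XML declaration, or None if absent
    match = XML_DECL.match(xml_bytes)
    return match and match.group(1) or None

print declared_encoding('<?xml version="1.0" encoding="iso-8859-1"?><doc/>')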

I recently discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones (which offer Python support). This includes an XML parser based on PyExpat and an XPath implementation based on ElementTree.

[Uche Ogbuji]

via Copia

Lifting XSLT into application domain with extension functions?

The conversation continues about the boundary between traditional presentation languages such as XSLT and XML toolkits in traditional application languages such as Python. See earlier installments "Sane template-like output for Amara" and "Why allow template-like output for Amara?". M. David Peterson, the mind behind the XSLT blog, responded to the latter post, and his response deserves a post of its own:

This is an excellent post that brings out some important points. I do wonder however if the solution can be solved utilizing a base of underlying functions that can then be implemented via 2 mechanisms:

  • The XSLT (potentially 2.0 given 1.0 support already exists) engine which will properly invoke the necessary sequence of functions to process and transform the input XML.
  • Amara, which will invoke a similar combination of functions, however in a way more in line with the Python architecture and programmers mentality.

Being a novice Python programmer, at best!, makes it difficult to suggest this solution with a whole lot of confidence... as such, I won't and simply leave it as something that, if possible, might be worth consideration as it will leave the door open for the reverse situation, e.g. someone like myself who sees the value Amara and Python offer but given my background would prefer to work with XML via XSLT (2.0 preferably of course :D) except in cases where it's obvious the platform offers a much simpler and elegant solution. In cases like this (which it sounds as if this particular situation qualifies) I am definitely more interested in the fastest way in and out of a process than I am in firing up the transformation object to perform something that is ridiculously easy using the platform API.

This is an interesting point, especially because it is something we already tried to do in 4Suite. When we first started writing the 4Suite repository, we built XSLT scripting into its very DNA. You can do just about anything to the repository through a set of specialized XSLT extensions. You can write entire Web sites (including CRUD functions) in XSLT, and indeed, there are quite a few examples of such sites, including:

These are all 100% XSLT sites, implemented using 4Suite's repository as the HTTP server, request processing engine (via XSLT script), database back end, etc. You don't even need to know any Python to write such sites (Wendell Piez proved this when he created The Sonneteer). You just need to know XSLT and the 4Suite repository API as XSLT functions. These APIs mirror the Python APIs (and, to some extent, the command-line and HTTP-based APIs). Mike Olson insisted that this be a core architectural goal, and we got most of the way there. It served the original, narrow needs quite well.

We all saw the limitations of this approach early, but to some extent succumbed to the it-all-looks-like-a-nail fallacy. It was so easy for me, in particular, to churn out 4Suite repo based sites that I used the approach even where it wasn't really the best one. I'm definitely through with all that, though. I've been looking to do a better job of picking the best tool for each little task. This is why I've been so interested in CherryPy lately. I'd rather use a specialized tool for HTTP protocol handling than the just-good-enough version in 4Suite that is specialized for handing off requests to XSLT scripts. Lately, I've found myself advising people against building end-to-end sites in 4Suite. This is not to strand those who currently take advantage of this approach, but to minimize the legacy load as I try to steer the project towards a framework with better separation of concerns.

When I look at how such a framework would feel, I think of my own emerging choices in putting together such sites:

  • 4Suite repository for storing and metadata processing of XML documents
  • Amara for general purpose XML processing in Python code
  • XSLT for presentation logic
  • CherryPy for protocol serving

My own thoughts for a 4Suite 2.0 are pretty much grounded in this thinking. I want to separate the core libraries (basically XML and RDF) from the repository, so that people can use these separately. I also want to de-emphasize the protocol server of the repository, except as a low-level REST API. That having been said, I'm far from the only 4Suite developer and all I can do is throw in my own vote on the matter.

But back to the point raised, the idea of parallel functions in XSLT and another language, say Python, is reasonable, as long as you don't fall into the trap of trying to wrench XSLT too far off its moorings. I think that XSLT is a presentation language, and that it should always be a presentation language. With all the big additions in XSLT 2.0, the new version is still really just a presentation language. Based on my own experience I'd warn people against trying to strain too hard against this basic fact.

And remember that if you make XSLT a full model logic language, you have not solved the problem of portability. After all, if you depend on extensions, you're pretty much by definition undermining portability. And if XSLT becomes just another competitor to Java, ECMAScript, Python, etc., then you haven't gained any portability benefit, since some people will prefer to use different languages from XSLT in their model logic, and XSLT code is obviously not portable to these other languages. Sure you have platform portability, but you already have this with these other languages as well.

[Uche Ogbuji]

via Copia

No religious conversion to XML

Sylvain Hellegouarch's comments always seem to require another full blog entry for further discussion (that's a good thing: he asks good questions). In response to "Why support template-like output in Amara?", he said:

Regarding the point of bringing developers who dislike XML into the X-technology world, I think it's useful but I hope you won't try too hard. Whatever tools you could bring to them and how hard you may try, if they have a bad feeling about XML & co., you won't be able to change their mind.

That's not really what I meant. I don't go for religious conversions. The issue is not that there are people out there who will never have anything to do with XML. That's fine. The issue is that some people hate XML but at the same time have no choice but to use XML. You hear a lot of comments such as "I hate that stupid XML, but my job requires me to use it". XML is everywhere (I certainly agree it's overused) and most developers cannot avoid XML even if they dislike it. The idea is to give them sound XML tools that feel right in Python, so that they don't shoot themselves in the foot with kludgery such as parsing with regex, or even the infamous:

print "<foo>", spam, "</foo>"

Aside: if anyone who has to deal with XML is not aware of all the myriad ways that the above will bite you in the arse, they should really read "Proper XML Output in Python". The idea is that tools like Amara don't all of a sudden make people like XML, but rather they make XML safer and easier for people who hate it. Of course they also make things easier for people who like it, like me.
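
To make the point concrete, here is a tiny illustration of my own (not from that article) of the simplest failure mode, using nothing but the standard library:

from xml.sax.saxutils import escape

spam = 'AT&T <rocks>'
print "<foo>", spam, "</foo>"          #Ill-formed: raw & and < leak into the markup
print "<foo>%s</foo>" % escape(spam)   #At minimum, escape the character data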

I "categorise" people who don't like XML into three sections :

  • Those who never tried and simply judge-before-you-taste.
  • Those who tried XML but didn't use it for the right purpose. Some people only see XML as a language used by some dark J2SE application servers for their configuration file. They don't realise that XML is also a meta language that has brought some other fantastic tools to store, describe, transform, validate, query data.
  • Those who simply react to the hype XML has had in the last 5 years. A bit like when you hear for months that a movie you haven't seen at the cinema is fantastic and that you should really watch it. You get so tired of hearing it that you don't want to watch it.

Nice classification. I think the good and the bad of XML is that it has brought so many areas of interest together. As I say in this Intel developer journal article:

XML was a development of the document management community: a way to put all their hard-won architectures on the wide, enticing Web, but when it burst on to the scene, the headlines proclaimed a new king of general-purpose data formats. Programmers pounced on XML to replace their frequent and expensive invention of one-off data formats, and the specialized parsers that handled these formats. Web architects seized XML as a suitable way to define content so that presentation changes could be made cleanly and easily. Database management experts adopted XML as a way to exchange data sets between systems as part of integration tasks. Supply-chain and business interchange professionals embraced XML as an inexpensive and flexible way to encode electronic transactions. When so many disparate disciplines find themselves around the same table, something special is bound to happen.

XML itself is not very special. It represents some refinement, some insight, and many important tradeoffs, but precious little innovation. It has, however, become one of the most important new developments in information systems in part because of the fact that so many groups have come to work with XML, but also because it has focused people's attention to important new ways of thinking about application development.

The reason XML is overhyped is because we live in the age of hype. People don't know how to say "X is useful" any more. They'd rather say "it's the gods' solution to every plague released by Pandora" or they say "It's the plaything of the guardians of every circle of Hell". XML is neither, of course. It's useful because it happens to be one data format that is respectable in a wide variety of applications. But like any compromise solution, it is bound to have some weaknesses in each specific area.

[Uche Ogbuji]

via Copia

Why support template-like output in Amara?

When I posted "Sane template-like output for Amara", Sylvain Hellegouarch asked:

I feel like you are on about to re-write XSLT using Python only and I wonder why.

I mean, for instance, one of the main reasons I'm using XSLT in the first place is that whatever programming language I am using, I don't need to learn a new templating language over and over again. I can simply extend my knowledge of only one: XSLT, and then become better in that specific one.

It also really helps me distinguish between the presentation and the logic upon the data, since the logic resides within the programming language itself, not the templating language.

Therefore, although your code looks great, I don't feel confident using it since it would go against what I just said above.

This is a question well worth further discussion.

The first thing is that I have always liked XSLT 1.0, and I still do. Nothing I'm doing in Amara is intended to replace XSLT 1.0 for me. However, there are several things to consider. Firstly, Python programmers seem to have a deep (and generally inexplicable) antipathy towards XML technology. I often hear Pythoneers saying things like "I hate XML because it's over-hyped and used all the time, even when it's not the best choice". Well, this applies to every popular software technology from SQL to Python itself. Nothing new in the case of XML. But never mind all that: Pythoneers don't like XML, and it very often drives them to abuse XML (perhaps if something you dislike is ubiquitous, using it poorly seems a small measure of revenge?) Anyway, someone who doesn't like XML is never going to even look slant-wise at XSLT. One reason for making it easy to do in Amara the sorts of things Sylvain and I may prefer to do in XSLT is to accommodate other preferences.

The second thing to consider is that even if you do like XSLT (1.0 or 2.0), there are places where it's best, and places where it's not. I think the cleanest flow for Web output pipelines can be described by the following diagram. I call this the rendering pipeline pattern (no not "pattern" as in big-hype-ooh-this-is-crazy-deep, but rather "pattern" as in I-do-this-sorta-thing-often-and-here's-what-it-looks-like).

Aside: I've been trying out OpenOffice.org 2.0 beta, and I'm not sure what's up with the wrong-spelling squiggly on the p word. I also wish I didn't have to use screen capture as a poor man's export to PNG.

Separation of model from view should be in addition to separation of content from presentation, and this flow covers both. For my purposes the source data can be a DBMS, flat files, or whatever. The output content is usually an XML format specially designed to describe the information to be presented (the view) in the idiom of the problem domain. The rendered output is usually HTML, RSS, PDF, etc.
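
Spelled out in rough text form, the flow is something like this (my sketch of the two arrows referred to below):

source data (DBMS, flat files, XML, ...)
    --[model logic: e.g. Python code using Amara]-->
output content (view XML in the idiom of the problem domain)
    --[presentation logic: e.g. XSLT, on the server or in the browser]-->
rendered output (HTML, RSS, PDF, ...)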

In this case, I would use something such as the proposed output API for Amara for the first arrow. Python code would handle the model logic processing and Amara would provide for convenient generation of the output XML. If some of the source data is in XML, then Amara would help further by making that XML a cinch to parse and process. I would use XSLT for the second arrow, whether on the server or, when it is feasible, in the browser.

The summary is that XSLT is not suitable for all uses that result in XML output. In particular it is not generally suitable for model logic processing. Therefore, it is useful for the tools you do use for model logic processing to have good XML APIs, and one can borrow the best bits of XSLT for this purpose without seeking to completely replace XSLT. That's all I'm looking to do.

[Uche Ogbuji]

via Copia

Python/XML community: Amara, lxml and Picket

Amara XML Toolkit 1.0b3
lxml 0.7
Picket 0.4

Amara XML Toolkit 1.0b3 "is a collection of Python tools for XML processing—not just tools that happen to be written in Python, but tools built from the ground up to use Python idioms and take advantage of the many advantages of Python. Amara builds on 4Suite [http://4Suite.org], but whereas 4Suite focuses more on literal implementation of XML standards in Python, Amara focuses on Pythonic idiom." In this release:

  • Add xml_set_attribute method to elements, in order to allow adding attributes with namespaces or with illegal Python names
  • Update manual source for markdown, and extensive improvements to the manual (with much help from Jamie Norrish)
  • Add xml_doc facility for nodes
  • Fix support for output parameters in xml()
  • Add support for rules to pushbind
  • Improve XSLT support for bindery objects (see demo/bindery/xslt.py)
  • Bug fixes

lxml 0.7 is an alternative, more Pythonic binding for the libxml2 and libxslt XML processing libraries. Martijn Faassen says "lxml 0.7 is a release with quite a few new features and bug fixes, including XPath expression parameters, XInclude support, XMLSchema validation support, more namespace prefix support, better encoding support, and more."
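
As a rough illustration of what the XPath expression parameters feature is about, here is a sketch of my own (not from the announcement, and the exact API spelling may differ between lxml releases):

from lxml import etree

doc = etree.fromstring("<site><item id='a1'/><item id='b2'/></site>")
#The id is passed as an XPath variable instead of being pasted into the query string
matches = doc.xpath("//item[@id = $id]", id="b2")
print [el.get("id") for el in matches]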

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update is mostly in order to support CherryPy development snapshots that are soon to become CherryPy 2.1. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

[Uche Ogbuji]

via Copia

Sane template-like output for Amara

In an earlier entry I showed some Amara equivalents for XSLT 2 and XQuery examples. I think the main disadvantage of Amara in these cases was the somewhat clumsy XML output generation syntax. This is not an easy problem to fix. XSLT and XQuery basically work XML syntax directly into the language, to make output specification very seamless. This makes sense as long as they stick to the task of being very determinate black boxes taking one body of XML data and working it into another. But often you turn to a language like Python for XML processing because you want to blow the lid off the determinate black boxes a bit: you want to take up all the power of general-purpose computing.

With this power comes the need to streamline and modularize, and the usual first principle for such streamlining is the principle of separating the model from presentation. This is a much easier principle to state than to observe in real-life processing scenarios. We love template languages for XML and HTML generation because they are so convenient in solving real problems in the here and now. We look askance at them, however, because we know that they come with a tendency to mix model and presentation, and that we might regret the solution once it comes time to maintain it when (as is inevitable) model processing requirements change or presentation requirements change.

Well, that was a longer preamble than I'd originally had in mind, but it all boils down to my basic problem: how do I make Amara's output mechanism more readable without falling into the many pitfalls of template systems?

Here is one of the XSLT 2 examples:

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n" 
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara 1.0b3 it goes something like:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()

The following snippet from that function is a good example of the clumsiness:

person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)

If I could turn all this into:

newdoc.xml_append_template("<person name='{person.name}'/>")

This would certainly be a huge win for readability. The curly brackets are borrowed from XSLT attribute value templates (AVTs), except that their contents are a Python expression rather than an XPath. The person element created is empty for now, but it becomes just part of the data binding and you can access it using the expected newdoc.person or newdoc.person.name.

One important note: this is very different from `"<person name='%s'/>" % (person.name)`. What I have in mind is a structured template that must parse as well-formed XML (though, as a fragment, it can have multiple root elements). The replacement occurs within the perfectly well-formed XML structure of the template. As with XSLT AVTs, you can represent a literal curly bracket as {{ or }}.
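
For what it's worth, the substitution mechanics need not be exotic. Here is a minimal sketch of my own (not an actual Amara API) of the {expression} expansion, ignoring attribute-quoting subtleties:

import re
from xml.sax.saxutils import escape

EXPR = re.compile(r'\{([^{}]+)\}')

def expand_template(template, namespace):
    #Evaluate each {python-expression} against the given namespace,
    #coerce it to Unicode and escape it; {{ and }} stay as literal braces
    protected = template.replace(u'{{', u'\x00').replace(u'}}', u'\x01')
    def repl(match):
        return escape(unicode(eval(match.group(1), {}, namespace)))
    expanded = EXPR.sub(repl, protected)
    return expanded.replace(u'\x00', u'{').replace(u'\x01', u'}')

print expand_template(u"<person name='{person_name}'/>",
                      {'person_name': u'Kasparov & Karpov'})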

The other output generation part in the example:

for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )

Would become

for item in items:
            newdoc.person.xml_append_template("<item>{item}</item>")

This time we have the template substitution going on in the content rather than an attribute. Again I would want to restrict this entire idea to a very clean and layered template with proper XML semantics. There would be no tricks such as "<{element_name}>spam</{element_name}>". If you wanted that sort of thing you could use the existing API such as xml_element(element_name), or even use Python string operations directly.

The complete example using such a templating system would be:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        newdoc.xml_append_template("<person name='{person.name}'/>")
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            newdoc.person.xml_append_template("<item>{item}</item>")
    #All done.  Print out the resulting document
    print newdoc.xml()

I think that this example is indeed more readable than the XSLT version.

One tempting thing about this idea is that all the building blocks are there. 4Suite already gives me the ability to parse and process this template very easily, and I could implement this logic without much trouble. But I also think that it deserves some serious thought (and, I hope, feedback from users). There's no hurry: I don't plan to add this capability in the Amara 1.0 cycle. I need to get Amara 1.0 out, and I'm expecting that 1.0b3 is the last stop before a release candidate.

So, things to ponder.

Security. Any time you support such arbitrary-code-in-template features the tainted string worry comes up: what happens if one is not careful with the expression that is used within a template? I think that this issue is not really Amara's responsibility. The developer using Amara should no more pass in untrusted Python expressions to a template than they would to an exec statement. They should be aware that Amara templates will execute arbitrary Python expressions, if they're passed in, and they should apply the usual precautions against tainting.

String or Unicode? Should the templates be specified as strings or Unicode? They are themselves well-formed XML, which makes me think they should be strings (XML is really defined in terms of encoded serialization, and the Unicode backbone is just an abstraction imposed on the actual encoded byte stream). But is this confusing to users? I've always preached that XML APIs should use Unicode, and my products reflect that, and for a user that doesn't understand the nuances, this could seem like a confusing exception. Then again, we already have this exception for 4Suite and Amara APIs that parse XML from strings. My leaning would be to have the template expressed as a string, but to have the results of expressions within templates coerced to Unicode. This is the right thing to do, and that's the strongest argument.

Separation of model and presentation. The age-old question with such templates is whether they cause tangles that complicate maintenance. I think one can often make an empirical check for such problems by imagining what happens in a scenario where the data model operations need to change, and another scenario where the presentation needs to change.

As an example of a model change, imagine that the source for the item info was moved from an XML document to a database. I wouldn't need to change any of the templates as long as I could get the same values to pass in, and I think it's reasonable to assume I could do this. Basically, since my templates simply refer to host variables whose computation is nicely decoupled from the template code, the system passes the first test.

As an example of a presentation change, imagine that I now want to generate XHTML directly, rather than this <person><item>... business. I think the system passes this test as well. The templates themselves would have to change, but this change would be isolated from the computation of the host variables used by the templates. Some people might argue that I'm grading these tests too leniently, and that it's already problematic that the computation and presentation occurs so close together, in the same function in the same code file. I'm open to being convinced this is the case, but I'd want to hear of practical maintenance scenarios where this would be a definite problem.

So what do you think?

[Uche Ogbuji]

via Copia

Amara equivalents of Mike Kay's XSLT 2.0, XQuery examples

Since seeing Mike Kay's presentation at XTech 2005 I've been meaning to write up some Amara equivalents to the examples in the paper, "Comparing XSLT and XQuery". Here they are.

This is not meant to be an advocacy piece, but rather a set of useful examples. I think the Amara examples tend to be easier to follow for typical programmers (although they also expose some things I'd like to improve), but with XSLT and XQuery you get cleaner declarative semantics, and cross-language support.

It is by no means always true that an XSLT stylesheet (whether 1.0 or 2.0) is longer than the equivalent in XQuery. Consider the simple task: create a copy of a document that is identical to the original except that all NOTE attributes are omitted. Here is an XSLT stylesheet that does the job. It's a simple variation on the standard identity template that forms part of every XSLT developer's repertoire:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@* except @NOTE"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

In XQuery, lacking an apply-templates instruction and built-in template rules, the recursive descent has to be programmed by hand:

declare function local:copy($node as element()) {
  element {node-name($node)} {
    ($node/@* except $node/@NOTE,
    for $c in $node/child::node()
    return typeswitch($c) 
      case $e as element() return local:copy($e)
      case $t as text() return $t
      case $c as comment() return $c
      case $p as processing-instruction() return $p
      default return ())
  }
};

local:copy(/*)

Here is Amara code to do the same thing:

def ident_except_note(doc):
    for elem in doc.xml_xpath(u'//*[@NOTE]'):
        del elem.NOTE
    print doc.xml()

Later on in the paper:

...nearly every FLWOR expression has a direct equivalent in XSLT. For example, to take a query from the XMark benchmark:

for    $b in doc("auction.xml")/site/regions//item
let    $k := $b/name
order by $k
return <item name="{$k}">{ $b/location } </item>

is equivalent to the XSLT code:

<xsl:for-each select="doc('auction.xml')/site/regions//item">
  <xsl:sort select="name"/>
  <item name="{name}"
     <xsl:value-of select="location"/>
  </item>
</xsl:for-each>

In Amara:

def sort_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    items = doc.xml_xpath(u'/site/regions//item')
    items.sort()
    for item in items:
        newdoc.xml_append(
            newdoc.xml_element(u'item', content=item)
        )
    print newdoc.xml()

This is the first of a couple of examples from XMark. To understand the examples more fully you might want to browse the paper, "The XML Benchmark Project". This was the first I'd heard of XMark, and it seems a pretty useful benchmarking test case, except that it's very heavy on records-like XML (not much on prosy, narrative documents with mixed content, significant element order, and the like). As such, I think it could only ever be a sliver of one half of any comprehensive benchmarking framework.

I think the main thing this makes me wonder about Amara is whether there is any way to make the element creation API a bit simpler, but that's not a new point for me to ponder, and if I can think of anything nicer, I'll work on it post 1.0.

Kay's paper next takes on a more complex example from XMark: "Q9: List the names of persons and the names of items they bought in Europe". In database terms this is a join across the person, closed_auction and item element sets. In XQuery:

for $p in doc("auction.xml")/site/people/person
let $a := 
   for $t in doc("auction.xml")/site/closed_auctions/closed_auction
   let $n := for $t2 in doc("auction.xml")/site/regions/europe/item
                       where  $t/itemref/@item = $t2/@id
                       return $t2
       where $p/@id = $t/buyer/@person
       return <item> {$n/name} </item>
return <person name="{$p/name}">{ $a }</person>

Mike Kay's XSLT 2.0 equivalent.

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each 
        select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n" 
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()

I think the central loop in this case is much clearer as a Python list comprehension than in either the XQuery or XSLT 2.0 case, but I think Amara suffers a bit from the less literal element creation syntax, and from the need to "cast" to Unicode. I would like to lay out cases where casts from bound XML structures to Unicode make sense, so I can get user feedback and implement accordingly. Kay's final example is as follows.

The following code, for example, replaces the text see [Kay, 93] with see <citation><author>Kay</author><year>93</year></citation>.

<xsl:analyze-string select="$input" regex="\[(.*),(.*)\]">
<xsl:matching-substring>
  <citation>
    <author><xsl:value-of select="regex-group(1)"/></author>
    <year><xsl:value-of select="regex-group(2)"/></year>
  </citation>
</xsl:matching-substring>
<xsl:non-matching-substring>
  <xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>

The only way of achieving this transformation using XQuery 1.0 is to write some fairly convoluted recursive functions.

Here is the Amara version:

import re
from amara import binderytools

doc = binderytools.create_document()
PATTERN = re.compile(r'\[(.*),(.*)\]')
def repl_func(m):
    citation = doc.xml_element(u'citation')
    citation.xml_append(doc.xml_element(u'author', content=m.group(1)))
    citation.xml_append(doc.xml_element(u'year', content=m.group(2)))
    return citation.xml(omitXmlDeclaration=u'yes')
text = u'see [Kay, 93]'
print PATTERN.subn(repl_func, text)

I think this is very smooth, with the only possible rough patch again being the output generation syntax.

I should mention that Amara's output syntax isn't really bad. It's just verbose because of its Python idiom. XQuery and XSLT have the advantage that you can pretty much write XML in-line into the code (the templating approach), whereas Python's syntax doesn't allow for this. There has been a lot of discussion of more literal XML template syntax for Python and other languages, but I tend to think it's not worth it, even considering that it would simplify the XML generation syntax of tools such as Amara. Maybe it would be nice to have a lightweight templating system that allows you to write XSLT template chunks in-line with Amara code for data processing, but then, as with most such templating systems, you run into issues of poor model/presentation separation. Clearly this is a matter for much more pondering.

[Uche Ogbuji]

via Copia

Scattered notes from XTech

XTech 2005. Amsterdam. Lovely time. But first of all, I went for a conference. Edd Dumbill outdid himself this time. The first coup de maître was sculpting the tracks to increase the interdisciplinary energy of the meet. The browser track brought out a lot of new faces and provided a jolt of energy. There did seem to be a bit of a divide between the browser types and the XML types, but only as much as one would expect from the fact that XML types tend to know each other, and ditto browser types. There was plenty of crosstalk between the disciplines as well.

Second touch: focus on open data, and all the excitement in that area (Creative Commons, remixing/mash-ups, picture sharing, multimedia sharing, microformats, Weblogging, content syndication, Semantic technology, podcasting, screencasting, personal information spaces, corporate info spaces, public info spaces, etc.) and watch the BBC take over (with they bad selves). And don't fret: "damn, maybe we should lighten up on the BBC bias in the speakers". No, just go with it. Recognize that they are putting forth great topics, and that everyone is amped about how the BBC is leading the way on so many information technology and policy fronts.

Third touch: foster collaboration. Put up a Wiki, encourage folks to an IRC channel, aggregate people's Weblog postings and snapshots into one place, Planet XTech, and cook up a fun little challenge to go with the theme of open data. For that last bit Edd put out an XML representation of the conference schedule and asked folks to do something cool with it. I didn't do as much with it as I'd hoped. When I finally got my presentation going I used the posted grid.xml as a demo file for playing with Amara, but I wished it had more content, especially mixed content (it's very attribute heavy). I've suggested on the XTech Wiki that if Edd does the same thing next time, that he work in paper abstracts, or something like that, to get in more text content.

I said "When I finally got my presentation going", which hints at the story of my RAI (venue for XTech) jinx. Last year in Amsterdam I couldn't get my Dell 8600 running Fedora Core 3 to agree with the projectors at the RAI. As Elliotte Rusty Harold understates in his notes from the 2004 conference:

After some technical glitches, Uche Ogbuji is talking about XML good practices and antipatterns in a talk entitled "XML Design Principles for Form and Function"

In fact I ended up having to switch to OpenOffice on Windows, and the attendees endured a font only a hippie could love (Apparently Luxi Sans is not installed by default on Windows and OO/Win has a very strange way of finding a substitute). I'm vain enough not to miss quoting another bit about my talk from Elliotte:

A very good talk. I look forward to reading the paper. FYI, he's a wonderful speaker; probably the best I've heard here yet.

Gratifying to know I managed a good talk under pressure. I hope I did so again this time, because the RAI projectors were no more friendly. The topic was "Matching Python idioms to XML idioms". Remembering the last year's headache I asked about a projector to use to test out my presentation (I was on the first day, Weds). Usually conference speakers' rooms have a spare projector for this purpose, but it looks as if the RAI couldn't supply one. I crossed my fingers and arrived for my talk the dutiful 15 minutes early. Eric van der Vlist was up before me in the block. The AV guy came along and they spent quite a while struggling with Eric's laptop (Several speakers had trouble with the RAI projectors). They finally worked out a 640x480 arrangement that caused him to have to pan around his screen. This took a while, and the AV guy bolted right afterward and was not there to help me set up my own laptop. Naturally, neither I nor the very helpful Michel Biezunski (our session chair) were able to get it to work, and we had to turn things over to Eric to start his talk.

We then both went in search of the AV guy, and it took forever to find him. No, they didn't have a spare projector that we could use to set up my laptop in time for my talk. We'd just have to wait for Eric to finish and hope for the best (insert choice sailor's vocabulary here). My time slot came and we spent 20 minutes trying every setting on my laptop and every setting on their projector. The AV guys (yeah, when it was crisis time, they actually found they had more than one) muttered taunts about Linux, and it's a lucky thing I was bent on staying calm. I present quite often, and I do usually have to try out a few settings to get things to work, but in my encounters it's only the RAI projectors that seem completely incapable of projecting from my Linux laptop. In all, I witnessed 4 speakers (3 on Linux and surprisingly one on Mac OS X) who had big problems with the RAI projectors, including one of the keynote speakers. I suspect others had problems as well.

I couldn't take the obvious escape route of borrowing someone else's laptop because the crux of my talk was a demo of Amara and I'd have to install all that as well (Several kind volunteers including Michel had 4Suite installed, but not Amara). After 20 minutes, we agreed that I'd go on with my talk on Michel's computer (Thinkpad running Red Hat 9 and it worked with the projector immediately!), skip the demo, and we'd find another time slot for me to give the entire talk the next day. Quite a few people stuck around through this mess and I'm grateful to them.

The next day we installed Amara on Michel's computer and I gave the presentation in its proper form right after lunch. There was great attendance for this reprise, considering everything. The Amara demo went fine, except that the grid.xml I was using as a sample gave too few opportunities to show off text manipulation. I'll post a bit later on thoughts relating to Amara, stemming from the conference. Norm Walsh was especially kind and encouraging about my presentation woes, and he has also been kind in his notes on XTech 2005:

The presentation [deities] did not smile on Uche Ogbuji. He struggled mightily to get his presentation of Matching Python Idioms to XML Idioms off the ground. In vain, as it turned out (AV problems were all too common for a modern conference center), but he was generous enough to try again the next day and it was worth it (thanks Uche!). I'm slowly becoming a Python convert and some of the excellent work that he's done at Fourthought to provide Python access to standard XML in ways that feel natural in Python is part of the appeal.

That's the precise idea. A tool for processing XML that works for both Python heads and XML heads. The whole point of my presentation was how hard this is to accomplish, and how just about every Python tool (including earlier releases of 4Suite) accommodates one side and not the other. The response to Amara from both the Python heads and XML heads makes me feel I've finally struck the right balance.

I got a lot out of the other XTech talks. Read Norm on the keynotes: he pretty much had the same impressions as I did. Props to Michael Kay for his great presentation comparing XSLT 2.0 and XML Query. I took enough notes at that one for a separate entry, which will follow this one. I missed a lot of the talks between Kay's and my own while I was trying (unsuccessfully) to head off the AV gremlins.

Other talks to highlight: Jon Trowbridge's on Beagle (who, you guessed it, had AV problems that ate up a chunk of his time slot). From the project Wiki:

Beagle is a search tool that ransacks your personal information space to find whatever you're looking for. Beagle can search in many different domains: documents, emails, web history, IM/IRC conversation, source code, images, music files, applications and [much more]

Edd had already introduced me to Beagle, but it was really cool to see it in action. I'll have to check it out. Jon also pointed out TomBoy, "a desktop note-taking application for Linux and Unix. Simple and easy to use, but with potential to help you organize the ideas and information you deal with every day." Two projects I'll have to give a spin. Props to Jon for shrugging off the AV woes and giving a fun and relaxed talk.

Robert O'Callahan's talk on the new canvas tag for Mozilla and Safari was memorable if for nothing else than the experience of surfing Google at a 45° angle, with no apparent loss in snappiness. This canvas thingie looks wicked cool, and it's good to see them working to incorporate SVG. I've heard a lot of grumbling from W3C types about canvas, and all we poor browser users in the middle can hope for is some rapid convergence of cool technologies such as XAML, XUL, canvas, SVG, etc. Others have blogged about the opportunities and anxieties opened up by the WHATWG, which one commentator said should have been the "WHAT Task Force" because "WTF" would have been a better acronym. I'm a neutral in these matters, except that I really do wish browser folks would do what they can to push people along to XHTML 2.0 rather than cooking up HTML 5.0 and such.

Matt Biddulph was one of the BBC Massive on hand, and his talk "The Application of Weblike Design to Data - Designing Data for Reuse" offered a lot of practical tips on how to usefully open up a large body of data from a large organization.

Dominique Hazaël-Massieux gave a talk on GRDDL (O most unfortunate project name), which was my first hearing of the technology. My brief characterization of GRDDL is as an attempt to pull the Wild West ethos of microformats into the rather more controlled sphere of RDF. It touches on topics in which I've been active for years, including tools for mapping XML to RDF. I've argued all these years that RDF folks will have to embrace general XML in place of the RDF/XML vocabulary if they are to make much progress. They will have to foster tools that make extracting RDF model data from XML a no-brainer. It's great to see the W3C finally stirring in this direction. Dom presented very well. I asked about the use of other systems, such as schema annotation, for the XML to RDF mapping. It seemed to me that GRDDL was mostly geared towards XSLT. Dom said it is meant to be independent of the mapping mechanism, but in my post-conference reading I'm not so sure. I'll have to ponder this matter more and perhaps post my thoughts. Dom also mentioned PiggyBank, "the Semantic Web extension for Firefox". Kingsley Idehen has a nice blurb on this software. I do hesitate to use it because someone mentioned to me how PiggyBank had gone into crazy thrash mode at one point. I don't muck with my Firefox set-up lightly.

Rick Jelliffe showed off Topologi's lightweight browser TreeWorld, which is XML-oriented and suitable for embedding into other applications.

Others have blogged Jean Paoli's closing keynote (Leigh and others; see http://glazman.org/weblog/dotclear/index.php?2005/05/29/1059-adam-3). Seems I'm not the only one who was put off by the straight-up product pitch. At least he did a bit of a service by clearly saying "Binary XML: No please". Check out more quotes from XTech.

The conference was superb. Do be sure not to miss it next year. It's looking like Amsterdam will be the venue again. And what of Amsterdam? Besides the conference I had a great time with friends. I'll post on that later.

For the most comprehensive report I've seen to date, see Micah Dubinko's article.

[Uche Ogbuji]

via Copia

Off to Amsterdam (XTech), and a note about comments on Copia

Well, I'm heading off to catch the flight to Amsterdam for XTech 2005. I'll blog as much as I can, and I have some FOSS work to do as well, on Amara, especially, to prep the 1.0b2 release.

We've had the spam comment folks doing their thing here, and so far I've been able to keep them mostly in check by deleting them soon after they appear. The trip will probably leave too big a hole for them, so for now I've turned on draft mode for comments. All comments will be held until explicitly approved. I apologize for any inconvenience. I've been tinkering on a more solid spam fighting system, building on the great work others have done on black-listing the punks.

[Uche Ogbuji]

via Copia