Copia

Another 4Suite sighting

"pod.py - A minimal cross-platform “podcatcher”"—Randi Mooney

I’ve been listening to a lot of podcasts recently[...]. The standard podcast reciever is iPodder, a very feature rich program that is just too bloated for my needs: I want a cross platform downloader that can be scheduled from UNIX cron and works from the command line.

I went hunting for an alternative client[...]. Of course, what I really wanted was a Python based podcast reciever.

So I created pod.py - a pure Python Podcast reciever. It depends on the excellent 4suite XML processing library to do all the hard XML processing[...].

[Uche Ogbuji]

via Copia

Python/XML community: Amara, lxml and Picket

Amara XML Toolkit 1.0b3
lxml 0.7
Picket 0.4

Amara XML Toolkit 1.0b3 "is a collection of Python tools for XML processing—not just tools that happen to be written in Python, but tools built from the ground up to use Python idioms and take advantage of the many advantages of Python. Amara builds on 4Suite [http://4Suite.org], but whereas 4Suite focuses more on literal implementation of XML standards in Python, Amara focuses on Pythonic idiom." In this release:

Add xml_set_attribute method to elements, in order to allow adding attributes with namespaces or with illegal Python names
Update manual source for markdown, and extensive improvements to the manual (with much help from Jamie Norrish)
Add xml_doc facility for nodes
Fix support for output parameters in xml()
Add support for rules to pushbind
Improve XSLT support for bindery objects (see demo/bindery/xslt.py)
Bug fixes

lxml 0.7 is an alternative, more Pythonic binding for the libxml2 and libxslt XML processing libraries. Martijn Faassen says "lxml 0.7 is a release with quite a few new features and bug fixes, including XPath expression parameters, XInclude support, XMLSchema validation support, more namespace prefix support, better encoding support, and more."

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update is mostly in order to support CherryPy development snapshots that are soon to become CherryPy 2.1. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

[Uche Ogbuji]

via Copia

Running multiple python versions in home directory install

I've had to do this several times in order to install 4Suite into a locally installed Python directory, either on a server where I didn't have superuser access or on one where packages were managed with a utility that doesn't handle Python packages well (yum, and sometimes apt). I had to dig into the old akara IRC logs for these tidbits.

uche_: Here are the contents of ~jkloth/bin/python2.4 on Jeremy's computer:

#!/bin/bash
export PYTHONPATH=$HOME/lib/python2.4
exec /usr/bin/python2.4 "$@"

uche_: This assumes $HOME/bin is on your path, ahead of wherever your python is installed

chimezie@WuChi 4Suite $ $HOME/bin/python2.4 setup.py config --home=$HOME
chimezie@WuChi 4Suite $ $HOME/bin/python2.4 setup.py install
chimezie@WuChi 4Suite $ which 4ss_manager
/home/chimezie/bin/4ss_manager

chimezie: In order to run 4Suite scripts in this isolated environment, you need to execute them with the $HOME/bin/python2.4 wrapper explicitly

chimezie@WuChi devel $ $HOME/bin/python2.4 /home/chimezie/bin/4ss_manager start -n  -u <user> -p <password>
.. snip log ..
Apr 10 18:31:05 Controller: [notice] Controller configured -- resuming normal operations

[Uche Ogbuji]

via Copia

Sane template-like output for Amara

In an earlier entry I showed some Amara equivalents for XSLT 2 and XQuery examples. I think the main disadvantage of Amara in these cases was the somewhat clumsy XML output generation syntax. This is not an easy problem to fix. XSLT and XQuery basically work XML syntax directly into the language, to make output specification very seamless. This makes sense as long as they stick to the task of being very determinate black boxes taking one body of XML data and working it into another. But often you turn to a language like Python for XML processing because you want to blow the lid off the determinate black boxes a bit: you want to take up all the power of general-purpose computing.

With this power comes the need to streamline and modularize, and the usual first principle for such streamlining is the principle of separating the model from presentation. This is a much easier principle to state than to observe in real-life processing scenarios. We love template languages for XML and HTML generation because they are so convenient in solving real problems in the here and now. We look askance at them, however, because we know that they come with a tendency to mix model and presentation, and that we might regret the solution once it comes time to maintain it when (as is inevitable) model processing requirements change or presentation requirements change.

Well, that was a longer preamble than I'd originally had in mind, but it all boils down to my basic problem: how do I make Amara's output mechanism more readable without falling into the many pitfalls of template systems?

Here is one of the XSLT 2 examples:

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n"
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/@person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara 1.0b3 it goes something like:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath (u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()

The following snippet is a good example of the clumsy output syntax:

person_elem = newdoc.xml_element(
    u'person',
    attributes={u'name': unicode(person.name)}
)
newdoc.xml_append(person_elem)

If I could turn all this into:

newdoc.xml_append_template("<person name='{person.name}'/>")

This would certainly be a huge win for readability. The curly brackets are borrowed from XSLT attribute value templates (AVTs), except that their contents are a Python expression rather than an XPath. The person element created is empty for now, but it becomes just part of the data binding and you can access it using the expected newdoc.person or newdoc.person.name.

One important note: this is very different from `"<person name='%s'/>" % (person.name)`. What I have in mind is a structured template that must be well-formed (it can have multiple root elements). The replacement occurs within the perfectly well-formed XML structure of the template. As with XSLT AVTs you can represent a literal curly bracket as {{ or }}.
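To make the contrast concrete, here is a small sketch in present-day Python 3. It uses the standard library's xml.etree.ElementTree rather than Amara, which is a substitution made purely for illustration, and the sample value is hypothetical. Plain string interpolation pastes the value into the markup and can break well-formedness, whereas a substitution made on the parsed structure is escaped on output.

```python
import xml.etree.ElementTree as ET

# A hypothetical value containing characters that are special in XML.
name = "Tom & Jerry's <shop>"

# Plain string interpolation: the value is pasted into the markup as-is.
naive = "<person name='%s'/>" % name
try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    # The stray &, ' and < make the result ill-formed.
    naive_ok = False

# Structured substitution: parse the template first, then set the value
# on the attribute node, so the serializer takes care of escaping.
elem = ET.fromstring("<person name=''/>")
elem.set("name", name)
structured = ET.tostring(elem, encoding="unicode")

# Parsing the serialized result gives back exactly the original value.
roundtrip = ET.fromstring(structured).get("name")
```

The naive string fails to parse, while the structured version round-trips the original value unchanged.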

The other output generation part in the example:

for item in items:
    person_elem.xml_append(
        newdoc.xml_element(u'item', content=item)
    )

Would become

for item in items:
    newdoc.person.xml_append_template("<item>{item}</item>")

This time we have the template substitution going on in the content rather than an attribute. Again I would want to restrict this entire idea to a very clean and layered template with proper XML semantics. There would be no tricks such as "<{element_name}>spam</{element_name}>". If you wanted that sort of thing you could use the existing API such as xml_element(element_name), or even use Python string operations directly.

The complete example using such a templating system would be:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        newdoc.xml_append_template("<person name='{person.name}'/>")
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath (u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            newdoc.person.xml_append_template("<item>{item}</item>")
    #All done.  Print out the resulting document
    print newdoc.xml()

I think that this example is indeed more readable than the XSLT version.

One tempting thing about this idea is that all the building blocks are there. 4Suite already gives me the ability to parse and process this template very easily, and I could implement this logic without much trouble. But I also think that it deserves some serious thought (and, I hope, feedback from users). There's no hurry: I don't plan to add this capability in the Amara 1.0 cycle. I need to get Amara 1.0 out, and I'm expecting that 1.0b3 is the last stop before a release candidate.

So, things to ponder.

Security. Any time you support such arbitrary-code-in-template features the tainted string worry comes up: what happens if one is not careful with the expression that is used within a template? I think that this issue is not really Amara's responsibility. The developer using Amara should no more pass in untrusted Python expressions to a template than they would to an exec statement. They should be aware that Amara templates will execute arbitrary Python expressions, if they're passed in, and they should apply the usual precautions against tainting.

String or Unicode? Should the templates be specified as strings or Unicode? They are themselves well-formed XML, which makes me think they should be strings (XML is really defined in terms of encoded serialization, and the Unicode backbone is just an abstraction imposed on the actual encoded byte stream). But is this confusing to users? I've always preached that XML APIs should use Unicode, and my products reflect that, and for a user that doesn't understand the nuances, this could seem like a confusing exception. Then again, we already have this exception for 4Suite and Amara APIs that parse XML from strings. My leaning would be to have the template expressed as a string, but to have the results of expressions within templates coerced to Unicode. This is the right thing to do, and that's the strongest argument.

Separation of model and presentation. The age-old question with such templates is whether they cause tangles that complicate maintenance. I think one can often make an empirical check for such problems by imagining what happens in a scenario where the data model operations need to change, and another scenario where the presentation needs to change.

As an example of a model change, imagine that the source for the item info was moved from an XML document to a database. I wouldn't need to change any of the templates as long as I could get the same values to pass in, and I think it's reasonable to assume I could do this. Basically, since my templates simply refer to host variables whose computation is nicely decoupled from the template code, the system passes the first test.

As an example of a presentation change, imagine that I now want to generate XHTML directly, rather than this <person><item>... business. I think the system passes this test as well. The templates themselves would have to change, but this change would be isolated from the computation of the host variables used by the templates. Some people might argue that I'm grading these tests too leniently, and that it's already problematic that the computation and presentation occurs so close together, in the same function in the same code file. I'm open to being convinced this is the case, but I'd want to hear of practical maintenance scenarios where this would be a definite problem.

So what do you think?

[Uche Ogbuji]

via Copia

Versa Diagrams

I updated the Versa by Deconstruction document with diagrams for the forward traversal operator and the distribute function (probably the two most difficult / fundamental components of Versa). They are both below:

Distribute - as Cartesian Product

[Uche Ogbuji]

via Copia

We need more solid guidelines for i18n in OSS projects

Every time I'm about to tackle i18n in some project, I find myself filled with a bit of trepidation. I actually know more than the average developer on the topic, but it's a hard topic, and there is quite the mystery around it.

One of the reasons for this mystery is that there are really few good resources to guide FOSS developers through the process of internationalizing their apps. Gettext is the most common means for i18n and l10n in FOSS apps. The master resource for this is the GNU gettext manual, but that's a lot to bite off. Poking around on Google, I found a useful gettext overview for PHP folks, the Ruby on Rails folks have a nice chapter on it, and there are a few other general intros (1, 2, 3). Luis Miguel Morillas pointed out some for GNOME, KDE and wxPython (1, 2).

Python has the usual set of gettext-based facilities, which is great, but they are woefully documented. In the library reference's section on gettext you get a purported overview and a bunch of scattered notes on the API, but nothing that really coherently leads the developer through the concepts and process of i18n, as the PHP and Ruby folks seem to have. It doesn't even seem as if buying a book helps much. The books I most often recommend as Python intros don't seem to touch the subject, and the reference books seem to go not much deeper than the Python docs. Even David Mertz's useful Text Processing in Python doesn't cover i18n (this surprised me).

My recent foray into i18n actually straddled Python and XML worlds. For XMLers, there are a few handy resources:

XLIFF is of particular interest, but I decided not to use it for i18n in 4Suite/XSLT because I wanted to base my work on the Python facilities, which are well tested and based on the de facto standard gettext.

Anyway, am I missing something? Are there all sorts of great resources out there that would slide Python developers right into the i18n groove?

[Uche Ogbuji]

via Copia

i18n for XSLT in 4Suite

Prodded by discussion on the CherryPy list I have implemented and checked in a 4Suite XSLT extension for internationalization using Python's gettext facilities for the underlying support. Here is how it works. Sample XML:

<doc>
  <msg>hello</msg>
  <msg>goodbye</msg>
</doc>

Sample XSLT:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:f="http://xmlns.4suite.org/ext"
  extension-element-prefixes="f"
>
  <f:setup-translations domain="test"/>
  <xsl:template match="msg">
    <f:gettext><xsl:apply-templates/></f:gettext>
  </xsl:template>
</xsl:stylesheet>

The f:setup-translations and f:gettext extension elements are the key. The former looks up and installs the domain "test" for use in your XSLT. Replace it with the domain used by your application. The latter extension evaluates its body to get a string value, and then looks up this string in the installed translation.

Assume you have a test.mo installed in the right place, one that translates "hello" to "howdy" and "goodbye" to "so long". Then:

$ 4xslt test.xml test.xsl
<?xml version="1.0" encoding="UTF-8"?>
howdy
  so long

I trimmed some white space for formatting, but you get the idea. The translations are applied automatically according to your locale.
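The lookup that f:gettext performs can be imitated in a few lines of present-day Python 3. This is only a sketch under stated assumptions, not the code checked into 4Suite: DictTranslations and translate_messages are hypothetical names, the in-memory dict stands in for a compiled test.mo, and the third message, thanks, is added here just to show what happens to a string with no translation.

```python
import gettext
import xml.etree.ElementTree as ET


class DictTranslations(gettext.NullTranslations):
    """In-memory stand-in for a compiled test.mo catalog."""

    def __init__(self, catalog):
        super().__init__()
        self._catalog = catalog

    def gettext(self, message):
        # Unknown messages fall through untranslated, as with gettext.
        return self._catalog.get(message, message)


def translate_messages(xml_text, translations):
    """Return the translated string value of every msg element."""
    root = ET.fromstring(xml_text)
    return [translations.gettext("".join(msg.itertext()))
            for msg in root.iter("msg")]


catalog = DictTranslations({"hello": "howdy", "goodbye": "so long"})
source = "<doc><msg>hello</msg><msg>goodbye</msg><msg>thanks</msg></doc>"
result = translate_messages(source, catalog)
```

Each msg body is reduced to its string value and looked up as a whole, so a message with no catalog entry comes back unchanged.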

This operates via Python's gettext facilities, which means that it's much more efficient than, say, the DocBook XSLT approach to i18n.

For those who want to give it a whirl, here's a quick step-by-step. All the needed files are available here on copia.

Create a sandbox locale directory:

mkdir -p /tmp/locale/en_US/LC_MESSAGES/

Copy in the catalog. You may need to create a different catalog for your own language if your system will not be selecting en_US as locale (remember that you can hack the locale via the environment)

cp en_US.mo /tmp/locale/en_US/LC_MESSAGES/test.mo

Your locale is probably not en_US. If not, you can:

temporarily override your locale to en_US using export LANG=en_US, or the equivalent command for your shell
create translations for your locale (just two strings to translate). I use poedit, which makes dealing with .po files simple enough. Then replace en_US in all the above instructions with your own locale and the .mo file you created.
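For readers unsure where Python expects a catalog to live, the sketch below (Python 3, standard library only) shows the path that the gettext module derives from a locale directory, a language and a domain. describe_lookup is a hypothetical helper, and passing the directory explicitly is an assumption made for illustration; it says nothing about how the 4Suite extension itself locates catalogs.

```python
import gettext
import os
import tempfile


def describe_lookup(domain, localedir, language):
    """Show where gettext looks for a catalog and what it finds there."""
    # The conventional layout: <localedir>/<language>/LC_MESSAGES/<domain>.mo
    expected = os.path.join(localedir, language, "LC_MESSAGES",
                            domain + ".mo")
    # find() returns the path of the first matching .mo file, or None.
    found = gettext.find(domain, localedir=localedir, languages=[language])
    # With fallback=True a missing catalog yields a pass-through object
    # instead of raising an error.
    translations = gettext.translation(domain, localedir=localedir,
                                       languages=[language], fallback=True)
    return expected, found, translations


# An empty temporary directory plays the role of the sandbox locale dir.
sandbox = tempfile.mkdtemp()
expected, found, translations = describe_lookup("test", sandbox, "en_US")
```

With no test.mo copied into place, found is None and translations.gettext("hello") simply returns "hello"; once the catalog sits at the expected path, the same call returns its translation.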

Anyway, the f:setup-translations and f:gettext extensions are now checked into 4Suite. You can either update to current 4Suite CVS, or just download the one changed file, Ft/Xml/Xslt/BuiltInExtElements.py and copy it into your 4Suite codebase. It works fine as a drop-in to 4Suite 1.0b1.

[Uche Ogbuji]

via Copia

Pythonic SPARQL API over rdflib

I've recently been investigating the possibility of adapting an existing SPARQL parser/query engine on top of 4RDF, mostly for the eventual purpose of implementing a sparql-eval Versa extension function, and was pleased to see that some similar work has already been done:

SPARQL in RDFLib V2 (usage)
rdflibUtils python package (the implementation)

Although this isn't exactly what I had in mind (the more robust option would be to write an adaptor for Redland's model API and execute SPARQL queries via rasqal), it provides an interesting Pythonic analog to querying RDF.

Chimezie Ogbuji

via Copia

Amara equivalents of Mike Kay's XSLT 2.0, XQuery examples

Since seeing Mike Kay's presentation at XTech 2005 I've been meaning to write up some Amara equivalents to the examples in the paper, "Comparing XSLT and XQuery". Here they are.

This is not meant to be an advocacy piece, but rather a set of useful examples. I think the Amara examples tend to be easier to follow for typical programmers (although they also expose some things I'd like to improve), but with XSLT and XQuery you get cleaner declarative semantics, and cross-language support.

It is by no means always true that an XSLT stylesheet (whether 1.0 or 2.0) is longer than the equivalent in XQuery. Consider the simple task: create a copy of a document that is identical to the original except that all NOTE attributes are omitted. Here is an XSLT stylesheet that does the job. It's a simple variation on the standard identity template that forms part of every XSLT developer's repertoire:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@* except @NOTE"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

In XQuery, lacking an apply-templates instruction and built-in template rules, the recursive descent has to be programmed by hand:

declare function local:copy($node as element()) {
  element {node-name($node)} {
    ($node/@* except $node/@NOTE,
    for $c in $node/child::node()
    return typeswitch($c)
      case $e as element() return local:copy($e)
      case $t as text() return $t
      case $c as comment() return $c
      case $p as processing-instruction() return $p
      default return ())
  }
};

local:copy(/*)

Here is Amara code to do the same thing:

def ident_except_note(doc):
    for elem in doc.xml_xpath(u'//*[@NOTE]'):
        del elem.NOTE
    print doc.xml()
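For readers without Amara installed, roughly the same operation can be sketched in present-day Python 3 with the standard library's xml.etree.ElementTree. This is an illustrative equivalent rather than Amara code, and strip_note is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def strip_note(xml_text):
    """Copy a document, omitting every NOTE attribute."""
    root = ET.fromstring(xml_text)
    # iter() visits the root element and all of its descendants.
    for elem in root.iter():
        # pop() with a default does nothing where NOTE is absent.
        elem.attrib.pop("NOTE", None)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_note('<a NOTE="x"><b NOTE="y" id="1">t</b></a>')
```

As in the Amara version, the tree is modified in place and then serialized, instead of being rebuilt node by node as in the XSLT and XQuery versions.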

Later on in the paper:

...nearly every FLWOR expression has a direct equivalent in XSLT. For example, to take a query from the XMark benchmark:

for    $b in doc("auction.xml")/site/regions//item
let    $k := $b/name
order by $k
return <item name="{$k}">{ $b/location } </item>

is equivalent to the XSLT code:

<xsl:for-each select="doc('auction.xml')/site/regions//item">
  <xsl:sort select="name"/>
  <item name="{name}">
     <xsl:value-of select="location"/>
  </item>
</xsl:for-each>

In Amara:

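def sort_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    items = doc.xml_xpath(u'/site/regions//item')
    #Sort the items by the string value of their name child
    items.sort(key=lambda item: unicode(item.name))
    for item in items:
        newdoc.xml_append(
            newdoc.xml_element(
                u'item',
                attributes={u'name': unicode(item.name)},
                content=unicode(item.location)
            )
        )
    #All done.  Print out the resulting document
    print newdoc.xml()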

This is the first of a couple of examples from XMark. To understand the examples more fully you might want to browse the paper, "The XML Benchmark Project". This was the first I'd heard of XMark, and it seems a pretty useful benchmarking test case, except that it's very heavy on records-like XML (not much on prosy, narrative documents with mixed content, significant element order, and the like). As such, I think it could only ever be a sliver of one half of any comprehensive benchmarking framework.

I think the main thing this makes me wonder about Amara is whether there is any way to make the element creation API a bit simpler, but that's not a new point for me to ponder, and if I can think of anything nicer, I'll work on it post 1.0.

Kay's paper next takes on a more complex example from XMark: "Q9: List the names of persons and the names of items they bought in Europe". In database terms this is a join across the person, closed_auction and item element sets. In XQuery:

for $p in doc("auction.xml")/site/people/person
let $a := 
   for $t in doc("auction.xml")/site/closed_auctions/closed_auction
   let $n := for $t2 in doc("auction.xml")/site/regions/europe/item
                       where  $t/itemref/@item = $t2/@id
                       return $t2
       where $p/@id = $t/buyer/@person
       return <item> {$n/name} </item>
return <person name="{$p/name}">{ $a }</person>

Mike Kay's XSLT 2.0 equivalent.

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each
        select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n"
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/@person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()
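The same join can be sketched in present-day Python 3 with only the standard library. This is an illustrative variant rather than Amara code: items_bought_in_europe is a hypothetical name, and the tiny AUCTION document is invented sample data that merely mirrors the element paths used above. Instead of the nested comprehension it builds a dictionary index of European items first, a design choice that avoids rescanning the item list for every closed auction.

```python
import xml.etree.ElementTree as ET

# Invented sample data that mirrors the paths used in the XMark query.
AUCTION = """<site>
  <regions><europe>
    <item id="i1"><name>clock</name></item>
    <item id="i2"><name>vase</name></item>
  </europe></regions>
  <people>
    <person id="p1"><name>Ann</name></person>
    <person id="p2"><name>Bob</name></person>
  </people>
  <closed_auctions>
    <closed_auction><buyer person="p1"/><itemref item="i2"/></closed_auction>
    <closed_auction><buyer person="p1"/><itemref item="i1"/></closed_auction>
  </closed_auctions>
</site>"""


def items_bought_in_europe(site):
    """Build <person name=...><item>...</item></person> for each person."""
    # Index the European items by id so each auction is a single lookup.
    names_by_id = {item.get("id"): item.findtext("name")
                   for item in site.findall("regions/europe/item")}
    # Group the purchased item names by buyer id.
    bought = {}
    for closed in site.findall("closed_auctions/closed_auction"):
        buyer = closed.find("buyer").get("person")
        item_id = closed.find("itemref").get("item")
        if item_id in names_by_id:
            bought.setdefault(buyer, []).append(names_by_id[item_id])
    # Emit one person element per person, in document order.
    result = ET.Element("result")
    for person in site.findall("people/person"):
        person_elem = ET.SubElement(result, "person",
                                    name=person.findtext("name"))
        for name in bought.get(person.get("id"), []):
            ET.SubElement(person_elem, "item").text = name
    return result


result = items_bought_in_europe(ET.fromstring(AUCTION))
summary = [(p.get("name"), [i.text for i in p]) for p in result]
```

A person who bought nothing still gets an empty person element, which matches the behavior of the XQuery, XSLT and Amara versions.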

I think the central loop in this case is much clearer as a Python list comprehension than in either the XQuery or XSLT 2.0 case, but I think Amara suffers a bit from the less literal element creation syntax, and for the need to "cast" to Unicode. I would like to lay out cases where casts from bound XML structures to Unicode make sense, so I can get user feedback and implement accordingly. Kay's final example is as follows.

The following code, for example, replaces the text see [Kay, 93] with see <citation><author>Kay</author><year>93</year></citation>.

<xsl:analyze-string select="$input" regex="\[(.*),(.*)\]">
<xsl:matching-substring>
  <citation>
    <author><xsl:value-of select="regex-group(1)"/></author>
    <year><xsl:value-of select="regex-group(2)"/></year>
  </citation>
</xsl:matching-substring>
<xsl:non-matching-substring>
  <xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>

The only way of achieving this transformation using XQuery 1.0 is to write some fairly convoluted recursive functions.

Here is the Amara version:

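import re
from amara import binderytools

#A scratch document that acts as a factory for the new elements
doc = binderytools.create_document()
#The square brackets must be escaped to match literally
PATTERN = re.compile(r'\[(.*),(.*)\]')
def repl_func(m):
    citation = doc.xml_element(u'citation')
    citation.xml_append(doc.xml_element(u'author', content=m.group(1)))
    citation.xml_append(doc.xml_element(u'year', content=m.group(2)))
    return citation.xml(omitXmlDeclaration=u'yes')
text = u'see [Kay, 93]'
#sub returns just the new string; subn would return a (string, count) tuple
print PATTERN.sub(repl_func, text)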
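A standard-library-only sketch of the same idea in present-day Python 3 follows. It is not Amara code: mark_citations is a hypothetical name, and the pattern is deliberately tighter than the one in Kay's example, an assumption made so that two citations in one string are matched separately and the space after the comma is dropped. Like xsl:non-matching-substring with xsl:value-of, it also escapes the text outside the citations.

```python
import re
from xml.sax.saxutils import escape

# [author, year] with no comma in the author and no ] in either part.
PATTERN = re.compile(r"\[([^,\]]*),\s*([^\]]*)\]")


def mark_citations(text):
    """Wrap each [author, year] in citation markup, escaping the rest."""
    pieces = []
    last = 0
    for match in PATTERN.finditer(text):
        # Non-matching text before this citation, escaped for XML.
        pieces.append(escape(text[last:match.start()]))
        author, year = (escape(part.strip()) for part in match.groups())
        pieces.append("<citation><author>%s</author><year>%s</year>"
                      "</citation>" % (author, year))
        last = match.end()
    # Whatever follows the final citation.
    pieces.append(escape(text[last:]))
    return "".join(pieces)


marked = mark_citations("see [Kay, 93]")
```

Building the result from escaped pieces keeps the output a well-formed fragment even when the surrounding text contains & or <.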

I think this is very smooth, with the only possible rough patch again being the output generation syntax.

I should mention that Amara's output syntax isn't really bad. It's just verbose because of its Python idiom. XQuery and XSLT have the advantage that you can pretty much write XML in-line into the code (the templating approach), whereas Python's syntax doesn't allow for this. There has been a lot of discussion of more literal XML template syntax for Python and other languages, but I tend to think it's not worth it, even considering that it would simplify the XML generation syntax of tools such as Amara. Maybe it would be nice to have a lightweight templating system that allows you to write XSLT template chunks in-line with Amara code for data processing, but then, as with most such templating systems, you run into issues of poor model/presentation separation. Clearly this is a matter for much more pondering.

[Uche Ogbuji]

via Copia

Versa by Deconstruction

I was recently compelled to write an introductory companion to the Versa specification. The document (located here) is aimed at readers with little to no experience with formal language specifications and/or with the RDF data model. It is inspired by its predecessors (which make good follow-up material):

I initially started using Open Office Writer to compose an Open Office Document and export it to an HTML document. But I eventually decided to write it in MarkDown and use pymarkdown to render it to an HTML document stored on Copia.

The original MarkDown source is here.

-- Chimezie

[Uche Ogbuji]

via Copia
