The Earliest Juncture of Semiotics and Mathematics

The Trigrams and My Interest

My interest in the trigrams of the very ancient Yijing is mostly scholastic. It's the coherent set of philosophies (or canon) derived from these trigrams, and what amounts to a mathematical interpretation of everything, that has had a more concrete effect on how I go about my life and how I deal with adversity.

The trigrams are many things, but their most interesting characteristics (from a secular point of view) are their direct analogy to the binary numerical system, as well as the fact that they (undisputedly) represent the earliest coherent example of humankind's study of semiotics:

the philosophical theory of the functions of signs and symbols

The Infinite Characteristics of the Trigrams

The first (and less emphasized) of these two characteristics of the trigrams was formally observed by the German mathematician Gottfried Wilhelm Leibniz (though the original observation is probably as old as the purported author of the trigrams: FuXi). Leibniz is the creator of the modern binary system of counting, which is the primary framework upon which microprocessor design is based (an important historical irony).
He noticed that the concept of duality/balance evident in the trigrams' source, as well as in the derived philosophies, is directly analogous to the binary system when you substitute 0 for broken lines (yin, the concept of no motion) and 1 for unbroken lines (yang, the concept of motion / kinetic energy).

The trigrams are meant to be interpreted from the bottom up, so a continuation of this binary analog would have the reader tip the trigrams over to their right side and read them as binary numbers.

The Binary Analog of the Primary Gua

Below is the original horizontal arrangement of the trigrams with their corresponding binary numbers (click on each to view the corresponding SVG diagram):

Earth - 000 Mountain - 001 Water - 010 Wind - 011 Thunder - 100 Fire - 101 Lake - 110 Heaven - 111
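The analogy is mechanical enough to sketch in a few lines of Python. This is my own illustration of the mapping described above (the names and data structures are mine, not part of any Yijing software): yang becomes 1, yin becomes 0, and the bottom line, read first, becomes the most significant binary digit.

```python
# Illustrative sketch of the trigram/binary analogy:
# yang (unbroken line) -> 1, yin (broken line) -> 0,
# lines listed bottom-to-top per the bottom-up reading convention.
YANG, YIN = 1, 0

TRIGRAMS = [
    ('Earth',    (YIN, YIN, YIN)),
    ('Mountain', (YIN, YIN, YANG)),
    ('Water',    (YIN, YANG, YIN)),
    ('Wind',     (YIN, YANG, YANG)),
    ('Thunder',  (YANG, YIN, YIN)),
    ('Fire',     (YANG, YIN, YANG)),
    ('Lake',     (YANG, YANG, YIN)),
    ('Heaven',   (YANG, YANG, YANG)),
]

def trigram_value(lines):
    # "Tip the trigram onto its right side": the bottom line becomes
    # the most significant binary digit.
    value = 0
    for line in lines:
        value = (value << 1) | line
    return value

for name, lines in TRIGRAMS:
    print(name, format(trigram_value(lines), '03b'))
```

Running this reproduces the horizontal arrangement, Earth (000) through Heaven (111), and extending the tuples to six lines gives the 64 hexagrams the same way.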

Extension to the 64 Hexagrams of the Yijing

Since the 8 primary gua are the building blocks upon which the 64 symbols of the Yijing are built (and, purportedly, everything else), this binary analogy can be extended to all 64 symbols. This is well known amongst scholars of the Yijing, and below is the most famous diagram of this extension, by Shao Yong (1011-1077 AD):

Shao Yong's Diagram

The numerical significance of the trigrams in sequence is well summarized here. This page also includes a very useful animated image of the entire sequence as a binary progression:

FuXi Sequence

The most complete resource on the subject (that I've read so far) is Alfred Huang's The Numerology of the I Ching (ISBN: 0-89281-811-5).

I was unable to embed the SVG diagrams within the page, which is a shame because the Yijing trigrams are an excellent SVG use case. I hope someday to capture all 64 as SVG diagrams so the various, more popular philosophical/visual arrangements can be rendered programmatically. Imagine Shao Yong's circular diagram as SVG (talk about an interesting combination of ancient numerology and modern vector graphics technology). It would prove quite a useful tool for avid students of the Yijing symbols, as well as make for some very interesting patterns.

[Chimezie Ogbuji]

via Copia

Sane template-like output for Amara

In an earlier entry I showed some Amara equivalents for XSLT 2 and XQuery examples. I think the main disadvantage of Amara in these cases was the somewhat clumsy XML output generation syntax. This is not an easy problem to fix. XSLT and XQuery basically work XML syntax directly into the language, to make output specification very seamless. This makes sense as long as they stick to the task of being very determinate black boxes taking one body of XML data and working it into another. But often you turn to a language like Python for XML processing because you want to blow the lid off the determinate black boxes a bit: you want to take up all the power of general-purpose computing.

With this power comes the need to streamline and modularize, and the usual first principle for such streamlining is the principle of separating the model from presentation. This is a much easier principle to state than to observe in real-life processing scenarios. We love template languages for XML and HTML generation because they are so convenient in solving real problems in the here and now. We look askance at them, however, because we know that they come with a tendency to mix model and presentation, and that we might regret the solution once it comes time to maintain it, when (as is inevitable) model processing requirements or presentation requirements change.

Well, that was a longer preamble than I'd originally had in mind, but it all boils down to my basic problem: how do I make Amara's output mechanism more readable without falling into the many pitfalls of template systems?

Here is one of the XSLT 2 examples:

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n" 
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/@person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara 1.0b3 it goes something like:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()

The following snippet is a good example of the current clumsiness:

person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)

If I could turn all this into:

newdoc.xml_append_template("<person name='{person.name}'/>")

This would certainly be a huge win for readability. The curly brackets are borrowed from XSLT attribute value templates (AVTs), except that their contents are a Python expression rather than an XPath. The person element created is empty for now, but it becomes just part of the data binding and you can access it using the expected newdoc.person or newdoc.person.name.

One important note: this is very different from `"<person name='%s'/>" % (person.name)`. What I have in mind is a structured template that must be well-formed (though, like a document fragment, it may have multiple root elements). The replacement occurs within the perfectly well-formed XML structure of the template. As with XSLT AVTs, you can represent a literal curly bracket as {{ or }}.
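To make the idea concrete, here is a rough, hypothetical sketch of such substitution in plain Python. This is not Amara's API: `expand_template` and its behavior are my illustration of the proposal, a real implementation would substitute within the parsed XML structure rather than the raw string, and it would XML-escape the substituted values.

```python
import re
import xml.etree.ElementTree as ET

def expand_template(template, namespace):
    """Replace {expr} with the value of the Python expression expr;
    {{ and }} stand for literal curly brackets (as in XSLT AVTs)."""
    def repl(m):
        tok = m.group(0)
        if tok == '{{':
            return '{'
        if tok == '}}':
            return '}'
        # A real implementation would XML-escape this result
        return str(eval(m.group(1), namespace))
    return re.sub(r'\{\{|\}\}|\{([^{}]+)\}', repl, template)

class Person(object):
    name = 'Kay'

expanded = expand_template("<person name='{p.name}'/>", {'p': Person()})
ET.fromstring(expanded)  # the expanded template must still be well-formed
print(expanded)
```

The point of the final `fromstring` check is the "structured template" constraint above: the result is parsed XML, not just interpolated text.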

The other output generation part in the example:

for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )

Would become

for item in items:
            newdoc.person.xml_append_template("<item>{item}</item>")

This time we have the template substitution going on in the content rather than an attribute. Again I would want to restrict this entire idea to a very clean and layered template with proper XML semantics. There would be no tricks such as "<{element_name}>spam</{element_name}>". If you wanted that sort of thing you could use the existing API such as xml_element(element_name), or even use Python string operations directly.

The complete example using such a templating system would be:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        newdoc.xml_append_template("<person name='{person.name}'/>")
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath (u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            newdoc.person.xml_append_template("<item>{item}</item>")
    #All done.  Print out the resulting document
    print newdoc.xml()

I think that this example is indeed more readable than the XSLT version.

One tempting thing about this idea is that all the building blocks are there. 4Suite already gives me the ability to parse and process this template very easily, and I could implement this logic without much trouble. But I also think that it deserves some serious thought (and, I hope, feedback from users). There's no hurry: I don't plan to add this capability in the Amara 1.0 cycle. I need to get Amara 1.0 out, and I'm expecting that 1.0b3 is the last stop before a release candidate.

So, things to ponder.

Security. Any time you support such arbitrary-code-in-template features the tainted string worry comes up: what happens if one is not careful with the expression that is used within a template? I think that this issue is not really Amara's responsibility. The developer using Amara should no more pass in untrusted Python expressions to a template than they would to an exec statement. They should be aware that Amara templates will execute arbitrary Python expressions, if they're passed in, and they should apply the usual precautions against tainting.

String or Unicode? Should the templates be specified as strings or Unicode? They are themselves well-formed XML, which makes me think they should be strings (XML is really defined in terms of encoded serialization, and the Unicode backbone is just an abstraction imposed on the actual encoded byte stream). But is this confusing to users? I've always preached that XML APIs should use Unicode, and my products reflect that, and for a user that doesn't understand the nuances, this could seem like a confusing exception. Then again, we already have this exception for 4Suite and Amara APIs that parse XML from strings. My leaning would be to have the template expressed as a string, but to have the results of expressions within templates coerced to Unicode. This is the right thing to do, and that's the strongest argument.

Separation of model and presentation. The age-old question with such templates is whether they cause tangles that complicate maintenance. I think one can often make an empirical check for such problems by imagining what happens in a scenario where the data model operations need to change, and another scenario where the presentation needs to change.

As an example of a model change, imagine that the source for the item info was moved from an XML document to a database. I wouldn't need to change any of the templates as long as I could get the same values to pass in, and I think it's reasonable to assume I could do this. Basically, since my templates simply refer to host variables whose computation is nicely decoupled from the template code, the system passes the first test.

As an example of a presentation change, imagine that I now want to generate XHTML directly, rather than this <person><item>... business. I think the system passes this test as well. The templates themselves would have to change, but this change would be isolated from the computation of the host variables used by the templates. Some people might argue that I'm grading these tests too leniently, and that it's already problematic that the computation and presentation occurs so close together, in the same function in the same code file. I'm open to being convinced this is the case, but I'd want to hear of practical maintenance scenarios where this would be a definite problem.

So what do you think?

[Uche Ogbuji]

via Copia

We need more solid guidelines for i18n in OSS projects

Every time I'm about to tackle i18n in some project, I find myself filled with a bit of trepidation. I actually know more than the average developer on the topic, but it's a hard topic, and there is quite the mystery around it.

One of the reasons for this mystery is that there are really few good resources to guide FOSS developers through the process of internationalizing their apps. Gettext is the most common means for i18n and l10n in FOSS apps. The master resource for this is the GNU gettext manual, but that's a lot to bite off. Poking around on Google, I found a useful gettext overview for PHP folks, the Ruby on Rails folks have a nice chapter on it, and there are a few other general intros (1, 2, 3). Luis Miguel Morillas pointed out some for GNOME, KDE and wxPython (1, 2).

Python has the usual set of gettext-based facilities, which is great, but they are woefully under-documented. In the library reference's section on gettext you get a purported overview and a bunch of scattered notes on the API, but nothing that really coherently leads the developer through the concepts and process of i18n, as the PHP and Ruby folks seem to have. It doesn't even seem as if buying a book helps much. The books I most often recommend as Python intros don't seem to touch the subject, and the reference books seem to go not much deeper than the Python docs. Even David Mertz's useful Text Processing in Python doesn't cover i18n (this surprised me).

My recent foray into i18n actually straddled Python and XML worlds. For XMLers, there are a few handy resources:

XLIFF is of particular interest, but I decided not to use it for i18n in 4Suite/XSLT because I wanted to base my work on the Python facilities, which are well tested and based on the de facto standard gettext.

Anyway, am I missing something? Are there all sorts of great resources out there that would slide Python developers right into the i18n groove?

[Uche Ogbuji]

via Copia

i18n for XSLT in 4Suite

Prodded by discussion on the CherryPy list I have implemented and checked in a 4Suite XSLT extension for internationalization using Python's gettext facilities for the underlying support. Here is how it works. Sample XML:

<doc>
  <msg>hello</msg>
  <msg>goodbye</msg>
</doc>

Sample XSLT:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:f="http://xmlns.4suite.org/ext"
  extension-element-prefixes="f"
>
  <f:setup-translations domain="test"/>
  <xsl:template match="msg">
    <f:gettext><xsl:apply-templates/></f:gettext>
  </xsl:template>
</xsl:stylesheet>

The f:setup-translations and f:gettext extension elements are the key. The former looks up and installs the domain "test" for use in your XSLT. Replace it with the domain used by your application. The latter extension evaluates its body to get a string value, and then looks up this string in the installed translation.

Assume you have a test.mo installed in the right place, say one that translates "hello" to "howdy" and "goodbye" to "so long".

$ 4xslt test.xml test.xsl
<?xml version="1.0" encoding="UTF-8"?>
howdy
  so long

I trimmed some white space for formatting, but you get the idea. The translations are applied automatically according to your locale.

This operates via Python's gettext facilities, which means that it's much more efficient than, say, the DocBook XSLT approach to i18n.

For those who want to give it a whirl, here's a quick step-by-step. All the needed files are available here on copia.

Create a sandbox locale directory:

mkdir -p /tmp/locale/en_US/LC_MESSAGES/

Copy in the catalog. You may need to create a different catalog for your own language if your system will not be selecting en_US as locale (remember that you can hack the locale via the environment).

cp en_US.mo /tmp/locale/en_US/LC_MESSAGES/test.mo

Your locale is probably not en_US. If not, you can:

  • temporarily override your locale to en_US using export LANG=en_US, or the equivalent command for your shell
  • create translations for your locale (just two strings to translate). I use poedit, which makes dealing with .po files simple enough. Then replace en_US in all the above instructions with your own locale and the .mo file you created.

Anyway, the f:setup-translations and f:gettext extensions are now checked into 4Suite. You can either update to current 4Suite CVS, or just download the one changed file, Ft/Xml/Xslt/BuiltInExtElements.py and copy it into your 4Suite codebase. It works fine as a drop-in to 4Suite 1.0b1.

[Uche Ogbuji]

via Copia

Test post from Drivel

This is the first attempt to post from the new Drivel 2.0 weblogging tool, and it will probably be the last. For one thing, Drivel doesn't seem to support MetaWeblog, just Blogger 1.0 (I hope PyBlosxom gets Atom API support soon). And then when I started it up for the first time I found myself staring at a "Sending/Receiving...Retrieving journal entries..." app-modal dialog for over three minutes while the progress indicator crawled along. Doesn't exactly fill me with confidence. I've got into such a groove posting via e-mail that I'd have to be wowed from the get-go before I make a switch at this point. Maybe Drivel 3.0?

[Uche Ogbuji]

via Copia

New life for PyXPCOM?

Way back in the day I wrote about PyXPCOM, a means of using Python to script the Mozilla browser, and the project had a lot of promise.

Mark Hammond was the considerable brains behind PyXPCOM, as well as the Win32 and .NET APIs through Python, and many other things. Indeed, he received the 2003 ActiveState Active Award in Python (the same year Mike Olson and I got one for XSLT). Unfortunately, he has been way below the radar for the past couple of years, and no one has really picked up the torch on PyXPCOM. The project has been largely languishing for so long that it was quite exciting to see Brendan Eich, keeper of the Mozilla roadmap, include this among his "Mozilla 2.0 platform must-haves":

8. Python support, perhaps via Mono (if so, along with other programming languages).

I'm not sure just how Mono would fit in. Would they build a little CLR sandbox into Mozilla so that Python.NET code could run?

Anyway, if you care about being able to script Mozilla through Python (and I think you should), please leave a comment on Brendan's article. Here's a note about some of the comments already in place on the matter:

#8 scares me only for the potentially huge installer file. If it were optional this would be incredibly cool. If it were optional developers would have a headache.

I think it should be enough for Mozilla to include the PyXPCOM stubs, and use the user's own installed Python, which should alleviate this fear.

Hmm. What do you think about Parrot (Perl 6) support? Soon, Parrot will be something like [stable], and the hope is that it will support a lot of languages, includes Python. I would give it a chance, sounds good.

From what I've followed about Parrot and its intended use as a basis for other languages such as Python, I'm not comfortable with such an approach.

Python support can be provided via Jython which is much older than the .NET python implementation.

It seems people want to offer up every VM incarnation on the planet as a possible base for Mozilla/Python, but I'm spoiled by the potential I saw through Hammond's work, and I really would want the project to at least try picking up from there. I was therefore glad to see Brendan's response:

We already have Python integrated with XPCOM, thanks to Mark Hammond and Active State. If nothing better comes along in the way of a unified runtime, we will fully integrate Mark's work so you can write <script type="application/x-python"> in XUL.

Whether Python support will be bundled in libxul or not, I'm pushing for a scheme that lets extension languages be loaded dynamically. So if you have connectivity or can deploy an extra file, you should be able to use Python as well as JS from XUL. That's my goal, at least.

See my next entry for the Mozilla 2.0 "managed code" virtual machine goals that any would-be universal runtime has to meet, or come close to meeting, to win.

This sounds just right, and I'll keep my eye open for the follow-up article he mentions. Another poster mentions:

I would like to see the ability to talk to Mozilla from outside Python code. A program I am writing allows importing contacts from various data sources. I can do Outlook and Evolution easily, but have given up on Mozilla contacts.

In theory I need to use XPCom with the PyXPCom wrapper but I challenge anyone to actually get that working on Windows, Linux and Mac and have a redistributable program. (There are no binaries of PyXPCom for example).

Yes, PyXPCOM does allow this in theory, and I think Brendan's entire point is that it's important for Mozilla developers to put in the work to address the problem stated in the second paragraph.

If you're trying to work with PyXPCOM, keep an eye on the mailing list. Folks have been posting their problems, and others have been sharing their recipes for getting PyXPCOM to work, including Matt Campbell and Scott Robertson and Jean-François Rameau, and Michael Thornhill (1 2).

[Uche Ogbuji]

via Copia

Pythonic SPARQL API over rdflib

I've recently been investigating the possibility of adapting an existing SPARQL parser/query engine on top of 4RDF, mostly for the eventual purpose of implementing a sparql-eval Versa extension function, and was pleased to see there has already been some similar work done:

Although this isn't exactly what I had in mind (the more robust option would be to write an adaptor for Redland's model API and execute SPARQL queries via rasqal), it provides an interesting pythonic analog to querying RDF.

Chimezie Ogbuji

via Copia

Amara equivalents of Mike Kay's XSLT 2.0, XQuery examples

Since seeing Mike Kay's presentation at XTech 2005 I've been meaning to write up some Amara equivalents to the examples in the paper, "Comparing XSLT and XQuery". Here they are.

This is not meant to be an advocacy piece, but rather a set of useful examples. I think the Amara examples tend to be easier to follow for typical programmers (although they also expose some things I'd like to improve), but with XSLT and XQuery you get cleaner declarative semantics, and cross-language support.

It is by no means always true that an XSLT stylesheet (whether 1.0 or 2.0) is longer than the equivalent in XQuery. Consider the simple task: create a copy of a document that is identical to the original except that all NOTE attributes are omitted. Here is an XSLT stylesheet that does the job. It's a simple variation on the standard identity template that forms part of every XSLT developer's repertoire:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@* except @NOTE"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

In XQuery, lacking an apply-templates instruction and built-in template rules, the recursive descent has to be programmed by hand:

declare function local:copy($node as element()) {
  element {node-name($node)} {
    ($node/@* except $node/@NOTE,
     for $c in $node/child::node()
     return typeswitch($c)
       case $e as element() return local:copy($e)
       case $t as text() return $t
       case $c as comment() return $c
       case $p as processing-instruction() return $p
       default return ())
  }
};

local:copy(/*)

Here is Amara code to do the same thing:

def ident_except_note(doc):
    for elem in doc.xml_xpath(u'//*[@NOTE]'):
        del elem.NOTE
    print doc.xml()
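For readers without Amara at hand, the same NOTE-stripping task can be sketched with the standard library's ElementTree. This is my rough stdlib equivalent, not the author's code, and the function name and sample document are illustrative:

```python
import xml.etree.ElementTree as ET

def strip_note_attrs(xml_text):
    # Walk every element and drop any NOTE attribute, leaving
    # the rest of the document untouched.
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.attrib.pop('NOTE', None)
    return ET.tostring(root, encoding='unicode')

print(strip_note_attrs('<doc NOTE="x"><a NOTE="y">hi</a></doc>'))
```

As in the Amara version, the whole "identity transform" amounts to mutating the parsed tree in place and reserializing, with no hand-written recursive descent.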

Later on in the paper:

...nearly every FLWOR expression has a direct equivalent in XSLT. For example, to take a query from the XMark benchmark:

for    $b in doc("auction.xml")/site/regions//item
let    $k := $b/name
order by $k
return <item name="{$k}">{ $b/location } </item>

is equivalent to the XSLT code:

<xsl:for-each select="doc('auction.xml')/site/regions//item">
  <xsl:sort select="name"/>
  <item name="{name}">
     <xsl:value-of select="location"/>
  </item>
</xsl:for-each>

In Amara:

def sort_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    items = doc.xml_xpath(u'/site/regions//item')
    items.sort(key=lambda item: unicode(item.name))
    for item in items:
        newdoc.xml_append(
            newdoc.xml_element(u'item', content=item)
        )
    print newdoc.xml()
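As a rough stdlib counterpart, the same sort-and-rewrap can be sketched with ElementTree. The sample document here is my own minimal stand-in for auction.xml, and the output mirrors the FLWOR expression: items ordered by name, each wrapped in a new item element carrying the name and containing the location:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the relevant slice of auction.xml (illustrative data)
SRC = """<site><regions><europe>
<item id="i1"><name>zebra</name><location>NL</location></item>
<item id="i2"><name>apple</name><location>UK</location></item>
</europe></regions></site>"""

doc = ET.fromstring(SRC)
# order by name, as in the XQuery/XSLT versions
items = sorted(doc.iter('item'), key=lambda i: i.findtext('name'))
out = ET.Element('items')
for item in items:
    wrapper = ET.SubElement(out, 'item', name=item.findtext('name'))
    wrapper.append(item.find('location'))
result = ET.tostring(out, encoding='unicode')
print(result)
```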

This is the first of a couple of examples from XMark. To understand the examples more fully you might want to browse the paper, "The XML Benchmark Project". This was the first I'd heard of XMark, and it seems a pretty useful benchmarking test case, except that it's very heavy on records-like XML (not much on prosy, narrative documents with mixed content, significant element order, and the like). As such, I think it could only ever be a sliver of one half of any comprehensive benchmarking framework.

I think the main thing this makes me wonder about Amara is whether there is any way to make the element creation API a bit simpler, but that's not a new point for me to ponder, and if I can think of anything nicer, I'll work on it post 1.0.

Kay's paper next takes on a more complex example from XMark: "Q9: List the names of persons and the names of items they bought in Europe". In database terms this is a join across the person, closed_auction and item element sets. In XQuery:

for $p in doc("auction.xml")/site/people/person
let $a := 
   for $t in doc("auction.xml")/site/closed_auctions/closed_auction
   let $n := for $t2 in doc("auction.xml")/site/regions/europe/item
                       where  $t/itemref/@item = $t2/@id
                       return $t2
       where $p/@id = $t/buyer/@person
       return <item> {$n/name} </item>
return <person name="{$p/name}">{ $a }</person>

Mike Kay's XSLT 2.0 equivalent:

<xsl:for-each select="doc('auction.xml')/site/people/person">
  <xsl:variable name="p" select="."/>
  <xsl:variable name="a" as="element(item)*">
    <xsl:for-each 
        select="doc('auction.xml')/site/closed_auctions/closed_auction">
      <xsl:variable name="t" select="."/>
      <xsl:variable name="n" 
           select="doc('auction.xml')/site/regions/europe/item
                               [$t/itemref/@item = @id]"/>
      <xsl:if test="$p/@id = $t/buyer/@person">
        <item><xsl:copy-of select="$n/name"/></item>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <person name="{$p/name}">
    <xsl:copy-of select="$a"/>
  </person>
</xsl:for-each>

In Amara:

def closed_auction_items_by_name():
    doc = binderytools.bind_file('auction.xml')
    newdoc = binderytools.create_document()
    #Iterate over each person
    for person in doc.xml_xpath(u'/site/people/person'):
        #Prepare the wrapper element for each person
        person_elem = newdoc.xml_element(
            u'person',
            attributes={u'name': unicode(person.name)}
        )
        newdoc.xml_append(person_elem)
        #Join to compute all the items this person bought in Europe
        items = [ unicode(item.name)
          for closed in doc.xml_xpath(u'/site/closed_auctions/closed_auction')
          for item in doc.xml_xpath(u'/site/regions/europe/item')
          if (item.id == closed.itemref.item
              and person.id == closed.buyer.person)
        ]
        #XML chunk with results of join
        for item in items:
            person_elem.xml_append(
                newdoc.xml_element(u'item', content=item)
            )
    #All done.  Print out the resulting document
    print newdoc.xml()

I think the central loop in this case is much clearer as a Python list comprehension than in either the XQuery or XSLT 2.0 case, but I think Amara suffers a bit from the less literal element creation syntax and from the need to "cast" to Unicode. I would like to lay out cases where casts from bound XML structures to Unicode make sense, so I can get user feedback and implement accordingly. Kay's final example is as follows.

The following code, for example, replaces the text see [Kay, 93] with see <citation><author>Kay</author><year>93</year></citation>.

<xsl:analyze-string select="$input" regex="\[(.*),(.*)\]">
<xsl:matching-substring>
  <citation>
    <author><xsl:value-of select="regex-group(1)"/></author>
    <year><xsl:value-of select="regex-group(2)"/></year>
  </citation>
</xsl:matching-substring>
<xsl:non-matching-substring>
  <xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>

The only way of achieving this transformation using XQuery 1.0 is to write some fairly convoluted recursive functions.

Here is the Amara version:

import re
PATTERN = re.compile(r'\[(.*),(.*)\]')
def repl_func(m):
    citation = doc.xml_element(u'citation')
    citation.xml_append(doc.xml_element(u'author', content=m.group(1)))
    citation.xml_append(doc.xml_element(u'year', content=m.group(2)))
    return citation.xml(omitXmlDeclaration=u'yes')
text = u'see [Kay, 93]'
print PATTERN.sub(repl_func, text)

I think this is very smooth, with the only possible rough patch again being the output generation syntax.
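For comparison, the same rewrite can be done with only the standard library, using ElementTree in place of Amara's element API. This is my own sketch, not the author's code:

```python
import re
import xml.etree.ElementTree as ET

# Match citations of the form [Author, Year]
PATTERN = re.compile(r'\[(.*?),\s*(.*?)\]')

def repl_func(m):
    # Build <citation><author>...</author><year>...</year></citation>
    citation = ET.Element('citation')
    ET.SubElement(citation, 'author').text = m.group(1)
    ET.SubElement(citation, 'year').text = m.group(2)
    return ET.tostring(citation, encoding='unicode')

text = 'see [Kay, 93]'
print(PATTERN.sub(repl_func, text))
```

The shape is identical to the Amara version: a regex drives the matching, and the XML API only has to serialize small replacement chunks.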

I should mention that Amara's output syntax isn't really bad. It's just verbose because of its Python idiom. XQuery and XSLT have the advantage that you can pretty much write XML in-line into the code (the templating approach), whereas Python's syntax doesn't allow for this. There has been a lot of discussion of more literal XML template syntax for Python and other languages, but I tend to think it's not worth it, even considering that it would simplify the XML generation syntax of tools such as Amara. Maybe it would be nice to have a lightweight templating system that allows you to write XSLT template chunks in-line with Amara code for data processing, but then, as with most such templating systems, you run into issues of poor model/presentation separation. Clearly this is a matter for much more pondering.

[Uche Ogbuji]

via Copia

Python community: XIST

XIST 2.10

XIST (simple, free license) is a very capable package for XML and HTML processing and generation. In the maintainers' words:

XIST is an extensible HTML/XML generator written in Python. XIST is also a DOM parser (built on top of SAX2) with a very simple and Pythonesque tree API. Every XML element type corresponds to a Python class, and these Python classes provide a conversion method to transform the XML tree (e.g. into HTML). XIST can be considered "object oriented XSL".

I covered it recently in "Writing and Reading XML with XIST". There are some API tweaks and bug fixes as well as some test suite infrastructure changes. The full, long list of changes is given in Walter Dörwald's announcement.

[Uche Ogbuji]

via Copia

Scattered notes from XTech

XTech 2005. Amsterdam. Lovely time. But first of all, I went for a conference. Edd Dumbill outdid himself this time. The first coup de maître was sculpting the tracks to increase the interdisciplinary energy of the meet. The browser track brought out a lot of new faces and provided a jolt of energy. There did seem to be a bit of a divide between the browser types and the XML types, but only as much as one would expect from the fact that XML types tend to know each other, and ditto browser types. There was plenty of crosstalk between the disciplines as well.

Second touch: focus on open data, and all the excitement in that area (Creative Commons, remixing/mash-ups, picture sharing, multimedia sharing, microformats, Weblogging, content syndication, Semantic technology, podcasting, screencasting, personal information spaces, corporate info spaces, public info spaces, etc.) and watch the BBC take over (with they bad selves). And don't fret: "damn, maybe we should lighten up on the BBC bias in the speakers". No, just go with it. Recognize that they are putting forth great topics, and that everyone is amped about how the BBC is leading the way on so many information technology and policy fronts.

Third touch: foster collaboration. Put up a Wiki, encourage folks to an IRC channel, aggregate people's Weblog postings and snapshots into one place, Planet XTech, and cook up a fun little challenge to go with the theme of open data. For that last bit Edd put out an XML representation of the conference schedule and asked folks to do something cool with it. I didn't do as much with it as I'd hoped. When I finally got my presentation going I used the posted grid.xml as a demo file for playing with Amara, but I wished it had more content, especially mixed content (it's very attribute heavy). I've suggested on the XTech Wiki that if Edd does the same thing next time, that he work in paper abstracts, or something like that, to get in more text content.

I said "When I finally got my presentation going", which hints at the story of my RAI (venue for XTech) jinx. Last year in Amsterdam I couldn't get my Dell 8600 running Fedora Core 3 to agree with the projectors at the RAI. As Elliotte Rusty Harold understates in his notes from the 2004 conference:

After some technical glitches, Uche Ogbuji is talking about XML good practices and antipatterns in a talk entitled "XML Design Principles for Form and Function"

In fact I ended up having to switch to OpenOffice on Windows, and the attendees endured a font only a hippie could love (Apparently Luxi Sans is not installed by default on Windows and OO/Win has a very strange way of finding a substitute). I'm vain enough not to miss quoting another bit about my talk from Elliotte:

A very good talk. I look forward to reading the paper. FYI, he's a wonderful speaker; probably the best I've heard here yet.

Gratifying to know I managed a good talk under pressure. I hope I did so again this time, because the RAI projectors were no more friendly. The topic was "Matching Python idioms to XML idioms". Remembering the last year's headache I asked about a projector to use to test out my presentation (I was on the first day, Weds). Usually conference speakers' rooms have a spare projector for this purpose, but it looks as if the RAI couldn't supply one. I crossed my fingers and arrived for my talk the dutiful 15 minutes early. Eric van der Vlist was up before me in the block. The AV guy came along and they spent quite a while struggling with Eric's laptop (Several speakers had trouble with the RAI projectors). They finally worked out a 640x480 arrangement that caused him to have to pan around his screen. This took a while, and the AV guy bolted right afterward and was not there to help me set up my own laptop. Naturally, neither I nor the very helpful Michel Biezunski (our session chair) were able to get it to work, and we had to turn things over to Eric to start his talk.

We then both went in search of the AV guy, and it took forever to find him. No, they didn't have a spare projector that we could use to set up my laptop in time for my talk. We'd just have to wait for Eric to finish and hope for the best (insert choice sailor's vocabulary here). My time slot came and we spent 20 minutes trying every setting on my laptop and every setting on their projector. The AV guys (yeah, when it was crisis time, they actually found they had more than one) muttered taunts about Linux, and it's a lucky thing I was bent on staying calm. I present quite often, and I do usually have to try out a few settings to get things to work, but in my encounters it's only the RAI projectors that seem completely incapable of projecting from my Linux laptop. In all, I witnessed 4 speakers (3 on Linux and surprisingly one on Mac OS X) who had big problems with the RAI projectors, including one of the keynote speakers. I suspect others had problems as well.

I couldn't take the obvious escape route of borrowing someone else's laptop because the crux of my talk was a demo of Amara and I'd have to install all that as well (Several kind volunteers including Michel had 4Suite installed, but not Amara). After 20 minutes, we agreed that I'd go on with my talk on Michel's computer (Thinkpad running Red Hat 9 and it worked with the projector immediately!), skip the demo, and we'd find another time slot for me to give the entire talk the next day. Quite a few people stuck around through this mess and I'm grateful to them.

The next day we installed Amara on Michel's computer and I gave the presentation in its proper form right after lunch. There was great attendance for this reprise, considering everything. The Amara demo went fine, except that the grid.xml I was using as a sample gave too few opportunities to show off text manipulation. I'll post a bit later on thoughts relating to Amara, stemming from the conference. Norm Walsh was especially kind and encouraging about my presentation woes, and he has also been kind in his notes on XTech 2005:

The presentation [deities] did not smile on Uche Ogbuji. He struggled mightily to get his presentation of Matching Python Idioms to XML Idioms off the ground. In vain, as it turned out (AV problems were all too common for a modern conference center), but he was generous enough to try again the next day and it was worth it (thanks Uche!). I'm slowly becoming a Python convert and some of the excellent work that he's done at Fourthought to provide Python access to standard XML in ways that feel natural in Python is part of the appeal.

That's the precise idea. A tool for processing XML that works for both Python heads and XML heads. The whole point of my presentation was how hard this is to accomplish, and how just about every Python tool (including earlier releases of 4Suite) accommodates one side and not the other. The response to Amara from both the Python heads and XML heads makes me feel I've finally struck the right balance.

I got a lot out of the other XTech talks. Read Norm on the keynotes: he pretty much had the same impressions as I did. Props to Michael Kay for his great presentation comparing XSLT 2.0 and XML Query. I took enough notes at that one for a separate entry, which will follow this one. I missed a lot of the talks between Kay's and my own while I was trying (unsuccessfully) to head off the AV gremlins.

Other talks to highlight: Jon Trowbridge's on Beagle (Jon, you guessed it, had AV problems that ate up a chunk of his time slot). From the project Wiki:

Beagle is a search tool that ransacks your personal information space to find whatever you're looking for. Beagle can search in many different domains: documents, emails, web history, IM/IRC conversation, source code, images, music files, applications and [much more]

Edd had already introduced me to Beagle, but it was really cool to see it in action. I'll have to check it out. Jon also pointed out TomBoy, "a desktop note-taking application for Linux and Unix. Simple and easy to use, but with potential to help you organize the ideas and information you deal with every day." Two projects I'll have to give a spin. Props to Jon for shrugging off the AV woes and giving a fun and relaxed talk.

Robert O'Callahan's talk on the new canvas tag for Mozilla and Safari was memorable if for nothing else than the experience of surfing Google at a 45° angle, with no apparent loss in snappiness. This canvas thingie looks wicked cool, and it's good to see them working to incorporate SVG. I've heard a lot of grumbling from W3C types about canvas, and all we poor browser users in the middle can hope for is some rapid convergence of cool technologies such as XAML, XUL, canvas, SVG, etc. Others have blogged about the opportunities and anxieties opened up by the WHATWG, which one commentator said should have been the "WHAT Task Force" because "WTF" would have been a better acronym. I'm a neutral in these matters, except that I really do wish browser folks would do what they can to push people along to XHTML 2.0 rather than cooking up HTML 5.0 and such.

Matt Biddulph was one of the BBC Massive on hand, and his talk "The Application of Weblike Design to Data - Designing Data for Reuse" offered a lot of practical tips on how to usefully open up a large body of data from a large organization.

Dominique Hazaël-Massieux gave a talk on GRDDL (O most unfortunate project name), which was my first hearing of the technology. My brief characterization of GRDDL is as an attempt to pull the Wild West ethos of microformats into the rather more controlled sphere of RDF. It touches on topics in which I've been active for years, including tools for mapping XML to RDF. I've argued all these years that RDF folks will have to embrace general XML in place of the RDF/XML vocabulary if they are to make much progress. They will have to foster tools that make extracting RDF model data from XML a no-brainer. It's great to see the W3C finally stirring in this direction. Dom presented very well. I asked about the use of other systems, such as schema annotation, for the XML to RDF mapping. It seemed to me that GRDDL was mostly geared towards XSLT. Dom said it is meant to be independent of the mapping mechanism, but in my post-conference reading I'm not so sure. I'll have to ponder this matter more and perhaps post my thoughts. Dom also mentioned PiggyBank, "the Semantic Web extension for Firefox". Kingsley Idehen has a nice blurb on this software. I do hesitate to use it because someone mentioned to me how PiggyBank had gone into crazy thrash mode at one point. I don't muck with my Firefox set-up lightly.
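As a sketch of the mechanism (my reading of the early GRDDL drafts, with the stylesheet name invented for illustration): a document declares the GRDDL profile in its head and links to the XSLT that extracts its RDF, and a GRDDL-aware processor harvests those transformation links. Finding them takes only a few lines:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# A toy XHTML document opting in to GRDDL; hcard2rdf.xsl is hypothetical.
doc = ET.fromstring(f"""
<html xmlns="{XHTML}">
  <head profile="{GRDDL_PROFILE}">
    <title>Contact page</title>
    <link rel="transformation" href="hcard2rdf.xsl"/>
  </head>
  <body/>
</html>
""")

head = doc.find(f"{{{XHTML}}}head")
transforms = []
if GRDDL_PROFILE in (head.get("profile") or "").split():
    # Each rel="transformation" link names an XSLT that yields RDF/XML.
    transforms = [link.get("href")
                  for link in head.findall(f"{{{XHTML}}}link")
                  if "transformation" in (link.get("rel") or "").split()]
print(transforms)  # a GRDDL processor would now run these transforms
```

This is also where my XSLT-centrism question bites: the discovery step is mapping-agnostic, but everything the links point to is, in practice, XSLT.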

Rick Jelliffe showed off Topologi's lightweight browser TreeWorld, which is XML-oriented and suitable for embedding into other applications.

Others have blogged Jean Paoli's closing keynote (Glazman: http://glazman.org/weblog/dotclear/index.php?2005/05/29/1059-adam-3, Leigh, etc.). Seems I'm not the only one who was put off by the straight-up product pitch. At least he did a bit of a service by clearly saying "Binary XML: No please". Check out more quotes from XTech.

The conference was superb. Do be sure not to miss it next year. It's looking like Amsterdam will be the venue again. And what of Amsterdam? Besides the conference I had a great time with friends. I'll post on that later.

For the most comprehensive report I've seen to date, see Micah Dubinko's article.

[Uche Ogbuji]

via Copia