Help needed with Python/SAX implementation comparison

If you want to help clarify the implementation differences across Python/SAX implementations and you are familiar with XSLT, please respond to Mike Brown's request for help. Andrew Clover's similar work for Python DOM implementations has proved a very useful resource, and it would be nice to have the same for SAX. Mike has done the hard part. He just needs someone to carry it across the finish line. I did discuss some of the unfortunate Python/SAX confusion in "Practical SAX Notes". A tabular analysis would be a nice addition to that discussion.

[Uche Ogbuji]

via Copia

I already said OPML is crap, right? I had to hack through another reminder today.

So today I tried to import OPML (yeah, that very OPML) into Findory (see last entry). The OPML is based on what I originally exported from Lektora and has been through all my feed experiments. A sample entry:

<outline url="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>

What does Findory tell me? 97 feeds rejected for "invalid source". Great. Now I actually have to get my hands dirty in OPML again. I check the spec. Of course there's no useful information there. I eventually found this Wiki of OPML conventions. I saw the type='rss' convention, but that didn't seem to make a difference. I also tried xmlUrl rather than url, like so:

<outline xmlUrl="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>

This time the Findory import works.

The trouble is that several of the feed readers I use write url rather than xmlUrl, and the XBEL to OPML XSLT I've found assumes that name as well. The conventions page also mentions title versus text as a way to provide formatting in some vague way, but I've seen OPML feeds use only title, with nary a text to be seen anywhere. Besides, what's wrong with the XML way of allowing formatting: elements rather than attributes? It's enough to boil the brain.
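One way to cope with the url versus xmlUrl split is to normalize the OPML before handing it to a picky importer. What follows is only a minimal sketch in present-day Python 3, not anything Findory or the feed readers provide: the function name normalize_feed_urls and the opml/head/body wrapper around the sample entry are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def normalize_feed_urls(opml_text):
    """Return OPML text in which outlines carry their feed URL as xmlUrl.

    Hypothetical helper: renames a bare "url" attribute to "xmlUrl" and
    leaves outlines that already have xmlUrl untouched.
    """
    root = ET.fromstring(opml_text)
    for outline in root.iter("outline"):
        if "url" in outline.attrib and "xmlUrl" not in outline.attrib:
            outline.set("xmlUrl", outline.attrib.pop("url"))
    return ET.tostring(root, encoding="unicode")


# The outline is the sample entry from above; the wrapper is assumed.
SAMPLE = """<opml version="1.1">
  <head><title>feeds</title></head>
  <body>
    <outline url="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>
  </body>
</opml>"""

print(normalize_feed_urls(SAMPLE))
```

The same renaming could just as well live in the XSLT that exports OPML. The point is only that the attribute name has to be chosen per consuming tool.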

Speaking of XBEL, that's actually how I'm managing my feeds now, as I'll discuss further in the next entry. Now that Web feeds have become important to me I'll be using a sane format to manage them, thank you very much. I'll do the XSLT thing to export OPML for all the different tools that insist on tag soup. That is, of course, if I can figure out what precise shade of OPML each tool insists on. Today's adventure with feed URL attributes makes me wonder whether there is any escaping the chaos.

[Uche Ogbuji]


Don't give me that monkey-ass Web 1.0, either

Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz. Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing for the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).

Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata to participate in large parts of the not-Google-after-Web marketplace, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation is probably just a big, speculative bubble that will burst soon and take with it all these centralizing forces, but at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.

Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.

[Uche Ogbuji]


The uneven state of Schematron

It has been a sketchy time for Schematron fans. Rick Jelliffe, the father of Schematron, has had pressing matters that have prevented him from putting much work into Schematron for a while. Many questions still remain about the technology specification, and alas, there appears to be no place to keep discussions going on these and other matters (the mailing list on SourceForge appears to be defunct, with even the archives giving a 404). Here are some notes on recent Schematron developments I've come across.

I wasn't paying enough attention and I just came across the new Schematron Web site. Launched in February, it supersedes the old Academia Sinica page. Some of the content was copied without much editing from the older site. The overview says "The Schematron can be useful in conjunction with many grammar-based structure-validation languages: DTDs, XML Schemas, RELAX, TREX, etc.", but RELAX and TREX were combined into RELAX NG years ago. Of greater personal interest is the fact that it carries over a bad link to my old Schematron/XSLT article. As I've corrected several times on the mailing list, that article is "Introducing the Schematron". Schematron.com also does not list two of my more recent articles:

Schematron.com does, however, include an entire page on ISO Schematron, including some sketchy updates I'm hoping to follow up on.

G. Ken Holman told me he created a modified version of the Schematron 1.5 reference XSLT implementation that allows the context of assertions to be attributes, not just elements. You can find his version linked from this message. I did point out to him that Scimitar (part of Amara) supports attributes as context, and overall attempts to be a fast and complete ISO Schematron implementation.

[Uche Ogbuji]


timer.py, a specialization of timeit

Most Pythoneers are familiar with the very handy timeit module. It's a great way to compare Python idioms for performance. I tend to use it from the command line, as in the following.

$ python -m timeit "''.join([ str(i*10000) for i in xrange(100) ])"
10000 loops, best of 3: 114 usec per loop

You can use this method to time multi-line code as well, using multiple command line quoted arguments.

$ python -m timeit "s = ''" "for i in xrange(100):" "    s += str(i*10000)"
1000 loops, best of 3: 351 usec per loop

The python -m trick is new in Python 2.4. Notice the indentation built into the third argument string.
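The same comparison can also be run from inside Python rather than from the shell. This is a rough sketch written for present-day Python 3 (so range replaces xrange); the helper name best_usec and the fixed count of 1000 loops are choices made here for illustration, not something the timeit command line does for you.

```python
import timeit

# The two idioms timed above, as statement strings.
JOIN_STMT = "''.join([str(i * 10000) for i in range(100)])"
CONCAT_STMT = """\
s = ''
for i in range(100):
    s += str(i * 10000)
"""


def best_usec(stmt, number=1000, repeat=3):
    """Return the best-of-`repeat` time per loop, in microseconds."""
    totals = timeit.repeat(stmt, number=number, repeat=repeat)
    return min(totals) / number * 1e6


join_usec = best_usec(JOIN_STMT)
concat_usec = best_usec(CONCAT_STMT)
print("join:   %.1f usec per loop" % join_usec)
print("concat: %.1f usec per loop" % concat_usec)
```

Taking the minimum of the repeats mirrors the "best of 3" figure in the command-line output.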

As you can imagine, this quickly becomes cumbersome, and it would be nice to have a way to perform such timings on proper script files without too much fiddling.

Jeremy Kloth scratched that itch, coming up with timer.py. I bundle it in the test directory of Amara, but you can also grab it directly from CVS.

You can run it on a script, putting the logic to be timed into a main function. The rest of the script's contents will be treated as set-up and not factored into the timings.

$ cat /tmp/buildstring.py
import cStringIO

def main():
    s = cStringIO.StringIO()
    for i in xrange(100):
        s.write(str(i*10000))
$ python timer.py /tmp/buildstring.py
1000 loops, best of 3: 444 usec

timer.py uses the basic logic from timeit. It tries to keep the running time between 0.2 and 2 secs.
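To make the 0.2 to 2 second window concrete, here is a minimal sketch of that kind of loop-count calibration. It is not Jeremy's timer.py: the function time_main, its min_total parameter, and the Python 3 rewrite of the buildstring example (io.StringIO and range in place of cStringIO and xrange) are all assumptions made for illustration.

```python
import io
import timeit


def time_main(main, min_total=0.2, repeat=3):
    """Return (loops, best seconds per loop) for a zero-argument callable."""
    timer = timeit.Timer(main)
    loops = 1
    # Grow the loop count tenfold until one batch takes at least min_total
    # seconds. The previous batch was under min_total, so the accepted batch
    # should run for roughly less than ten times that (about 2 secs here).
    while timer.timeit(loops) < min_total:
        loops *= 10
    best = min(timer.repeat(repeat=repeat, number=loops))
    return loops, best / loops


def main():
    # Python 3 rendering of the buildstring.py example above.
    s = io.StringIO()
    for i in range(100):
        s.write(str(i * 10000))


loops, per_loop = time_main(main)
print("%d loops, best of 3: %.3g usec" % (loops, per_loop * 1e6))
```

Anything defined outside main, such as the import, plays the role of set-up and is never inside the timed call.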

[Uche Ogbuji]


Amara API quick reference, and Windows packages

I forgot to mention in the Amara 1.1.6 announcement that I drafted an API quick reference. I've put a link to it on the Amara home page.

I've also added a Windows installer created by Sylvain Hellegouarch, with some help from Jeremy Kloth. It's an installer for Amara "allinone", so all you need is Python 2.4 for Windows already installed. Run this installer and you should be all set.

[Uche Ogbuji]


"Tip: Use the right pattern for simple text in RELAX NG"


The RELAX NG XML schema language allows you to say "permit some text here" in a variety of ways. Whether you're writing patterns for elements or attributes, it is important to understand the nuances between the different patterns for character data. In this tip, Uche Ogbuji discusses the basic foundations for text in RELAX NG.

Several times while working on RELAX NG in mentoring roles with clients I've had to explain some of the nuances in the various ways to express simple text patterns. In this article I lay out some of the most common distinctions I've had to make. I should say that much of what I know about RELAX NG nuances I learned from Eric van der Vlist and a lot of that wisdom is gathered in his RELAX NG book (in print or online). I recommend the print book because it has some nice additions not in the online version, and because Eric deserves to eat.

[Uche Ogbuji]


Amara 1.1.6

I released Amara 1.1.6 last week (see the announcement). This version requires 4Suite XML 1.0b2. As usual, though, I have prepared an "allinone" package so that you do not need to install 4Suite separately to use Amara.

The biggest improvements in this release are to performance and to the API. Amara takes advantage of a lot of the great performance work that has gone into 4Suite (e.g. Saxlette). There is also a much easier API on-ramp that I expect most users will appreciate. Rather than having to parse using:

from amara import binderytools as bt
doc = bt.bind_string(XML) #or bt.bind_uri or bt.bind_file or bt.bind_stream

You can use:

import amara
doc = amara.parse(XML) #Whether XML is string, file-like object, URI or local file path

There are several other such simplifications. There is also the xml_append_template facility, which is very handy for generating XML (see how Sylvain uses it to simplify atomixlib).

Thanks to all the folks who helped with suggestions, patches, review, etc.

[Uche Ogbuji]


Agile Web #1: "Google Sitemaps"

"Google Sitemaps"

Uche Ogbuji's new XML.com column, "Agile Web," explores the intersection of agile programming languages and Web 2.0. In this first installment he examines Google's Sitemaps schema, as well as Python and XSLT code to generate site maps. [Oct. 26, 2005]

And with this article the "Python and XML" column has been replaced by a new one titled "Agile Web".

I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and web technologies, particularly the sorts of dynamic developments on the web for which some use the moniker, "Web 2.0." The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.

In this first article I introduce the Google Sitemaps program, its XML format, and Python tools for working with it.
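For a flavor of what generating a site map from Python involves, here is a small sketch. It is not the code from the article. The build_sitemap function and the example.org entries are hypothetical, and the namespace is the one from the later sitemaps.org protocol, which may differ from the Google schema the article targets.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the sitemaps.org protocol, not necessarily the
# Google schema version discussed in the article.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from (location, last-modified) pairs."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, for illustration only.
ENTRIES = [
    ("http://example.org/", "2005-10-26"),
    ("http://example.org/about", "2005-10-01"),
]

print(build_sitemap(ENTRIES))
```

Setting xmlns as a plain attribute is a shortcut that works for serialization. Code that also needs to read site maps back would deal with namespace-qualified element names.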

[Uche Ogbuji]
