Planet Atom

Planet Atom is now live.

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, and John L. Clark. Please visit the Planet Atom development site if you are interested in the source code. The complete list of sources is maintained in XBEL format (with some experimental extensions); please contact one of the site developers if you want to suggest a modification to this list.

John, Sylvain and I have been working at this on and off for over a month now (we've all been swamped with other things—the actual development of the site was fairly straightforward). Planet Atom is built on an aggregation from Atom 1.0 feeds into one larger feed (with entries collated, trimmed etc.) It's built on 4Suite (for XSLT processing), CherryPy (for Web serving), Amara (for Atom feed slicing and dicing), atomixlib (for building the aggregate feed) and dateutil (for date wrangling), with Python and XML as the twin foundations, of course. Thanks to folks on the #atom and #swhack IRC channels for review and feedback.

[Uche Ogbuji]

via Copia

Recipe: fast scan of an XML file for one field

If you have a huge XML file and you need to grab the first instance of a particular field in a fast and memory efficient manner, a simple one-liner with Amara's pushbind does the trick.

val = unicode(amara.pushbind("book.xml", "title").next())

Returns the text of the first title element in a book.xml (which could be Docbook or any other format with a title element), loading hardly any of the file into memory. It also doesn't parse the file beyond the target element. It would be a shade slower to get such an element at the end of a file. For example, the following line gets the title of a Docbook index.

val = unicode(amara.pushbind("book.xml", "index/title").next())

Even when finding an element near the end of a file it's very fast. All my use cases were pretty much instantaneous working with a 150MB file (I'm working on convincing the client to avoid such huge files).

If the target element is not found, you'll get a StopIteration exception.

[Uche Ogbuji]

via Copia

The uneven state of Schematron

It has been a sketchy time for Schematron fans. Rick Jelliffe, the father of Schematron has had pressing matters that have prevented him from putting much work into Schematron for a while. Many questions still remain about the technology specification, and alas, there appears to be no place to keep discussions going on these and other matters (the mailing list on SourceForge appears to be defunct, with even the archives giving a 404. Here are some notes on recent Schematron developments I've come across.

I wasn't paying enough attention and I just came across the new Schematron Web site. Launched in February, it supersedes the old Academia Sinica page. Some the content was copied without much editing from the older site. The overview says "The Schematron can be useful in conjunction with many grammar-based structure-validation languages: DTDs, XML Schemas, RELAX, TREX, etc.", but RELAX and TREX were combined into RELAX NG years ago. Of greater personal interest is the fact that it carries over a bad link to my old Schematron/XSLT article. As I've corrected several times on the mailing list, that article is "Introducing the Schematron". Schematron.com also does not list two of my more recent articles:

Schematron.com does, however, include an entire page on ISO Schematron, including some sketchy updates I'm hoping to follow up on.

G. Ken Holman told me he created a modified version of the Schematron 1.5 reference XSLT implementation that allows the context of assertions to be attributes, not just elements. You can find his version linked from this message. I did point out to him that Scimitar (part of Amara) supports attributes as context, and overall attempts to be a fast and complete ISO Schematron implementation.

[Uche Ogbuji]

via Copia

timer.py, a specialization of timeit

Most Pythoneers are familiar with the very handy timeit module. It's a great way to compare Python idioms for performance. I tend to use it from the command line, as in the following.

$ python -m timeit "''.join([ str(i*10000) for i in xrange(100) ])"
10000 loops, best of 3: 114 usec per loop

You can use this method to time multi-line code as well, using multiple command line quoted arguments.

$ python -m timeit "s = ''" "for i in xrange(100):" "    s += str(i*10000)"
1000 loops, best of 3: 351 usec per loop

The python -m trick is new in Python 2.4. Notice the indentation built into the third argument string.

As you can imagine, this quickly becomes cumbersome, and it would be nice to have a way to perform such timings on proper script files without too much fiddling.

Jeremy Kloth scratched that itch, coming up with timer.py. I bundle it in the test directory of Amara, but you can also grab it directly from CVS.

You can run it on a script, putting the logic to be timed into a main function. The rest of the script's contents will be treated as set-up and not factored into the timings.

$ cat /tmp/buildstring.py
import cStringIO

def main():
    s = cStringIO.StringIO()
    for i in xrange(100):
        s.write(str(i*10000))
$ python timer.py /tmp/buildstring.py
1000 loops, best of 3: 444 usec

timer.py uses the basic logic from timeit. It tries to keep the running time between 0.2 and 2 secs.

[Uche Ogbuji]

via Copia

Amara API quick reference, and Windows packages

I forgot to mention in the Amara 1.1.6 announcement that I drafted an API quick reference. I've put a link to it on the Amara home page.

I've also added a Windows installer created by Sylvain Hellegouarch, with some help from Jeremy Kloth. It's an installer for Amara "allinone", so all you need is to have installed Python 2.4 for Windows, then you run this installer, and you should be all set.

[Uche Ogbuji]

via Copia

Amara 1.1.6

I released Amara 1.1.6 last week (see the announcement). This version requires 4Suite XML 1.0b2. As usual, though, I have prepared an "allinone" package so that you do not need to install 4Suite separately to use Amara.

The biggest improvements in ths release are to performance and to the API. Amara takes advantage of a lot of the great performance work that has gone into 4Suite (e.g. Saxlette). There is also a much easier API on-ramp that I expect most users will appreciate. Rather than having to parse using:

from amara import binderytools as bt
doc = bt.bind_string(XML) #or bt.bind_uri or bt.bind_file or bt.bind_stream

You can use

import amara
amara.parse(XML) #Whether XML is string, file-like object, URI or local file path

There are several other such simplifications. There is also the xml_append_template facility, which is very handy for generating XML (see how Sylvain uses it to simplify atomixlib).

Thanks to all the folks who helped with suggestions, patches, review, etc.

[Uche Ogbuji]

via Copia

Cop it while it's hot: 4Suite XML 1.0b2

Updated with working link for the manual

We've announced 4Suite XML 1.0b2 We're a big step towards a 1.0 release, even bigger than most of our releases because what we've done with this is trim the overall 4Suite package to a sensible size and scope. This release contains only the XML processing core and some support libraries. It does not contain the RDF libraries and the repository. This does not mean those components are stranded (see, for example, the rdflib/4RDF merger effort for a sense of the new juice being fed into 4Suite/RDF). It's just that the core XML libraries are so much more mature than the other components, and so much more widely used, that it made no sense not to set it free and let it quickly march to 1.0 on its own. This release features some serious performance improvements, some simplified APIs for a gentler user learning curve, and a lot of fixes and other improvements (see the announcement for the full, long list).

In fact, the code we released is just about 1.0 in itself, as far as the XML component goes. A code freeze is in place, and we'll focus on fixing bugs and wrapping up the user manual effort. (BTW, if you'd like to help chip into the manual, please say so on the 4Suite-dev mailing list; there is a lot of material in place, and what we need is mostly in the way of editing and improving details). Our plan is to get the XML core to 1.0 more quickly than we would have been able before breaking 4Suite into components, and then we can focus on RDF and the repository. 4Suite/RDF will probably disappear into the new rdflib, and the repository will probably go through heavy refactoring and simplification.

Today, after some day-job tasks, my priority will be getting Amara 1.1.5 out. It's been largely ready and waiting for the 4Suite release. Some really sweet improvements and additions in this Amara release (though I do say so myself). More on that later.

[Uche Ogbuji]

via Copia

Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST, and using HTML Tidy to grab tag soup HTML as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks such as accessing Web data beyond trivially fetched XML are matters for the model layer, and not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup rather than tidy because I find it's faster and more robust.

Even if you use the libxml2 family of tools, I still think it's better to use libxml, and perhaps the libxml HTML parser to do the model processing and hand over resulting XML to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia

More on the PyBlosxom del.icio.us plug-in, and introducing task_control.py, a a pseudo cron plug-in for PyBlosxom

Micah put my del.icio.us daily links tool to immediate use on his blog. He uncovered a bug in the character handling, which is now fixed in the posted amara_delicious.py file.

I usually invoke the script from cron, but Micah asked if there was an alternative. I've been meaning to hack up a poor man's cron for PyBlosxom and this gave me an additional push. The result is task_control.py.

A sort of poor man's cron for PyBlosxom, this plug-in allows you to specify tasks (as Python scripts) to be run only at certain intervals Each time the plug-in is invoked it checks a set of tasks and the last time they were run. It runs only those that have not been run within the specified interval.

To run the Amara del.icio.us daily links script once a day, you would add the following to your config file:

py["tasks"] = {"/usr/local/bin/amara_delicious.py": 24*60*60}
py["task_control_file"] = py['datadir'] + "/task_control.dat"

You could of course have multiple script/interval mappings in the "tasks" dict. The scripts are run with variables request and config set, so, for example, if running from task_control.py, you could change the line of amara_delicious.py from

BASEDIR = '/srv/www/ogbuji.net/copia/pyblosxom/datadir'

to

BASEDIR = config['datadir']

[Uche Ogbuji]

via Copia