Copia

Agile Web #3: "Scripting Flickr with Python and REST"

In his latest Agile Web column, Uche Ogbuji shows us how to use Python to interact with Flickr as a lightweight web service.

This Agile Web installment is fairly straightforward. I look at the several Python libraries for accessing Flickr from programs. They range from low level, thin veneers over the official Flickr API to the one higher level, more Pythonic library. And of course there's the obligatory package I just can't get to work.

[Uche Ogbuji]

via Copia

Merging Atom 1.0 feeds with Python

mergeatom.py

At the heart of Planet Atom is the mergeatom module. I've updated mergeatom a lot since I first released it. It's still a simple Python utility for merging multiple Atom 1.0 feeds into an aggregated feed. Some of the features:

Reads in a list of atom URLs, files or content strings to be merged into a given target document
Puts out a complete, merged Atom document (duplicates by atom:id are suppressed).
Collates the entries according to date, allowing you to limit the total. WARNING: Entries from the original Atom feed may be deleted according to ID duplicate removal or entry count limits.
Allows you to set the sort order of resulting entries
Uses atom:source elements, according to the spec, to retain key metadata from the originating feeds
Normalizes XML namespaces prefixes for output Atom elements (uses atom:*)
Allows you to limit contained entries to a date range
Handles base URIs fixup intelligently (Base URIs on feed elements are) migrated on to copied entries so that contained relative links remain correct

It requires atomixlib 0.3.0 or more recent, and Amara 1.1.6 or more recent

[Uche Ogbuji]

via Copia

Planet Atom

Planet Atom is now live.

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, and John L. Clark. Please visit the Planet Atom development site if you are interested in the source code. The complete list of sources is maintained in XBEL format (with some experimental extensions); please contact one of the site developers if you want to suggest a modification to this list.

John, Sylvain and I have been working at this on and off for over a month now (we've all been swamped with other things—the actual development of the site was fairly straightforward). Planet Atom is built on an aggregation from Atom 1.0 feeds into one larger feed (with entries collated, trimmed etc.) It's built on 4Suite (for XSLT processing), CherryPy (for Web serving), Amara (for Atom feed slicing and dicing), atomixlib (for building the aggregate feed) and dateutil (for date wrangling), with Python and XML as the twin foundations, of course. Thanks to folks on the #atom and #swhack IRC channels for review and feedback.

[Uche Ogbuji]

via Copia

ElementTree in Python stdlib

The big news in Python/XML in December was the checking in of ElementTree into Python's standard library, which means the package will be part of Python 2.5. The python-dev summary has a good account of the move. I'm surprised it took so long. Even though I've been involved in at least one ElementTree versus 4Suite battle (I also wrote the earliest and possibly only comprehensive treatment of ElementTree), I've always appreciated that ElementTree's relative weightlessness makes it a decent option for Pythoneers who really don't care much about all the intricacies of XML and just have to reluctantly deal with a mess of angle brackets. That's the sort of library that ends up in the stdlib, and for good reason. And for sure, developers have needed a stdlib alternative to SAX and DOM for ages.

A very useful side effect of this move is a step towards merging PyXML into Python SVN, and eliminating the _xmlplus hack that was used to have PyXML installs shadow the stdlib originals. XML-SIG has argued about this for years, but there was never a stimulus for anyone to actually get something done about it. To quote from my article "Gems from the Mines: 2002 to 2003".

In early 2003] Martijn Faassen kicked off a long discussion on the future of PyXML by [complaining about the "_xmlplus hack" that PyXML uses to serve as a drop-in replacement for the Python built-in XML libraries. After he reiterated the complaint the discussion turned to a very serious one that underscored the odd in-between status of PyXML, and in what ways the PyXML package continued to be relevant as a stand-alone. Most of these issues are still not resolved, so this thread is an important airing of considerations that affect many Python/XML users to this day.

Congrats to Fredrik on this culmination of his hard work. And here's to the continued diversity of developer options in the very complex field of XML processing.

[Uche Ogbuji]

via Copia

Expat 2.0 (featured in 4Suite CVS)

Expat 2.0 has made it to the world, after a long incubation. Expat is, of course, the very popular XML parser in C originally developed by James Clark. The first I learned of this development was an announcement by Jeremy Kloth. His announcement also mentioned that current 4Suite CVS includes Expat 2.0, and that it's probably the first outside project to do so. In fact, the most recent 4Suite beta release included an Expat CVS snapshot that was for all practical purposes 2.0.

This is a very important milestone as it will allow Expat development to move on to more innovative pastures. And of course it adds the essential—support for AmigaOS (I'm going to get hate mail from my Amiga booster friends from college).

[Uche Ogbuji]

via Copia

Inventing XML for music

I got an interesting message in response to "Learn how to invent XML languages, then do so". Michael Good of Recordare LLC wrote:

I enjoyed your response to Tim Bray's piece on inventing XML languages.

I hope that when listing important XML vocabularies in the future, you will consider including the MusicXML language for music notation:

For 20 years people had tried to invent a better format than MIDI for exchanging music notation between applications. MIDI was not designed for this purpose, even though it was used this way. It could do the bare basics but not much more, and was really inadequate to the task. The two major attempts, NIFF and SMDL, failed in attracting any significant industry support (in SMDL's case, any industry support whatsoever). MusicXML is the first such language to succeed.

MusicXML is supported by over 50 applications, including the market leaders in music notation editors (Finale and Sibelius) and all the major players in music scanning. It has been adopted by commercial and open source projects; by industry developers, hobbyists, and academic researchers; by established products and innovative new applications. Consumers can finally exchange digital sheet music files between applications, and the barriers to entry for innovative new applications in the market has been dramatically lowered (e.g. see the entry of MuseBook, OrganMuse, Notion, and musicRAIN into the market).

I'm sure there are other examples where XML has successfully (not just potentially) broken down barriers between document interchange in specialized fields. But for now, MusicXML is the most dramatic success story I know. I do want to better understand how MusicXML measures up to XML vocabularies in other industries in terms of adoption rate. If you have pointers to other work in this area, anything you could send on would be most appreciated.

This just underscores my first reaction whenever I hear someone discouraging people from inventing XML formats—how can that be the product of anything but the narrowest world view? XML's strength is in providing a syntactic framework that can be used across innumerable domains. There is no reason why a musical XML format should not be as important as , say, Docbook. I don't know anything about XML formats in the music field, but I'm certainly happy to see that there is room in the XML universe for MusicXML as well as Country Dance (Folk Dancing) animation language, to grab another example plucked from XML.com.

I was curious, so I browsed the landscape a bit for Music and XML. There were some interesting nuggets in the usual sea of noise in this SlashDot article on MusicXML, including this comment:

[Don't forget] the archival value of MusicXML -- [people] criticize it for "re-inventing the wheel," but they're only looking at the value for music composers and consumers.

The true value of MusicXML is as a universally understood format for describing musical scores digitally. The music libraries of the future aren't going to be made of paper, don't you think?

This speaks very sensibly to the overall value proposition of XML. The universal syntax allows data formats to evolve that enhance the longevity of stored data. Longevity that comes from transparency. (OK, so you have to have long-lived storage media as well, but that's a different topic). Such longevity is further enhanced by (once again) the closeness of the expression to the domain model.

There are other XML specifications in the area:

Actually, just go straight to the indefatigable Robin Cover on the topic.

[Uche Ogbuji]

via Copia

Learn how to invent XML languages, then do so

There has been a lot of chatter about Tim Bray's piece "Don’t Invent XML Languages". Good. I'm all for anything that makes people think carefully about XML language design and problems of semantic transparency (communicated meaning of XML structure). I'm all for it even though I generally disagree with Tim's conclusions. Here are some quick thoughts on Tim's essay, and some of the responses I've seen.

Here’s a radical idea: don’t even think of making your own language until you’re sure that you can’t do the job using one of the Big Five: XHTML, DocBook, ODF, UBL, and Atom.—Bray

This is a pretty biased list, and happens to make sense for the circles in which he moves. Even though I happen to move in much the same circles, the first thing I'd say is that there could hardly ever be an authoritative "big 5" list of XML vocabs. There is too much debate and diversity, and that's too good a thing to sweep under the rug. MS Office XML or ODF? OAGIS or UBL? RSS 2.0 or Atom? Sure I happen to plump for the latter three, as Tim does, but things are not so clear cut for the average punter. (I didn't mention TEI or DocBook because it's much less of a head to head battle).

I made my own list in "A survey of XML standards: Part 3—The most important vocabularies" (IBM developerWorks, 2004). It goes:

XHTML
Docbook
XSL-FO
SVG
VoiceXML
MathML
SMIL
RDF
XML Topic Maps

And in that article I admit I'm "just scratching the surface". The list predates first full releases of Atom and ODF, or they would have been on it. I should also mention XBEL, which is, I think, not as widely trumpetd, but just about as important as those other entrants. BTW, see the full cross-reference of my survey of XML standards.

Designing XML Languages is hard. It’s boring, political, time-consuming, unglamorous, irritating work. It always takes longer than you think it will, and when you’re finished, there’s always this feeling that you could have done more or should have done less or got some detail essentially wrong.—Bray

This is true. It's easy to be flip and say "sure, that's true of programming, but we're not being advised to write no more programs". But then I think this difficulty is even more true of XML design than of programming, and it's worth reminding people that a useful XML vocabulary is not something you toss off in the spare hour. Simon St.Laurent has always been a sound analyst of the harm done by programmers who take shortcuts and abuse markup in order to suite their conventions. The lesson, however, should be to learn best practices of markup design rather than to become a helpless spectator.

If you’re going to design a new language, you’re committing to a major investment in software development. First, you’ll need a validator. Don’t kid yourself that writing a schema will do the trick; any nontrivial language will have a whole lot of constraints that you can’t check in a schema language, and any nontrivial language needs an automated validator if you’re going to get software to interoperate.

If people would just use decent schema technology, this point would be very much weakened. Schema designers rarely see beyond plain W3C XML Schema or RELAX NG. Too bad. RELAX NG plus Schematron (with XPath 1.0/XSLT 1.0 drivers) covers a huge number of constraints. Add in EXSLT 1.0 drivers for Schematron and you can cover probably 95+% of Atom's constraints (probably more, actually). Throw in user-defined extensions and you have a very powerful and mostly declarative validation engine. We should do a better job of rendering such goodness to XML developers, rather than scaring them away with duct-tape-validator bogeymen.

Yes, XHTML is semantically weak and doesn’t really grok hierarchy and has a bunch of other problems. That’s OK, because it has a general-purpose class attribute and ignores markup it doesn’t know about and you can bastardize it eight ways from center without anything breaking. The Kool Kids call this “Microformats”...

This understated bit is, I think, the heart of Tim's argument. The problem is that I still haven't been able to figure out why Microformats have any advantage in Semantic transparency over new vocabularies. Despite the fuzzy claims of μFormatters, a microformat requires just as much specification as a small, standalone format to be useful. It didn't take me long kicking around XOXO to solve a real-world problem before this became apparent to me.

Some interesting reactions to the piece

Dare Obasanjo. Dare indirectly brought up that Ian Hickson had argued against inventing XML vocabularies in 2003. I remember violently and negatively reacting to the idea that everyone should stick to XHTML and its elite companions. Certainly such limitations make sense for some, but the general case is more nuanced (thank goodness). Side note: another pioneer of the pessimistic side of this argument is Mark Pilgrim http://www.xml.com/pub/au/164. Needless to say I disagree with many of his points as well.

I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, instead I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it instead. However given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan.—Obasanjo

Dare is, I think, about as stubborn and tart as I am, so I'm amazed to see him doubting his convictions in this way. Please don't, Dare. You're quite correct. Microformats are just a hair away from my pet reductio ad absurdum—<tag type="title"> rather than just <title>. I still haven't heard a decent argument for such periphrasis. And I don't see how the fact that tag is semantically anchored does anything special for the stepchild identifier title in the microformats scenario.

BTW, there is a priceless quote in comments to Dare:

OK, so they're saying: don't create new XML languages - instead, create new HTML languages. Because if you can't get people to [separate presentation from data], hijack the presentation!—"Steve"

Wot he said. With bells on.

Danny Ayers .

I think most XML languages have been created by one of three processes - translating from a legacy format; mapping directly from the domain entities to the syntax; creating an abstract model from the domain, then mapping from that to the XML. The latter two of these are really on a greyscale: a language designer probably has the abstract entities and relationships in mind when creating the format, whether or not they have been expressed formally.—Ayers

I've had my tiffs with RDF gurus lately, but this is the sort of point you can trust an RDF guru to nail, and Danny does so. XML languages are, like all languages, about expression. The farther the expression lies from the abstraction being expressed (the model), the more expensive the maintenance. Punting to an existing format that might have some vague ties to the problem space is a much worse economic bet than the effort of designing a sound and true format for that problem space.

To slightly repurpose another Danny quote towards XML,

...in most cases it’s probably best to initially make up afresh a new representation that matches the domain model as closely as possible(/appropriate). Only then start looking to replacing the new terms with established ones with matching semantics. But don’t see reusing things as more important than getting an (appropriately) accurate model.—Ayers

Ned Batchelder. He correctly identifies that Tim Bray's points tend to be most applicable to document-style XML. I've long since come to the conclusion (again with a lot of influence from Simon St.Laurent) that XML is too often the wrong solution for programmer-data-focused formats (including software configuration formats). Yeah, of course I've already elaborated in the Python context.

[Uche Ogbuji]

via Copia

Misalignments with the planets

John Clark alerted me that Copia has been missing from Planet XML. I noticed it was also missing from PlanetPython. The Planet XML problem turned out to be because the XSLT used to convert the Atom feed to RSS 1.0 was written for Atom 0.3, and so stopped working when I upgraded Copia to Atom 1.0 late last year. I updated the XSLT and that's sorted out (as an unintended result I pwned that planet for a day or two). Planet Python uses a category feed in Atom from Copia, and I think the problem is that the version of Planet used in this aggregator does not yet support Atom 1.0. Planet XML uses its own aggregation software and has supported Atom 1.0 for a while.

There have been moves to update (see this message, for example). Now that FeedParser 4.0 is out with Atom 1.0 support, I expect most planets will start to correct their Atom deficiencies.

Meanwhile, John and I have been working with Sylvain Hellegouarch on yet another planet, using our own aggregation software. More on that later.

[Uche Ogbuji]

via Copia

Confusion over Python storage form for Unicode

I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build will either store characters as UCS-2 or UCS-4. Option one is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which would be required if it were UTF-16. Option two is UCS-4, not UTF-32, although the difference in this case truly is academic and probably would only affect people using certain chunks of Private Use Areas.

You can't neatly categorize Python Unicode storage format by platform, either. True Jython is presently limited to UTF-16 storage, but you can compile CPython to use either UCS-2 or UCS-4 on any platform. To do so configure Python with the command --enable-unicode=ucs4. To check whether your Python is a UCS-4 build check that `sys.maxunicode > 65536`. I would love to say that you don't have to worry whether your Python uses UCS-2 or UCS-4. If you're communicating between Python tools you should be using abstract Unicode objects which would be seamlessly portable. The problem is that, as I warn at every opportunity, there are serious problems with how Python core libraries handle certain characters in UCS-2 builds, because of the lack of respect for surrogate pairs. It is for this reason that I advise CPython users to always use UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32) is almost always a waste of space, but wasting space is better than munging characters.

For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.

Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode, and of its Python implementation, so I bet someone will jump in with some correction, but I think I have the core, practical details right, whereas I think Henri's characterization was confusing.

[Uche Ogbuji]

via Copia

Recipe: fast scan of an XML file for one field

If you have a huge XML file and you need to grab the first instance of a particular field in a fast and memory efficient manner, a simple one-liner with Amara's pushbind does the trick.

val = unicode(amara.pushbind("book.xml", "title").next())

Returns the text of the first title element in a book.xml (which could be Docbook or any other format with a title element), loading hardly any of the file into memory. It also doesn't parse the file beyond the target element. It would be a shade slower to get such an element at the end of a file. For example, the following line gets the title of a Docbook index.

val = unicode(amara.pushbind("book.xml", "index/title").next())

Even when finding an element near the end of a file it's very fast. All my use cases were pretty much instantaneous working with a 150MB file (I'm working on convincing the client to avoid such huge files).

If the target element is not found, you'll get a StopIteration exception.

[Uche Ogbuji]

via Copia

Copia

Ogbujis on an abundance of topics

Tag xml

Agile Web #3: "Scripting Flickr with Python and REST"

Merging Atom 1.0 feeds with Python

Planet Atom

ElementTree in Python stdlib

Expat 2.0 (featured in 4Suite CVS)

Inventing XML for music

Learn how to invent XML languages, then do so

Some interesting reactions to the piece

Misalignments with the planets

Confusion over Python storage form for Unicode

Recipe: fast scan of an XML file for one field