If you want to help clarify the implementation differences across Python/SAX implementations and you are familiar with XSLT, please respond to Mike Brown's request for help. Andrew Clover's similar work for Python DOM implementations has proved a very useful resource, and it would be nice to have the same for SAX. Mike has done the hard part. He just needs someone to carry it across the finish line. I did discuss some of the unfortunate Python/SAX confusion in "Practical SAX Notes". A tabular analysis would be a nice addition to that discussion.
So today I tried to import OPML (yeah, that very OPML) into Findory (see last entry). The OPML is based on what I originally exported from Lektora and has been through all my feed experiments. A sample entry:
<outline url="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>
What does Findory tell me? 97 feeds rejected for "invalid source".
Great. Now I actually have to get my hands dirty in OPML again. I check the spec. Of course there's no useful information there. I eventually found this Wiki of OPML conventions. I saw the type='rss' convention, but that didn't seem to make a difference. I also tried xmlUrl rather than url, like so:
<outline xmlUrl="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>
This time the Findory import works.
But not only do several of the feed readers I use have url rather than xmlUrl, but the XBEL to OPML XSLT I've found assumes that name as well. The conventions page also mentions text as a way to provide formatting in some vague way, but I've seen OPML feeds use only title and nary a text to be seen anywhere. Besides, what's wrong with the XML way of allowing formatting: elements rather than attributes? It's enough to boil the brain.
Speaking of XBEL, that's actually how I'm managing my feeds now, as I'll discuss further in the next entry. Now that Web feeds have become important to me I'll be using a sane format to manage them, thank you very much. I'll do the XSLT thing to export OPML for all the different tools that insist on tag soup. That is, of course, if I can figure out what precise shade of OPML each tool insists on. Today's adventure with feed URL attributes makes me wonder whether there is any escaping the chaos.
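Given the chaos, one defensive option when exporting OPML (a sketch of my own, not any tool's documented behavior) is to emit both attribute spellings on each outline, so that importers with either expectation can cope: Findory wants xmlUrl, while other readers want url. The feed list below is illustrative, and this assumes a Python with ElementTree available:

```python
import xml.etree.ElementTree as ET

# Illustrative feed list; in practice this would come from XBEL or
# whatever sane format is actually managing the subscriptions
feeds = [('Parand Tony Darugar', 'http://www.parand.com/say/index.php/feed/')]

opml = ET.Element('opml', version='1.1')
body = ET.SubElement(opml, 'body')
for title, url in feeds:
    # Belt and braces: set both url and xmlUrl to the same value
    ET.SubElement(body, 'outline', text=title, type='rss',
                  url=url, xmlUrl=url)
print(ET.tostring(opml, encoding='unicode'))
```

The output carries redundant attributes, but that redundancy is exactly what papers over the disagreement between importers.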
Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz: Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).
Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata to participate in large parts of the not-Google-after-Web marketplace, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation is probably just a big, speculative bubble that will burst soon and take with it all these centralizing forces, but at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.
Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.
It has been a sketchy time for Schematron fans. Rick Jelliffe, the father of Schematron, has had pressing matters that have prevented him from putting much work into Schematron for a while. Many questions still remain about the technology specification, and alas, there appears to be no place to keep discussions going on these and other matters (the mailing list on SourceForge appears to be defunct, with even the archives giving a 404). Here are some notes on recent Schematron developments I've come across.
I wasn't paying enough attention and I just came across the new Schematron Web site. Launched in February, it supersedes the old Academia Sinica page. Some of the content was copied without much editing from the older site. The overview says "The Schematron can be useful in conjunction with many grammar-based structure-validation languages: DTDs, XML Schemas, RELAX, TREX, etc.", but RELAX and TREX were combined into RELAX NG years ago. Of greater personal interest is the fact that it carries over a bad link to my old Schematron/XSLT article. As I've corrected several times on the mailing list, that article is "Introducing the Schematron". Schematron.com also does not list two of my more recent articles:
- "A hands-on introduction to Schematron"—a friendly tutorial with step-by-step examples (free registration required)
- "Discover the flexibility of Schematron abstract patterns"—an introduction to what I think is one of Schematron's most powerful and least appreciated features
Schematron.com does, however, include an entire page on ISO Schematron, including some sketchy updates I'm hoping to follow up on.
G. Ken Holman told me he created a modified version of the Schematron 1.5 reference XSLT implementation that allows the context of assertions to be attributes, not just elements. You can find his version linked from this message. I did point out to him that Scimitar (part of Amara) supports attributes as context, and overall attempts to be a fast and complete ISO Schematron implementation.
Most Pythoneers are familiar with the very handy timeit module. It's a great way to compare Python idioms for performance. I tend to use it from the command line, as in the following.
$ python -m timeit "''.join([ str(i*10000) for i in xrange(100) ])"
10000 loops, best of 3: 114 usec per loop
You can use this method to time multi-line code as well, using multiple command line quoted arguments.
$ python -m timeit "s = ''" "for i in xrange(100):" " s += str(i*10000)"
1000 loops, best of 3: 351 usec per loop
The python -m trick is new in Python 2.4. Notice the indentation built into the third argument string.
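If the shell quoting gets awkward, the same multi-line timing can be done programmatically with timeit.Timer (a sketch of the equivalent call; range stands in for xrange here so it also runs on modern Pythons):

```python
import timeit

# Multi-line statement as a triple-quoted string, mirroring the
# three quoted command-line arguments above
stmt = """\
s = ''
for i in range(100):
    s += str(i * 10000)
"""
t = timeit.Timer(stmt)
# Best of 3 runs of 1000 loops, reported as microseconds per loop
best = min(t.repeat(repeat=3, number=1000)) / 1000
print("best of 3: %.3g usec per loop" % (best * 1e6))
```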
As you can imagine, this quickly becomes cumbersome, and it would be nice to have a way to perform such timings on proper script files without too much fiddling.
You can run it on a script, putting the logic to be timed into a function. The rest of the script's contents will be treated as set-up and not factored into the timings.
$ cat /tmp/buildstring.py
import cStringIO
def main():
    s = cStringIO.StringIO()
    for i in xrange(100):
        s.write(str(i*10000))
$ python timer.py /tmp/buildstring.py
1000 loops, best of 3: 444 usec
timer.py uses the basic logic from timeit. It tries to keep the running time between 0.2 and 2 secs.
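The script itself isn't reproduced here, but the auto-ranging logic it borrows from timeit can be sketched roughly as follows (my own reconstruction, not the actual timer.py; the 0.2-to-2-second window comes from the description above):

```python
import timeit

def autorange_time(stmt, setup='pass'):
    """Time stmt, scaling the loop count so each run takes at
    least 0.2 seconds, then report the best of 3 runs."""
    t = timeit.Timer(stmt, setup=setup)
    number = 1
    while True:
        elapsed = t.timeit(number)
        if elapsed >= 0.2:
            break
        number *= 10
    # Best-of-3, normalized to seconds per single loop iteration
    best = min(t.repeat(repeat=3, number=number)) / number
    return number, best

number, best = autorange_time("''.join([str(i * 10000) for i in range(100)])")
print("%d loops, best of 3: %.3g usec per loop" % (number, best * 1e6))
```

The same scaling idea is what lets timeit pick "10000 loops" or "1000 loops" automatically in the command-line examples above.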
I've also added a Windows installer created by Sylvain Hellegouarch, with some help from Jeremy Kloth. It's an installer for Amara "allinone", so all you need is to have installed Python 2.4 for Windows, then you run this installer, and you should be all set.
The RELAX NG XML schema language allows you to say "permit some text here" in a variety of ways. Whether you're writing patterns for elements or attributes, it is important to understand the nuances between the different patterns for character data. In this tip, Uche Ogbuji discusses the basic foundations for text in RELAX NG.
Several times while working on RELAX NG in mentoring roles with clients I've had to explain some of the nuances in the various ways to express simple text patterns. In this article I lay out some of the most common distinctions I've had to make. I should say that much of what I know about RELAX NG nuances I learned from Eric van der Vlist and a lot of that wisdom is gathered in his RELAX NG book (in print or online). I recommend the print book because it has some nice additions not in the online version, and because Eric deserves to eat.
I released Amara 1.1.6 last week (see the announcement). This version requires 4Suite XML 1.0b2. As usual, though, I have prepared an "allinone" package so that you do not need to install 4Suite separately to use Amara.
The biggest improvements in this release are to performance and to the API. Amara takes advantage of a lot of the great performance work that has gone into 4Suite (e.g. Saxlette). There is also a much easier API on-ramp that I expect most users will appreciate. Rather than having to parse using:
from amara import binderytools as bt
doc = bt.bind_string(XML) #or bt.bind_uri or bt.bind_file or bt.bind_stream
You can use
import amara
amara.parse(XML) #Whether XML is string, file-like object, URI or local file path
Thanks to all the folks who helped with suggestions, patches, review, etc.
Uche Ogbuji's new XML.com column, "Agile Web," explores the intersection of agile programming languages and Web 2.0. In this first installment he examines Google's Sitemaps schema, as well as Python and XSLT code to generate site maps. [Oct. 26, 2005]
I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and web technologies, particularly the sorts of dynamic developments on the web for which some use the moniker, "Web 2.0." The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.
In this first article I introduce the Google SiteMaps program, XML format and Python tools.