Don't give me that monkey-ass Web 1.0, either

Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz. Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing for the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).

Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata, to participate in large parts of the Web marketplace beyond Google, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation is probably just a big, speculative bubble that will burst soon and take all these centralizing forces with it, but at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.

Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.

[Uche Ogbuji]

via Copia

The uneven state of Schematron

It has been a sketchy time for Schematron fans. Rick Jelliffe, the father of Schematron, has had pressing matters that have prevented him from putting much work into Schematron for a while. Many questions still remain about the technology specification, and alas, there appears to be no place to keep discussions going on these and other matters (the mailing list on SourceForge appears to be defunct, with even the archives giving a 404). Here are some notes on recent Schematron developments I've come across.

I wasn't paying enough attention and I just came across the new Schematron Web site. Launched in February, it supersedes the old Academia Sinica page. Some of the content was copied without much editing from the older site. The overview says "The Schematron can be useful in conjunction with many grammar-based structure-validation languages: DTDs, XML Schemas, RELAX, TREX, etc.", but RELAX and TREX were combined into RELAX NG years ago. Of greater personal interest is the fact that it carries over a bad link to my old Schematron/XSLT article. As I've corrected several times on the mailing list, that article is "Introducing the Schematron". Schematron.com also does not list two of my more recent articles:

Schematron.com does, however, include an entire page on ISO Schematron, including some sketchy updates I'm hoping to follow up on.

G. Ken Holman told me he created a modified version of the Schematron 1.5 reference XSLT implementation that allows the context of assertions to be attributes, not just elements. You can find his version linked from this message. I did point out to him that Scimitar (part of Amara) supports attributes as context, and overall attempts to be a fast and complete ISO Schematron implementation.
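
To give an idea of what an attribute-as-context rule looks like, here is a small made-up example of my own (ISO Schematron syntax, not Ken's code):

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <sch:rule context="item/@price">
      <!-- the assertion is evaluated with the price attribute itself as context -->
      <sch:assert test="number(.) &gt;= 0">An item's price must not be negative.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>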

[Uche Ogbuji]

via Copia

Quick SQLite note

Parand Tony Darugar left a question about SQLite on one of Chime's entries. I haven't used SQLite, but both Chime and Jeremy Kloth seem to like it. In fact, Jeremy has been testing it as a replacement for the DBM 4RDF driver and 4Suite repository flat file driver. I think all their work has been with SQLite+Python. Jeremy says "[I] rather like sqlite. It is aligned quite well with DB-API and quite fast". His summary: "fast, ACID, SQL". He says the minimum embeddable package size is 1.8MB of source code.
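
To give a flavor of what they're working with, here's a minimal DB-API sketch (I'm assuming the pysqlite2 driver module of the day; the triples table is just illustrative):

from pysqlite2 import dbapi2 as sqlite  # driver module name assumed

conn = sqlite.connect(':memory:')       # in-memory DB; pass a file name to persist
cur = conn.cursor()
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
cur.execute("INSERT INTO triples VALUES (?, ?, ?)", ('urn:s', 'urn:p', 'hello'))
conn.commit()                           # the ACID part: an atomic, durable commit
cur.execute("SELECT o FROM triples WHERE s = ?", ('urn:s',))
print cur.fetchone()[0]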

[Uche Ogbuji]

via Copia

timer.py, a specialization of timeit

Most Pythoneers are familiar with the very handy timeit module. It's a great way to compare Python idioms for performance. I tend to use it from the command line, as in the following.

$ python -m timeit "''.join([ str(i*10000) for i in xrange(100) ])"
10000 loops, best of 3: 114 usec per loop

You can time multi-line code as well, by passing multiple quoted arguments on the command line; each argument becomes a line of the snippet.

$ python -m timeit "s = ''" "for i in xrange(100):" "    s += str(i*10000)"
1000 loops, best of 3: 351 usec per loop

The python -m trick is new in Python 2.4. Notice the indentation built into the third argument string.

As you can imagine, this quickly becomes cumbersome, and it would be nice to have a way to perform such timings on proper script files without too much fiddling.

Jeremy Kloth scratched that itch, coming up with timer.py. I bundle it in the test directory of Amara, but you can also grab it directly from CVS.

You can run it on a script, putting the logic to be timed into a main function. The rest of the script's contents will be treated as set-up and not factored into the timings.

$ cat /tmp/buildstring.py
import cStringIO

def main():
    s = cStringIO.StringIO()
    for i in xrange(100):
        s.write(str(i*10000))
$ python timer.py /tmp/buildstring.py
1000 loops, best of 3: 444 usec

timer.py uses the basic logic from timeit: it tries to keep the total running time between 0.2 and 2 seconds.
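
The gist of that auto-ranging trick, sketched from the timeit module's general approach rather than from timer.py's actual source:

import timeit

def autorange(stmt, setup='pass'):
    # grow the loop count until one run takes at least 0.2 sec,
    # so that very fast snippets still get meaningful timings
    t = timeit.Timer(stmt, setup)
    number = 1
    while True:
        elapsed = t.timeit(number)
        if elapsed >= 0.2:
            return number, elapsed / number
        number *= 10

loops, per_loop = autorange("''.join([str(i*10000) for i in xrange(100)])")
print "%d loops: %g sec per loop" % (loops, per_loop)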

[Uche Ogbuji]

via Copia

Amara API quick reference, and Windows packages

I forgot to mention in the Amara 1.1.6 announcement that I drafted an API quick reference. I've put a link to it on the Amara home page.

I've also added a Windows installer created by Sylvain Hellegouarch, with some help from Jeremy Kloth. It's an installer for Amara "allinone", so all you need is Python 2.4 for Windows already installed; run this installer and you should be all set.

[Uche Ogbuji]

via Copia

"Ogbuji": those good ol' labial-velar plosives

I read in Serendipity about the new IPA phonetic symbol (the first new one in twelve years).

It represents a phoneme found in several African languages, among them Mono.

This got me to thinking that I didn't know how to put my own name in IPA. I learned how to use IPA in various adventures in language-related Usenet newsgroups ten years or so ago, but for some reason I don't remember having hunted down the various sounds in Igbo that are fairly rare in other languages.

A good example is my own last name (and Chimezie's, of course): "Ogbuji". I tell most people to just treat it like a silent "g", but that's really a bit of an ugly approximation. As an example of how different "b" and "gb" are in Igbo, the word "ebe" with high tone on both syllables means "where" (as subordinating conjunction or interrogative). The word "egbe" means "kite" (as in the hawk-like bird). You probably don't want to mix those two up. There are many other such cases.

When people really do want to try to say it right (as when Lori wanted to be sure she was getting her new last name right), I tell them that the "gb" sounds a bit as if you filled your mouth with air and forced it out suddenly, as if saying "b", while at the same time making the "g" sound in the back of your throat. Hmm. Surely linguists have a better description.

After some very enjoyable browsing through Wikipedia's IPA section (which is very well done), I quickly found Igbo "gb" filed under "voiced labial-velar plosive" (/g͡b/). Wikipedia's description of how to pronounce it is like mine, but more terse:

The voiceless labial-velar plosive is commonly found in West and Central Africa. To pronounce it, try saying [g], but simultaneously close your lips as you would for [b].

"voiceless" above is a typo for "voiced". There is a voiceless variant, more about that in a bit. Many phonetic symbol entries in Wikipedia have audio clips so you can hear the sound spoken, but not this one.

So the IPA for my last name is /oˑg͡buˑdʒiˑ/ (all the vowels are half-long).

A related Igbo sound is "kp" ("okpo" with low then high tone = "shrine"; "opo" = "leprosy"), which is the "voiceless labial-velar plosive" (/k͡p/):

The voiceless labial-velar plosive is commonly found in West and Central Africa. To pronounce it, try saying [k], but simultaneously close your lips as you would for [p].

There are other Igbo sounds that don't really exist in English, such as the uvular nasal "n" (/ɴ/) and the closed (and nasal) forms of i (/ɨ/) and u (/ɤ/).

[Uche Ogbuji]

via Copia

Woohoo! 2 phase commit for PostgreSQL

Recently, discussion of distributed transactions came up in the context of the 4Suite repository redesign discussions. We will probably need to move to XA-style two-phase commit (2PC). The problem is that Oracle supports 2PC, but few of the other back ends with 4Suite drivers do. Luckily, I've found that my favorite DBMS will soon support 2PC. The patch is in for PostgreSQL and became available in the 8.1beta4 release. It should be in production for PG 8.1 (currently in release candidate). Thanks to Tom, Heikki and Alvaro for this wonderful addition. For some interesting discussion of the 2PC patch, see this important thread. One of the well-known issues brought up is that 2PC in almost all implementations is much slower than a straight commit. I think we'll have to make our transaction manager smart enough to use direct commit if it detects that all resources are in the same database instance. I've written transaction managers before (specifically in implementing CORBA CosTransactions in Python on top of omniORB), and I remember such tricks all too well.
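
For the curious, the new SQL commands themselves are simple. A sketch of one participant's side (I'm assuming the psycopg2 driver; the table, columns and transaction ID are made up, and the server must allow prepared transactions):

import psycopg2  # any PostgreSQL DB-API driver should look much the same

conn = psycopg2.connect("dbname=repo")      # connection details hypothetical
cur = conn.cursor()
cur.execute("UPDATE resources SET content = %s WHERE path = %s",
            ('new body', '/docs/a.xml'))
# phase 1: the transaction is flushed to disk but left pending; it survives
# crashes and can be finished later, even from another connection
cur.execute("PREPARE TRANSACTION 'txn-42'")

# phase 2, once every participant has prepared successfully (COMMIT PREPARED
# must run outside a transaction block, hence the autocommit connection):
conn2 = psycopg2.connect("dbname=repo")
conn2.set_isolation_level(0)                # autocommit
conn2.cursor().execute("COMMIT PREPARED 'txn-42'")  # or ROLLBACK PREPARED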

Even once PG 8.1 is available, we'll still have to deal with non-XA-aware back ends. All we can do is fall back to the Last Resource Commit technique in the transaction manager. I found a useful discussion of this technique in "What to do if your database doesn't support XA transactions?", by Dmitri Maximovich. You actually risk losing both atomicity and isolation with this technique, not just atomicity as Dmitri says, but if you choose minimally competent back ends, such failures should be very unlikely.
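
The core of Last Resource Commit is just an ordering discipline. A sketch (the resource objects and method names here are hypothetical):

def lrc_commit(xa_resources, last_resource):
    # phase 1 for all the 2PC-capable participants
    for res in xa_resources:
        res.prepare()
    # the single non-XA resource commits last; its outcome decides the transaction
    try:
        last_resource.commit()
    except Exception:
        for res in xa_resources:
            res.rollback()
        raise
    # phase 2: a crash in this window is the residual atomicity risk
    for res in xa_resources:
        res.commit()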

[Uche Ogbuji]

via Copia

"Tip: Use the right pattern for simple text in RELAX NG"

"Tip: Use the right pattern for simple text in RELAX NG"

The RELAX NG XML schema language allows you to say "permit some text here" in a variety of ways. Whether you're writing patterns for elements or attributes, it is important to understand the nuances between the different patterns for character data. In this tip, Uche Ogbuji discusses the basic foundations for text in RELAX NG.

Several times while working on RELAX NG in mentoring roles with clients I've had to explain some of the nuances in the various ways to express simple text patterns. In this article I lay out some of the most common distinctions I've had to make. I should say that much of what I know about RELAX NG nuances I learned from Eric van der Vlist and a lot of that wisdom is gathered in his RELAX NG book (in print or online). I recommend the print book because it has some nice additions not in the online version, and because Eric deserves to eat.
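
To give a flavor of the distinctions involved (these examples are mine, not from the article; RELAX NG compact syntax, each line an alternative pattern for the same element):

element note { text }           # any character data; repeatable, mixable with element patterns
element note { xsd:string }     # a single typed value; cannot be mixed with element patterns
element note { string "yes" }   # only the literal value "yes"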

[Uche Ogbuji]

via Copia

Amara 1.1.6

I released Amara 1.1.6 last week (see the announcement). This version requires 4Suite XML 1.0b2. As usual, though, I have prepared an "allinone" package so that you do not need to install 4Suite separately to use Amara.

The biggest improvements in this release are to performance and to the API. Amara takes advantage of a lot of the great performance work that has gone into 4Suite (e.g. Saxlette). There is also a much easier API on-ramp that I expect most users will appreciate. Rather than having to parse using:

from amara import binderytools as bt
doc = bt.bind_string(XML) #or bt.bind_uri or bt.bind_file or bt.bind_stream

You can now simply use:

import amara
doc = amara.parse(XML) # whether XML is a string, file-like object, URI or local file path
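
Once parsed, a document can be navigated with plain attribute access. A quick illustration (the document here is made up for the example):

import amara
doc = amara.parse('<monty><python spam="eggs">Some text</python></monty>')
print doc.monty.python.spam   # prints the attribute value: eggs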

There are several other such simplifications. There is also the xml_append_template facility, which is very handy for generating XML (see how Sylvain uses it to simplify atomixlib).

Thanks to all the folks who helped with suggestions, patches, review, etc.

[Uche Ogbuji]

via Copia

Intimations of an evil Google?

The rumblings about innocent people getting caught in Google's super-secret spam and fraud detection systems are becoming impossible to ignore. The more I hear stories from folks I respect (Ned Batchelder's case is the latest I've run across), the more I think Google has a burgeoning problem on its hands. Google benefits greatly from its "do no evil" reputation, but that reputation is starting to show some seriously grungy shadows.

I'm also pondering the off chance that I may myself have fallen victim to some over-zealous fraud cop at Google. Recently I noticed that my personal home pages disappeared from Google search results. I had recently changed to a hacked-up CherryPy set-up for these pages, and my first thought was that I'd done something to violate some prime directive of search engine optimization (SEO). I've never paid much attention to SEO, so I figured I'd look into it when I had a moment and fix meta tags or whatever was looking skunky to Google's indexer. Then I noticed that the pages in question continue to occupy their typically high ranking on Yahoo search. I know the two indexes use different algorithms, but the contrast seems too sharp to be a question of simple metadata massage.

Then while doing research for my article "Google Sitemaps" I happened across this discussion of "exclusion" and "reinclusion" from Google indexes. This is the first I've ever heard of such matters, but it does rather feel like what happened to my home pages. I can't imagine what I would have done to fall afoul of Google's anti-fraud system considering I've never been the slightest bit interested in SEO, and my high ranking has always come because a gratifying number of people seem to like and link to what I write.

I'll be looking further into whether Google might have excluded my pages, but regardless of what's going on in my particular case, based on what I'm reading about Google lately, I'm beginning to think the company is really straining with the effort of balancing good vibes and break-neck growth.

[Uche Ogbuji]

via Copia