Follow-up Copia housecleaning

Of course chores lead to more chores. After last week's round of tweaks to Copia I got a suggestion from Aristotle to rearrange the entry-specific titles, and I've done so. I got a bit more info from Tom Passin about possible encoding problems that has only deepened my bafflement.

I also noticed there has been some confusion over last week's birth announcement. It came from Chimezie, not me (congrats, brother!). On Copia the author is specified for each entry, but previously no such useful distinction was being made in the Atom 0.3 (I'll be working on an Atom 1.0 flavor for PyBlosxom soon) or RSS 1.0 feeds. I've fixed that, but I've done so in a way I'm not sure all feed sinks will process correctly. In the Atom feed there is a top-level

<author>
    <name>Uche and Chimezie Ogbuji</name>
    <url>http://copia.ogbuji.net/blog/</url>
    <email>uche@ogbuji.net</email>
</author>

And then each entry has a more specific author, for example:

<title>Chikaora Zion Credell Ogbuji</title>
...
<author>
    <name>chimezie</name>
</author>

I hope that helps. I made some other tweaks to the feeds, and this does seem to have had the unfortunate side-effect of pushing everything back onto the front page of Planet XML. My apologies to Planet XML readers (including me: I'd hoped to catch up after the holidays and found only Copia entries).

Copia already tells you the author of each entry, in the info line at the end of the entry.

[Uche Ogbuji]

via Copia

Copia housecleaning

I've finally had some time today, as I prepare for the holidays, to fix some things on Copia that have been broken for too long. Some of the highlights, especially concerning issues mentioned by readers (thanks, guys), are:

RSS 1.0 feed body fix. Added the rss:description field to the RSS 1.0 feed, which fixes missing post bodies in readers such as Bloglines that don't support content:encoded. I do truncate the field to 500 characters, according to the recommendation in the spec.

Single entry view title fix. Added entry titles for single entry pages. Before today, if you viewed this entry through the perma-link, the title would just say "Copia"; now it says "Copia ✏Copia housecleaning". I've wanted to do this for a while, but I was having the devil of a time figuring out how to do it with PyBlosxom. A scolding from Dan Connolly forced me to chase down a fix. For other PyBlosxom users, the trick is to use the comments plug-in, copy the head.* flavor file to comment-head.*, and then update it to use the $title variable, which is the title of the entry itself ($blog_title is the title of the entire blog). In my case the updated HTML header template looks like:

<title>$blog_title &#x270F;$title</title>

I did get a report that Copia is incorrectly sending a Content-Type header of text/html;charset=ISO-8859-1, but when I check using the LiveHTTPHeaders extension for Firefox on Linux, the server reports the correct charset=UTF-8. If anyone else can corroborate this issue, please leave a comment with the specific URL on which you noticed the error, your platform and browser, and the HTTP sniffing tool you were using. Thanks.
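
For anyone who wants to corroborate without installing a browser extension, the server can be asked directly. A minimal sketch using the Python standard library (any HTTP client would do as well):

import httplib

# Ask the server for just the headers of the blog front page
conn = httplib.HTTPConnection('copia.ogbuji.net')
conn.request('HEAD', '/blog/')
response = conn.getresponse()
# I expect this to print text/html; charset=UTF-8
print response.getheader('content-type')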

[Uche Ogbuji]

via Copia

Ouch. I feel your pain, Sam

Seems Sam Ruby's presentation suffered a bit in the spectacle. Unfortunately, I'm no stranger to presentation set-up problems. I've also been lucky enough to have patient audiences. Maybe conference organizers will someday factor Linux A/V support into consideration when choosing venues (I can dream, eh?). I can almost always use projectors and the like with no problem in typical business settings, so I can only guess that conference venues tend to have archaic A/V technology that doesn't like Linux.

As for the presentation itself, based on the slides, much of it is an accumulation of issues probably well known to, say, a long-time XML-DEV reader, but useful to collect in one place. It looks like a much-needed presentation, and I hope Sam gets to present it again, with better luck with the facilities. Here follow a few reactions I had to stuff in the slides.

expat only understands utf-8

This hasn't been true for ages. Expat currently understands UTF-8, UTF-16, ASCII, and ISO-8859-1 out of the box, and the user can add to this list by registering an "unknown encoding" event handler.
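
As a quick sanity check, Python's pyexpat binding will happily parse a non-UTF-8 document. A minimal sketch (the document content here is made up):

import xml.parsers.expat

# An ISO-8859-1 document; \xe9 is "é" in Latin-1, not valid UTF-8
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\xe9</doc>'

def char_data(data):
    # expat hands back character data already decoded to Unicode
    print repr(data)   # prints u'caf\xe9'

p = xml.parsers.expat.ParserCreate()
p.CharacterDataHandler = char_data
p.Parse(doc, True)

The C-level API goes further still: XML_SetUnknownEncodingHandler lets an application supply conversion tables for encodings expat doesn't know natively.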

Encoding was routinely ignored by most of the initial RSS parsers and even the initial UserLand RSS validator. “Aggregators” did the equivalent of strcat from various sources and left the results for the browser

Yuck. Unfortunately, I worry that Mark Pilgrim's Universal Feed Parser might not help the situation, given its current practice of returning some character data as strings without even guessed encoding information (none that I could find, anyway). I found it very hard to build a character-correct aggregator around the Feed Parser 4.0 alpha. Then again, I understand it's a hard problem, given all the character soup ("char soup"?) in Web feeds out there.

[Buried] in a non-normative appendix, there is an indication that the encoding specified in an XML document may not be authoritative.

Nope. There is no burial going on. As I thought I'd pointed out on Copia before (but I can't find the entry now), section "4.3.3 Character Encoding in Entities" of XML 1.0 says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

So the normative part of the spec also makes it quite clear that an externally specified encoding can trump what's in the XML or text declaration.
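
Honoring that rule is not hard in practice. Here's a minimal sketch with the Python standard library (the feed URL is hypothetical); passing an explicit encoding to ParserCreate makes expat ignore the document's own declaration:

import urllib2
import xml.parsers.expat

f = urllib2.urlopen('http://example.org/feed.xml')
# charset from the HTTP Content-Type header, or None if absent
charset = f.headers.getparam('charset')
# An explicit encoding here overrides the XML/text declaration
p = xml.parsers.expat.ParserCreate(encoding=charset)
p.Parse(f.read(), True)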

The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.

Very apt. I think that's why XML's attributes work as well as they do (despite the fact that they are so inexplicably maligned in some quarters).

In fact, Microsoft’s new policy is that they will always completely ignore [HTTP Content-Type] for feeds—even when the charset is explicitly present

XML of course doesn't force anyone to conform to RFC 3023, but Microsoft could prove itself a really good Web citizen by adopting it. Maybe they could lead the way to reducing the confusion I mention in this entry.

I think of Ruby's section on the WS-* mess as an excellent indictment of the silly idea of universal and strong data typing.

In general, programming XML is hard.

Indeed it is. Some people seem to think this is a product of architecture astronautics. They are laughably mistaken. XML is hard because managing data is hard. Programmers have developed terrible habits through long years of just throwing their data over the wall at a SQL DBMS and hoping all goes OK in the end. The Web is ruthless in punishing such diffidence.

XML is the first technology that has forced mainstream programmers to truly have to think hard about data. This is a boundlessly good thing. Let the annoyances proliferate (that's your cue, Micah).

[Uche Ogbuji]

via Copia

Help needed with Python/SAX implementation comparison

If you want to help clarify the implementation differences across Python/SAX implementations and you are familiar with XSLT, please respond to Mike Brown's request for help. Andrew Clover's similar work for Python DOM implementations has proved a very useful resource, and it would be nice to have the same for SAX. Mike has done the hard part. He just needs someone to carry it across the finish line. I did discuss some of the unfortunate Python/SAX confusion in "Practical SAX Notes". A tabular analysis would be a nice addition to that discussion.

[Uche Ogbuji]

via Copia

Programmatic Access to Repository (SOAP/FtRPC)

This is meant to be a follow-up to my last entry, covering the programmatic remote protocols supported by the 4Suite repository.

FtRPC

The 4Suite repository supports an internal RPC protocol, with a Python implementation that provides programmatic access to the repository. The automated 4Suite build process has recently changed significantly, so you can now browse the Python API documentation (courtesy of John L Clark's build). Each repository can serve its own FtRPC instance (the default port is 8803).

SOAP

The 4Suite repository supports a SOAP mapping that essentially attempts to serve as a translation mechanism between SOAP and the internal repository API. A repository instance can manage a SOAP server instance (the default port is 8090).

SOAP Service Namespace

The namespace URI associated with the SOAP service is:

http://xmlns.4suite.org/reserved#services

Authentication

Each SOAP message can have authentication information in the SOAP Header. The format is:

<SOAP-ENV:Header>
  <ftsoap:authenticationHeader>
    <ftsoap:sessionId>..</ftsoap:sessionId>
    <ftsoap:sessionKey>..</ftsoap:sessionKey>
    <ftsoap:authenticatingUser>.. user ..</ftsoap:authenticatingUser>
    <ftsoap:authenticatingPassword></ftsoap:authenticatingPassword>
  </ftsoap:authenticationHeader>
</SOAP-ENV:Header>

where the SOAP-ENV prefix is bound to:

http://schemas.xmlsoap.org/soap/envelope/

Session authentication is supported with the first two header entries. The other two are for simple/basic authentication (very similar to the HTTP scenario).

Message-to-Repo API Mapping

SOAP messages are invoked against repository resources. The local name of the child element of SOAP-ENV:Body (in the ftsoap namespace) is mapped to the name of the method to invoke, and its child elements are mapped to the method's parameters; a sketch of a full request follows the list below. There are certain special parameters:

  • scrpath (the repository path of the resource to execute the method against)
  • base64 (a boolean value indicating whether or not the content is Base 64 encoded – false by default)
  • src (the content – transmitted as pure text or Base 64 encoded)
  • updateSrc (used by the xUpdate method as the XUpdate document – transmitted as pure text or Base 64 encoded)
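
To make that concrete, here is a rough sketch of what a request might look like on the wire, assuming the ftsoap prefix is bound to the service namespace given above (an illustration, not a verified capture):

<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:ftsoap="http://xmlns.4suite.org/reserved#services">
  <SOAP-ENV:Body>
    <!-- the local name "xUpdate" selects the repository method -->
    <ftsoap:xUpdate>
      <ftsoap:scrpath>/some/repo/path</ftsoap:scrpath>
      <ftsoap:base64>false</ftsoap:base64>
      <ftsoap:updateSrc>.. XUpdate document as text ..</ftsoap:updateSrc>
    </ftsoap:xUpdate>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>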

The method is invoked with one of the following as the response:

  • ftsoap:successResponse (if there is nothing returned)
  • ftsoap:valueResponse (the string representation of the value returned)
  • ftsoap:Resource (if a resource itself is returned)
  • SOAP-ENV:Fault (includes Base 64 encoded traceback string)

[Diagram: the structure of the ftsoap:Resource response element]

[Uche Ogbuji]

via Copia

Don't give me that monkey-ass Web 1.0, either

Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz. Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing for the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).

Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata, to participate in large parts of the marketplace beyond Google, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation may just be a big, speculative bubble that will burst soon and take with it all these centralizing forces. But at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.

Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.

[Uche Ogbuji]

via Copia

The uneven state of Schematron

It has been a sketchy time for Schematron fans. Rick Jelliffe, the father of Schematron, has had pressing matters that have prevented him from putting much work into Schematron for a while. Many questions still remain about the technology specification, and alas, there appears to be no place to keep discussions going on these and other matters (the mailing list on SourceForge appears to be defunct, with even the archives giving a 404). Here are some notes on recent Schematron developments I've come across.

I wasn't paying enough attention, and I just came across the new Schematron Web site. Launched in February, it supersedes the old Academia Sinica page. Some of the content was copied without much editing from the older site. The overview says "The Schematron can be useful in conjunction with many grammar-based structure-validation languages: DTDs, XML Schemas, RELAX, TREX, etc.", but RELAX and TREX were combined into RELAX NG years ago. Of greater personal interest is the fact that it carries over a bad link to my old Schematron/XSLT article. As I've corrected several times on the mailing list, that article is "Introducing the Schematron". Schematron.com also does not list two of my more recent articles.

Schematron.com does, however, include an entire page on ISO Schematron, including some sketchy updates I'm hoping to follow up on.

G. Ken Holman told me he created a modified version of the Schematron 1.5 reference XSLT implementation that allows the context of assertions to be attributes, not just elements. You can find his version linked from this message. I did point out to him that Scimitar (part of Amara) supports attributes as context, and overall attempts to be a fast and complete ISO Schematron implementation.
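
For readers who haven't seen Schematron, the context issue is easy to picture. In a stock Schematron 1.5 schema the context of a rule is an element, as in this small made-up example; Ken's modification (and Scimitar) also allow an attribute there:

<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:pattern name="price checks">
    <!-- context is the item element; with attribute-context support,
         something like context="item/@price" also becomes possible -->
    <sch:rule context="item">
      <sch:assert test="@price &gt; 0">An item must have a positive price.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>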

[Uche Ogbuji]

via Copia

Quick SQLite note

Parand Tony Darugar left a question about SQLite on one of Chime's entries. I haven't used SQLite, but both Chime and Jeremy Kloth seem to like it. In fact, Jeremy has been testing it as a replacement for the DBM 4RDF driver and the 4Suite repository flat file driver. I think all their work has been with SQLite plus Python. Jeremy says "[I] rather like sqlite. It is aligned quite well with DB-API and quite fast", summing it up as "fast, ACID, SQL". He says the minimum embeddable package size is 1.8M of source code.
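
For the curious, the DB-API alignment looks like this. A minimal sketch, assuming the sqlite3 module bundled with Python 2.5 (the earlier pysqlite2 package exposes the same interface via pysqlite2.dbapi2):

import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE triples (s TEXT, p TEXT, o TEXT)')
cur.execute('INSERT INTO triples VALUES (?, ?, ?)',
            ('urn:ex:s', 'urn:ex:p', 'urn:ex:o'))
conn.commit()                        # transactions, hence the ACID
cur.execute('SELECT o FROM triples WHERE s = ?', ('urn:ex:s',))
print cur.fetchone()[0]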

[Uche Ogbuji]

via Copia

timer.py, a specialization of timeit

Most Pythoneers are familiar with the very handy timeit module. It's a great way to compare Python idioms for performance. I tend to use it from the command line, as in the following.

$ python -m timeit "''.join([ str(i*10000) for i in xrange(100) ])"
10000 loops, best of 3: 114 usec per loop

You can use this method to time multi-line code as well, using multiple command line quoted arguments.

$ python -m timeit "s = ''" "for i in xrange(100):" "    s += str(i*10000)"
1000 loops, best of 3: 351 usec per loop

The python -m trick is new in Python 2.4. Notice the indentation built into the third argument string.

As you can imagine, this quickly becomes cumbersome, and it would be nice to have a way to perform such timings on proper script files without too much fiddling.

Jeremy Kloth scratched that itch, coming up with timer.py. I bundle it in the test directory of Amara, but you can also grab it directly from CVS.

You can run it on a script, putting the logic to be timed into a main function. The rest of the script's contents will be treated as set-up and not factored into the timings.

$ cat /tmp/buildstring.py
import cStringIO

def main():
    s = cStringIO.StringIO()
    for i in xrange(100):
        s.write(str(i*10000))
$ python timer.py /tmp/buildstring.py
1000 loops, best of 3: 444 usec

timer.py uses the basic logic from timeit. It tries to keep the running time between 0.2 and 2 secs.
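
That autoranging is easy to picture in terms of timeit's own API. This is not the actual timer.py code, just a sketch of the general approach, assuming buildstring.py from above is importable:

import timeit

# Importing the module is the set-up; only main() itself gets timed
t = timeit.Timer('main()', setup='from buildstring import main')

number = 1
while True:
    if t.timeit(number) >= 0.2:   # grow the loop count until a run
        break                     # takes long enough to time reliably
    number *= 10

best = min(t.repeat(repeat=3, number=number)) / number
print '%d loops, best of 3: %.0f usec' % (number, best * 1e6)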

[Uche Ogbuji]

via Copia