Ouch. I feel your pain, Sam

Seems Sam Ruby's presentation suffered a bit in the spectacle. Unfortunately, I'm no stranger to presentation set-up problems. I've also been lucky enough to have patient audiences. Maybe conference organizers will someday factor Linux A/V support into consideration when choosing venues (I can dream, eh?). I can almost always use projectors and the like with no problem in typical business settings, and I can only guess that conference venues tend to have archaic A/V technology that doesn't like Linux.

As for the presentation itself, based on the slides much of it is an accumulation of issues probably well known to, say, a long-time XML-DEV reader, but useful to collect in one place. It looks like a much-needed presentation, and I hope Sam gets to present it again, with better luck with the facilities. Here follow a few reactions I had to stuff in the slides.

expat only understands utf-8

This hasn't been true for ages. Expat currently understands UTF-8, UTF-16, ASCII, and ISO-8859-1 out of the box, and the user can add to this list by registering an "unknown encoding" handler.
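For instance, here is a minimal sketch using Python's binding to Expat (`xml.parsers.expat`) to parse a document declared as ISO-8859-1, one of the encodings Expat handles out of the box:

```python
import xml.parsers.expat

# A document declaring ISO-8859-1; the byte 0xe9 is Latin-1 for 'é'
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

chunks = []
parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = chunks.append
parser.Parse(doc, True)

print(''.join(chunks))  # -> café
```

(The "unknown encoding" hook is part of Expat's C API, `XML_SetUnknownEncodingHandler`; Python's binding uses a similar mechanism internally to support additional 8-bit encodings via Python's own codecs.)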

Encoding was routinely ignored by most of the initial RSS parsers and even the initial UserLand RSS validator. “Aggregators” did the equivalent of strcat from various sources and left the results for the browser

Yuck. Unfortunately, I worry that Mark Pilgrim's Universal Feed Parser might not help the situation, given its current practice of returning some character data as byte strings without even a guessed encoding (none that I could find, anyway). I found it very hard to build a character-correct aggregator around the Feed Parser 4.0 alpha. Then again, I understand it's a hard problem given all the character soup ("char soup"?) in Web feeds out there.
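The defensive decoding an aggregator ends up doing might look something like this hypothetical helper (not part of any feed parser's API; the name and fallback order are my own assumptions):

```python
def ensure_text(value, declared=None):
    """Coerce possibly-bytes parser output to text.

    Try the declared encoding first (e.g. from HTTP or the XML
    declaration), then UTF-8, then fall back to ISO-8859-1, which
    maps every byte to *some* character and so never raises.
    """
    if isinstance(value, str):
        return value
    for encoding in (declared, 'utf-8'):
        if encoding:
            try:
                return value.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass
    return value.decode('iso-8859-1')

print(ensure_text(b'caf\xc3\xa9'))  # UTF-8 bytes -> café
print(ensure_text(b'caf\xe9'))      # Latin-1 bytes -> café, via the fallback
```

Of course a last-ditch Latin-1 fallback only guarantees you get *characters*, not necessarily the *right* characters, which is exactly the char soup problem.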

[Buried] in a non-normative appendix, there is an indication that the encoding specified in an XML document may not be authoritative.

Nope. There is no burial going on. As I believe I've pointed out on Copia before (though I can't find the entry now), section "4.3.3 Character Encoding in Entities" of XML 1.0 says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

So the normative part of the spec also makes it quite clear that an externally specified encoding can trump what's in the XML or text declaration.

The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.

Very apt. I think that's why XML's attributes work as well as they do (despite the fact that they are so inexplicably maligned in some quarters).

In fact, Microsoft’s new policy is that they will always completely ignore [HTTP Content-Type] for feeds—even when the charset is explicitly present

XML of course doesn't force anyone to conform to RFC 3023, but Microsoft could prove itself a really good Web citizen by adopting it. Maybe they could lead the way to reducing the confusion I mention in this entry.
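For the mechanical part of honoring the header, extracting a charset parameter from a Content-Type value is straightforward; here is a sketch using the standard library's MIME header parsing (the function name is mine). RFC 3023's actual rules go further, e.g. text/xml with no charset defaults to us-ascii rather than deferring to the XML declaration, while application/xml does defer to it:

```python
from email.message import Message

def charset_from_content_type(value):
    """Pull the charset parameter, if any, out of a Content-Type value."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_param('charset')

print(charset_from_content_type('text/xml; charset=iso-8859-1'))  # iso-8859-1
print(charset_from_content_type('application/xml'))               # None
```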

I think of Ruby's section on the WS-* mess as an excellent indictment of the silly idea of universal and strong data typing.

In general, programming XML is hard.

Indeed it is. Some people seem to think this is a product of architecture astronautics. They are laughably mistaken. XML is hard because managing data is hard. Programmers have developed terrible habits through long years of just throwing their data over the wall at a SQL DBMS and hoping all goes OK in the end. The Web is ruthless in punishing such diffidence.

XML is the first technology that has forced mainstream programmers to truly have to think hard about data. This is a boundlessly good thing. Let the annoyances proliferate (that's your cue, Micah).

[Uche Ogbuji]

via Copia
4 responses
Uche, this is off topic for this post but on topic for the site.  It appears that your server is sending a Content-Type header, but actually, it sends utf-8.  I'm using Firefox on Windows 2000, and Firefox is telling your server that it will accept iso-8859-1.  This matters in practice because you use some non-iso-8859-1 characters fairly often.


Expat only understands you tee eff DASH eight, not you tee eff eight.  My point is that you can get everything else right and get tripped up by a missing dash -- and never quite realize it if the parser you use accepts both, as libxml2 does.
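A trivial lint for exactly that trap might check declared encoding names against the short list Expat itself guarantees (the constant and function names here are my own sketch, not any library's API):

```python
# Encodings Expat understands out of the box (compare case-insensitively).
# Note "utf-8" with the dash: "utf8" is not on the list, though more
# forgiving parsers such as libxml2 accept it too.
EXPAT_BUILTINS = {'utf-8', 'utf-16', 'iso-8859-1', 'us-ascii'}

def expat_understands(name):
    return name.lower() in EXPAT_BUILTINS

print(expat_understands('UTF-8'))  # True
print(expat_understands('utf8'))   # False -- the missing dash
```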

I'll investigate that.  Thanks.  I actually know there are several areas I have to work on in PyBlosxom to get such things right.  It's just been hard to marshal the time to do so.


OK.  I guess that's where being in the room during the discussion would have helped me :-)

It looks like charset=UTF-8 to me when I check with Mozilla LiveHTTPHeaders.  On what specific URL did you see UTF-8?  What header-sniffing tool were you using?