Seems Sam Ruby's presentation suffered a bit in the spectacle.
Unfortunately, I'm no stranger to presentation set-up problems. I've
also been lucky enough to have patient audiences. Maybe conference
organizers will someday factor Linux A/V support into consideration when
choosing venues (I can dream, eh?). I almost always can use projectors
and stuff with no problem in usual business scenarios, and I can only
guess that conference venues tend to have archaic A/V technology that
doesn't like Linux.
As for the presentation itself, based on the slides, much of it is an accumulation of issues
probably well known to, say, a long-time XML-DEV reader, but useful to
collect in one place. It looks like a much-needed presentation, and I
hope Sam gets to present it again, with better luck with the facilities.
Here follow a few reactions I had to stuff in the slides.
"expat only understands utf-8"
This hasn't been true for ages. Expat currently understands UTF-8,
UTF-16, ASCII, and ISO-8859-1 out of the box, and the user can add to this
list by registering an "unknown encoding" event handler.
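As a rough illustration of the built-in encodings, here is a minimal sketch using Python's standard xml.parsers.expat binding, which wraps Expat. The helper name collect_text and the sample documents are made up for this example. It does not exercise the unknown-encoding hook itself (XML_SetUnknownEncodingHandler in the C API), since the Python binding does not hand that hook to user code.

```python
import xml.parsers.expat


def collect_text(document_bytes):
    """Parse raw XML bytes with Expat and return the character data it reports.

    collect_text is a hypothetical helper for this sketch, not part of Expat.
    """
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document_bytes, True)  # True marks this as the final chunk
    return "".join(chunks)


# The same content serialized in two of Expat's built-in encodings.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")
utf16_doc = (
    '<?xml version="1.0" encoding="UTF-16"?><p>caf\u00e9</p>'
).encode("utf-16")  # Python's "utf-16" codec writes a Byte Order Mark first

latin1_text = collect_text(latin1_doc)
utf16_text = collect_text(utf16_doc)
```

Both calls should hand back the same Unicode string, with no UTF-8 input involved at any point.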
"Encoding was routinely ignored by most of the initial RSS parsers and
even the initial UserLand RSS validator. “Aggregators” did the
equivalent of strcat from various sources and left the results for the
browser"
Yuck. Unfortunately, I worry that Mark Pilgrim's Universal Feed
Parser might not help the situation with its
current practice of returning some character data
as strings without even guessed encoding information (that I could find,
anyway). I found it very hard to build a character-correct aggregator
around the Feed Parser 4.0 alpha version. Then again, I understand it's
a hard problem with all the character soup ("char soup"?) Web feeds out there.
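To make the strcat problem concrete, here is a small sketch, assuming two invented feed items that arrive in different encodings. The names merge_items, latin1_item and utf8_item are hypothetical. It contrasts gluing raw bytes together with decoding each item by its own encoding before joining.

```python
# Two made-up feed items, each serialized in a different encoding.
latin1_item = "R\u00e9sum\u00e9 tips".encode("iso-8859-1")
utf8_item = "Caf\u00e9 news".encode("utf-8")

# The "strcat" approach: concatenate raw bytes and leave the guessing
# to whoever consumes the result.
glued = latin1_item + b"\n" + utf8_item

# No single encoding decodes the glued bytes correctly.
try:
    glued.decode("utf-8")
    utf8_guess_failed = False
except UnicodeDecodeError:
    utf8_guess_failed = True
latin1_guess = glued.decode("iso-8859-1")  # decodes, but garbles the UTF-8 item


def merge_items(items):
    """Decode each (raw_bytes, encoding) pair separately, then join as text."""
    return "\n".join(raw.decode(encoding) for raw, encoding in items)


merged = merge_items([(latin1_item, "iso-8859-1"), (utf8_item, "utf-8")])
```

The catch, of course, is that merge_items needs to be told each item's encoding, which is exactly the information that gets lost when character data comes back without it.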
"[Buried] in a non-normative appendix, there is an indication that the
encoding specified in an XML document may not be authoritative."
Nope. There is no burial going on. As I thought I'd pointed out on
Copia before (but I can't find the entry now), section "4.3.3 Character
Encoding in Entities" of XML 1.0 says:
"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."
So the normative part of the spec also makes it quite clear that an
externally specified encoding can trump what's in the XML or text
declaration.
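The precedence described there can be sketched roughly as follows. This is a simplified illustration, not a conforming implementation: the function name effective_encoding and its parameters are my own invention, and it skips the autodetection of BOM-less UTF-16, UTF-32 and EBCDIC that a real XML processor has to attempt.

```python
import codecs
import re

# Matches an encoding declaration at the very start of an ASCII-compatible entity.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_encoding(data, transport_charset=None):
    """Pick the encoding to decode an XML entity with (simplified sketch).

    data is the raw bytes of the entity; transport_charset is whatever the
    external protocol supplied (e.g. an HTTP charset parameter), or None.
    """
    # 1. Information from the external transport protocol wins.
    if transport_charset:
        return transport_charset
    # 2. Otherwise a Byte Order Mark identifies the encoding family.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 3. Otherwise the encoding declaration, if there is one.
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 4. With neither a BOM nor a declaration, the entity must be UTF-8.
    return "utf-8"


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
from_transport = effective_encoding(doc, transport_charset="utf-8")
from_declaration = effective_encoding(doc)
from_default = effective_encoding(b"<a/>")
```

Note that the first branch is the whole point: the same bytes get a different answer once the transport layer has an opinion.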
"The accuracy of metadata is inversely proportional to the square of
the distance between the data and the metadata."
Very apt. I think that's why XML's attributes work as well as they do
(despite the fact that they are so inexplicably maligned in some
quarters).
"In fact, Microsoft’s new policy is that they will always completely ignore
[HTTP Content-Type] for feeds, even when the charset is explicitly
present"
XML of course doesn't force anyone to conform to RFC 3023, but
Microsoft could prove itself a really good Web citizen by adopting it.
Maybe they could lead the way to reducing the confusion I mention in
this entry.
I consider Ruby's section on the WS-* mess to be an excellent indictment
of the silly idea of universal and strong data typing.
"In general, programming XML is hard."
Indeed it is. Some people seem to think this is a product of
architecture astronautics. They are laughably mistaken. XML is hard
because managing data is hard. Programmers have developed terrible
habits through long years of just throwing their data over the wall at a
SQL DBMS and hoping all goes OK in the end. The Web is ruthless in
punishing such carelessness.
XML is the first technology that has forced mainstream programmers to
think hard about data. This is a boundlessly good thing.
Let the annoyances proliferate (that's your cue, Micah).
[Uche Ogbuji]