Today's XML WTF

via Sam Ruby:

While [REXML] is certainly the most elegant Ruby XML API, it seems to accept a variety of ill-formed XML fragments, for example the following produces no error: [<div>at&t]

F'real? That is, not only missing end tag, but also unescaped ampersand?
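For comparison, here is a minimal sketch of what rejection looks like, using Python's standard-library ElementTree (my choice of tool for illustration, not something from Sam's post). The helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The underlying parser refuses the document outright.
        return False
    return True


# Sam's fragment: missing end tag and a bare ampersand.
print(is_well_formed("<div>at&t"))            # False
# The repaired version: escaped ampersand, closed element.
print(is_well_formed("<div>at&amp;t</div>"))  # True
```

A conforming parser should stop at the first well-formedness error rather than return a tree for the first input.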

It is just not frigging cool to be releasing anything called an XML parser or processor in 2005 that does not reject ill-formed XML. Folks, well-formedness is the entire point of XML. If that's an inconvenient fact for you, please be so kind as to use something other than XML. What is even more galling is this from the REXML home page:

REXML is an XML processor for the language Ruby. REXML is conformant (passes 100% of the Oasis non-validating tests), and includes full XPath support.

On Sam's evidence (and you don't get much more credible than Sam Ruby), this statement is quite false. The OASIS XML 1.0 tests have a whole section covering rejection of non-well-formed documents.
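To make the idea of a "not well-formed" test section concrete, here is a rough sketch of such a check in Python. The inline cases are my own stand-ins, not files from the OASIS suite, and `expat_accepts` and `conformance_failures` are hypothetical names; the real suite ships its cases as files with a catalog.

```python
import xml.parsers.expat

# Stand-in cases (assumptions for illustration, not taken from the OASIS suite).
NOT_WF_CASES = [
    "<div>at&t",           # Sam's example
    "<a><b></a></b>",      # improperly nested elements
    "<a attr=unquoted/>",  # unquoted attribute value
    "<a/><b/>",            # more than one root element
]
WF_CASES = ["<a/>", "<a b='c'>d&amp;e</a>"]


def expat_accepts(text):
    """Return True if expat parses text without a well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


def conformance_failures(accepts):
    """List every case where the given acceptor gets the verdict wrong."""
    failures = [doc for doc in NOT_WF_CASES if accepts(doc)]
    failures += [doc for doc in WF_CASES if not accepts(doc)]
    return failures


# A processor can only claim to pass if this list is empty.
print(conformance_failures(expat_accepts))
# A toy acceptor that never rejects anything fails every negative case.
print(conformance_failures(lambda doc: True))
```

The point of the negative cases is that accepting everything passes all the positive tests and none of the rejection tests, so a "100%" claim has to cover both halves.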

Sam goes on to say:

Peeking into the implementation of REXML, I see that it is riddled with regular expressions. Having a parser that doesn’t detect errors properly is one thing, but having a parser that incorrectly parses valid input is quite another. I’ve opened a ticket on one such problem.  Depending on how it is received, I may open others.

OK. Let's hope the REXML folks pay attention to Sam and get things right.

And before Python folks get all smug, such fast and loose interpretations of what "XML" means are hardly alien to the Python community. Here's a thread on the XML-SIG with a "list of packages handling XML 1.1". Any sensible person would expect these to be XML 1.1 parsers, but no, it turns out that the title is a bit of casuistry, and that at least 2 of the 4 listed packages accept ill-formed XML 1.1. It seems to me that, by that standard, pyparsing, Python's re library, Python's string methods, and any other Python software that does anything with strings should be added to such a list. The only way I could imagine such a list being redeemed is if the entries that accept ill-formed XML 1.1 at least offered warnings of ill-formedness, and could thus serve as tidy-like tools for fixing broken XML. This does not seem to be the case.
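As a rough idea of the least such a tool could do, here is a sketch that reports where a document first stops being well-formed. It uses the standard library's expat binding, which handles XML 1.0 only, so it illustrates the warning idea rather than XML 1.1 support. The function name `well_formedness_warning` is made up for this example.

```python
import xml.parsers.expat


def well_formedness_warning(text):
    """Return None for well-formed input, else a message locating the first error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        # ErrorString maps expat's numeric error code to a readable reason.
        reason = xml.parsers.expat.ErrorString(err.code)
        return f"line {err.lineno}, column {err.offset}: {reason}"
    return None


print(well_formedness_warning("<div>at&t"))
print(well_formedness_warning("<div>at&amp;t</div>"))
```

This only flags the first error and repairs nothing, so it is well short of a tidy-like tool, but it is the minimum warning a lenient package could offer.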

As I've said in the past, I don't claim that only XML parsers and processors should be used to work with XML. Heck, I use grep, wc, sed and the usual text tools all the time with XML documents. I do say that it is dishonest to call something an XML parser or processor unless you treat non-compliance as a bug. I guess it's the old social principle all over again. XML is hot, so it's voguish to be called an XML processor, yet it's all too tempting to shirk the required work.

[Uche Ogbuji]

via Copia