Today's XML WTF: UTF-8 BOM madness in Windows browsers

Earlier this week I had to add an option for 4Suite's XSLT processor to emit a UTF-8 BOM (or ZERO WIDTH NON-BREAKING SPACE as I prefer, given the annoying situation). See details of the 4Suite extension XSLT attribute below. I did so upon user request, even though this need seems to come from a case of sheer lunacy in Windows browsers, and especially MSIE. Mike Brown, Jeremy Kloth and I originally wondered why on earth anyone would need a UTF-8 BOM. We figured that by serving his files with the right HTTP Content-Type header or using meta/http-equiv he could signal UTF-8 without needing the BOM (after all, in UTF-8 there's no byte order to mark). Apparently, the problem scenario really kicks in when users have switched their browser from encoding auto-detect to one fixed encoding. In this case most of the user's clients would have it set to Russian. As Mike Brown said in the IRC conversation:

it is my understanding that in Russia as well as the Far East it is very typical to leave your browser set to ignore declared encodings and just use whatever is common for your region

The problem is that this user wants to send UTF-8, and it seems to be hard to get browsers on Windows to believe a file is UTF-8 without using a BOM.

Actually, when I researched the situation, it seems that it's merely hard in Mozilla/Windows (which does pay attention to the HTTP headers, if not HTML meta/http-equiv). With MSIE it's apparently impossible. See "On the 'charset' parameter of the 'Content-Type' header", by Anne van Kesteren. Her entry itself is tangentially interesting, but see the comments for the issue at hand, in particular comment #10 by Lachlan Hunt.

I think Zack may be correct, IE does ignore the Content-Type header in some circumstances. I set up a test case serving the document with Content-Type: text/html; charset=iso-8859-1 but also starting with a UTF-8 BOM. IE incorrectly parses the file as UTF-8, while Firefox and Opera correctly obeyed the Content-Type header. I know the test isn't exactly what he described with Urdu text detected as Turkish, but I don't know those languages, nor whether the file he was talking about was correctly encoded as UTF-8. This test does, however, show that Internet Explorer breaks the rules yet again.

This is what our user was finding as well.
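
To make the stakes of that test case concrete, here is a minimal Python sketch of my own (not Lachlan's actual test) showing the two readings of the same bytes. The sample text is a made-up stand-in for the user's Russian content.

```python
import codecs

# Hypothetical page body: a UTF-8 BOM followed by UTF-8 encoded Cyrillic text.
text = "Привет"
payload = codecs.BOM_UTF8 + text.encode("utf-8")

# A client that obeys "charset=iso-8859-1" decodes every byte as Latin-1,
# so the three BOM bytes (EF BB BF) show up as the characters "ï»¿".
as_latin1 = payload.decode("iso-8859-1")

# A client that lets the BOM win decodes as UTF-8 and drops the BOM.
as_utf8 = payload.decode("utf-8-sig")

print(repr(as_latin1[:3]))
print(as_utf8)
```

The same byte stream yields mojibake or readable text depending solely on which signal the client chooses to trust, which is why the BOM matters so much to a publisher who cannot control the reader's browser settings.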

Of course a UTF-8 BOM is never illegal, but that doesn't mean it's not supremely stupid to make such an optional marker the only way to identify UTF-8, despite the availability of multiple alternative mechanisms in standards. Sure, ultimate blame for this goes to all the browser vendors and Web designers over the years who have turned the Web into encoding soup as well as tag soup, but now that the browser standards have all sorts of character encoding markers available, users should no longer find themselves in such a quandary when dealing with Web publishers who are willing to play by the rules.

This is an XML WTF, rather than a general Web WTF, because our user was trying to produce XHTML, which means that he should have had recourse to the encoding pseudo-attribute of the XML declaration. Instead, the Web browser flouts the explicit words of the XML 1.0 spec:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
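
As a rough illustration of the rule quoted above, here is a simplified Python sketch of how a processor might pick an encoding when no transport-level information is available. The function name `guess_xml_encoding` is my own invention, the sketch only handles ASCII-compatible declarations plus the UTF-8 and UTF-16 BOMs, and it does not report the fatal error the spec calls for when BOM and declaration disagree.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_xml_encoding(data: bytes) -> str:
    """Simplified guess: BOM first, then the declaration, then UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor an encoding declaration: the spec's default applies.
    return "utf-8"

print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'))
print(guess_xml_encoding(b"<html/>"))
```

Note that nothing in this chain requires a BOM for UTF-8: a declaration, or no signal at all, is enough.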

I started out trying to make sense of the situation by warning the user it's never really a good idea to serve XHTML as text/html (regardless of compatibility guidelines: see Anne again for an example of a good argument as to why). I was amazed to find that in the case of MSIE you couldn't make things right even by using the proper application/xhtml+xml. There was nothing I could do but shut my smacked gob, and then, clenching my teeth all the while, implement the requested extension.

And coming to that, in the latest 4Suite CVS you can use XSLT such as the following:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ft="http://xmlns.4suite.org/ext"
>

<xsl:output ft:utf-bom="yes"/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>

The xsl:output/@ft:utf-bom attribute is a flag to force the BOM to be manually emitted. I decided not to do much idiot-proofing as users who tack on this option had better know what they're doing. In general, if you use this flag, you'd best ensure your output encoding is UTF-8 (or UTF-7, if anyone is still using that). The above listing, in effect, is a variant of the identity transform that tacks on the UTF-8 BOM. This extension is also available on our exsl:document implementation.
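
For anyone who needs the same effect outside of XSLT, the following Python sketch shows roughly what such a flag amounts to at the byte level. It is not 4Suite's implementation; `with_utf8_bom` is a hypothetical helper of my own, and the sample document is made up.

```python
import codecs

def with_utf8_bom(serialized: bytes) -> bytes:
    """Prepend the UTF-8 BOM (bytes EF BB BF) unless it is already present."""
    if serialized.startswith(codecs.BOM_UTF8):
        return serialized
    return codecs.BOM_UTF8 + serialized

# Hypothetical serialized output, already encoded as UTF-8.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"/>'
)
out = with_utf8_bom(doc.encode("utf-8"))

# The BOM is simply U+FEFF (ZERO WIDTH NO-BREAK SPACE) encoded as UTF-8.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)
print(out[:3])
```

Like the flag itself, this does no idiot-proofing beyond avoiding a doubled BOM: it assumes the bytes it is given really are UTF-8.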

For more on XHTML browser madness, see "Today's XML WTF: Internal entities in browsers". For more on XHTML overall, see "XHTML, step-by-step".

[Uche Ogbuji]

via Copia
5 responses
Anne is "his", not "her".
Eeek.  Sorry, Anne.  In this global village guessing someone's sex by name is hazardous.  Thanks, Jaya.



BTW, everyone thinks my 3 year old boy Jide is a girl, even after meeting him.  Doesn't faze me a bit, so maybe I get a few credits from others for my own patience?  :-)
That's weird, it's the first time I see "Anne" used for a boy. In France it's a girl's firstname only AFAIK.



Uche, as you say, guessing someone's sex from his/her firstname is a bit dangerous :)
Hey Uche,



Very good points.  One thing I think there needs to be allowance for is the fact that IE, for all intents and purposes, was the Windows browser, and it seemed this was going to be the way it was until such time as the browser morphed into a distributed application framework, with heavy emphasis on the GUI side of such a framework. 



IE, as Adam Bosworth helps make clear in his MySQL keynote, had to build upon "sloppiness" to allow for the fact that not everybody on the planet has the ability to develop a properly structured and syntactically correct code base, or to be what he calls "High Priests of Syntax".  So while the web is moving into its high school years, and the once Big Man On Campus is looking like his Mom still picks out his clothes (and probably dresses him too) while the once "nobodies" are beginning to blossom into something pretty spectacular, we shouldn't forget that one of the primary design flaws that exists in IE today is that of allowing for "sloppiness" and as such not conforming to the current standards. If MS had forced the web into a specific way of doing things then, suffice it to say, the web wouldn't be where it is today.



Now, with that said, I can understand and agree with the notion that this would mean Netscape would still be king, and that wouldn't be such a bad thing. I would assume that it would then be Netscape who was being beaten to death by the standards advocates who, primarily, in one form or another are the direct competition who have found IE's Achilles heel and are beating their war drums to make sure everyone knows what that Achilles heel is.



On the flip side, what Microsoft was often criticized for in the past, that of making browser specific extensions such as XMLHTTP, the standards groups have latched onto, suggesting "hey look, we are building a base of standards that everyone can use cross platform."  So what they once used as a way to criticize MS, they are now using as a way to boast their own "community" focused effort to develop a base of "standards" that can be used on any browser, using the same code base. 



It seems that there is always a way to twist things into one's favor.



A year ago, Firefox was beginning to get some HUGE notice.  As a result, MS woke up and realized "we obviously got some catching up to do" which is ironic given they had originally set the stage and then moved on to what they assumed was the bigger and better thing... direct integration of the web into the application API such that the browser was now a web-based application framework.  I think that's what we all want it to eventually get to.  We're just going about it by first backtracking so that we can then realize why things like XForms and SVG, then XUL, and XAML, etc... were developed in the first place...



It's good though, as it gives us time to really pressure cook things.  And I think it has also helped in progressing the KISS-based notion that simple is always better.  Complexity (like forcing a very young web at the time to become high priests of syntax) tends to leave 99% of the world behind, allowing only a select few to truly understand how things work.  If the internet didn't exist, that would probably be exactly what was still taking place in the tech world, as when you live on an island, you tend to do things differently than someone else on a completely different island.  But when those islands become one big mass of land, the idea of "island living" just doesn't work anymore.



Back to the BOM... that needs to be fixed. Spot on.  Let's see how MS responds.  Let's hope they respond well :)



Enjoy your day!
I expect IE7 will be a bit less stubborn about the UTF-8 BOM.  Indications on that project give hope for better standards compliance.  Mozilla has a bit of work to do as well.  We'll see.