Earlier this week I had to add an option for 4Suite's XSLT processor to
emit a UTF-8 BOM (or ZERO WIDTH NO-BREAK SPACE, as I prefer, given the
annoying situation). See details of the 4Suite extension XSLT attribute
below. I did so upon user request, even though this need seems to come
from a case of sheer lunacy in Windows browsers, and especially MSIE.
Mike Brown, Jeremy Kloth and I originally wondered why on earth anyone
would need a UTF-8 BOM. We figured that by serving his files with the
right HTTP Content-Type header or using meta/http-equiv he could signal
UTF-8 without needing the BOM (after all, there's no byte order to mark).
Apparently, the problem really kicks in when users have pinned their
browser's encoding auto-detect setting to one particular language or
encoding. In this case most of our user's clients would have it set to
Russian. As Mike Brown said in the IRC conversation:
it is my understanding that in Russia as well as the Far East it is very typical to leave your browser set to ignore declared encodings and just use whatever is common for your region
The problem is that this user wants to send UTF-8, and it seems to be hard to get browsers on Windows to believe a file is UTF-8 without using a BOM.
Actually, when I researched the situation, it seems that it's merely
hard in Mozilla/Windows (which does pay attention to the HTTP headers,
if not HTML meta/http-equiv). With MSIE it's apparently impossible.
See "On the 'charset' parameter of the 'Content-Type' header", by Anne
van Kesteren. His entry itself is tangentially interesting, but see the
comments for the issue at hand, in particular comment #10 by Lachlan
Hunt.
I think Zack may be correct, IE does ignore the Content-Type header in some circumstances. I set up a test case serving the document with Content-Type: text/html; charset=iso-8859-1 but also starting with a UTF-8 BOM. IE incorrectly parses the file as UTF-8, while Firefox and Opera correctly obeyed the Content-Type header. I know the test isn't exactly what he described with Urdu text detected as Turkish, but I don't know those languages, nor whether the file he was talking about was correctly encoded as UTF-8. This test does, however, show that Internet Explorer breaks the rules yet again.
This is what our user was finding as well.
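For anyone who wants to reproduce a test case along the lines Lachlan describes, here is a minimal sketch in Python. It is not his actual setup: the page content, the port and the names build_response, ConflictHandler and serve are all made up for illustration. It serves a UTF-8 document that starts with a BOM while the HTTP header claims iso-8859-1, so you can see which signal a given browser believes.

```python
import codecs
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content; any text with a non-ASCII character will do.
PAGE = "<html><body><p>caf\u00e9</p></body></html>"


def build_response():
    """Return (headers, body) carrying deliberately conflicting signals."""
    # The body is UTF-8 and starts with the UTF-8 BOM (bytes EF BB BF)...
    body = codecs.BOM_UTF8 + PAGE.encode("utf-8")
    # ...while the HTTP header claims a different encoding.
    headers = {
        "Content-Type": "text/html; charset=iso-8859-1",
        "Content-Length": str(len(body)),
    }
    return headers, body


class ConflictHandler(BaseHTTPRequestHandler):
    """Answer every GET with the conflicting test document."""

    def do_GET(self):
        headers, body = build_response()
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the test server until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), ConflictHandler).serve_forever()
```

Call serve() and point each browser at http://127.0.0.1:8000/. A browser that obeys the header will mangle the accented character; one that trusts the BOM will not.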
Of course a UTF-8 BOM is never illegal, but that doesn't mean it's not supremely stupid to make such an optional marker the only way to identify UTF-8, despite the availability of multiple alternative mechanisms in standards. Sure, ultimate blame for this goes to all the browser vendors and Web designers over the years who have turned the Web into encoding soup as well as tag soup, but now that the browser standards have all sorts of character encoding markers available, users should no longer find themselves in such a quandary when dealing with Web publishers who are willing to play by the rules.
This is an XML WTF, rather than a general Web WTF, because our user was trying to produce XHTML, which means that he should have had recourse to the XML declaration's encoding pseudo-attribute. Instead, he ran into Web browsers that flout the explicit words of the XML 1.0 spec:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
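To make the rule in that passage concrete, here is a rough sketch in Python of how a consumer might pick the encoding of an XML entity when the transport says nothing. It is a simplification and not any browser's or parser's real algorithm: the function name xml_encoding is made up, it only recognizes UTF-8 and UTF-16 BOMs, and it only finds an encoding declaration in ASCII-compatible input (the full autodetection described in the spec's appendix covers more cases).

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity with no external encoding info."""
    # 1. A Byte Order Mark, if present, identifies the encoding family.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Otherwise an encoding declaration names the encoding.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. With neither a BOM nor a declaration, the entity must be UTF-8.
    return "utf-8"
```

The point of step 3 is that the BOM is only one of several valid signals. A document with a correct encoding declaration, or plain UTF-8 with no marker at all, already says everything a conforming processor needs.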
I started out trying to make sense of the situation by warning the user
that it's never really a good idea to serve XHTML as text/html
(regardless of compatibility guidelines: see Anne again for an example
of a good argument as to why). I was amazed to find that in the case of
MSIE you couldn't make things right even by using the proper
application/xhtml+xml. There was nothing I could do but shut my smacked
gob, and then, clenching my teeth all the while, implement the requested
extension.
Coming to that: in the latest 4Suite CVS you can use XSLT such as the following:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:ft="http://xmlns.4suite.org/ext"
    >
      <xsl:output ft:utf-bom="yes"/>
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>

The xsl:output/@ft:utf-bom attribute is a flag to force the BOM to be
manually emitted. I decided not to do much idiot-proofing as users who
tack on this option had better know what they're doing. In general, if
you use this flag, you'd best ensure your output encoding is UTF-8 (or
UTF-7, if anyone is still using that). The above listing, in effect, is
a variant of the identity transform that tacks on the UTF-8 BOM. This
extension is also available on our exsl:document implementation.
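If you are not using 4Suite, the net effect of the flag is easy to approximate after the fact. The following Python sketch is not 4Suite's implementation; the helper name with_utf8_bom and the sample markup are made up for illustration. It simply prepends the three BOM bytes to output that has already been serialized as UTF-8.

```python
import codecs


def with_utf8_bom(serialized: bytes) -> bytes:
    """Prepend the UTF-8 BOM to UTF-8 output unless it is already there."""
    if serialized.startswith(codecs.BOM_UTF8):
        return serialized
    return codecs.BOM_UTF8 + serialized


# Hypothetical transform result, already serialized as UTF-8.
result = "<p>Привет</p>".encode("utf-8")
marked = with_utf8_bom(result)

# The "BOM" is just U+FEFF encoded as UTF-8: bytes EF BB BF.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8
```

Python's built-in "utf-8-sig" codec does the same job in one step: encoding with it writes the BOM, and decoding with it strips the BOM if present. As with the 4Suite flag, this only makes sense when the bytes really are UTF-8.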
For more on XHTML browser madness, see "Today's XML WTF: Internal entites in browsers". For more on XHTML overall, see "XHTML, step-by-step".