Today's XML WTF: Internal entities in browsers

This unnecessary screw-up comes from the Mozilla project, of all places. Mozilla's XML support is improving all the time, as I discuss in my article on XML in Firefox, but the developer resources seem to lag the implementation, and this often leads to needless confusion. One that I ran into recently could perhaps be given the summary: "not everything in the Mozilla FAQ is accurate". From the Mozilla FAQ:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support entities (other than the five pre-defined ones), the use of entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.)

See the claim that non-validating XML processors are not required to support entities other than the five pre-defined ones. Someone either didn't read the spec, or is intentionally throwing up a spec distortion field. The XML 1.0 spec provides a table in section 4.4, "XML Processor Treatment of Entities and References", which tells you how parsers are allowed to treat entities, and it flatly contradicts the bogus Mozilla FAQ statement above.

The main reason for the "WTF" is the fact that the Mozilla implementation actually gets it right. That it should. It's based on Expat. AFAIK Expat has always got this right (I've been using Expat about as long as the Mozilla project has been), so I'm not sure what inspired the above error. Mozilla should be touting its correct and useful behavior, rather than giving bogus excuses to its competitors.
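To see the Expat behavior outside a browser, here is a minimal sketch using Python's xml.etree.ElementTree, which is built on Expat. It is not what Mozilla runs; it only illustrates that a non-validating, Expat-based parser expands an entity declared in the internal subset, in both element content and attribute values, without ever fetching the external DTD. The helper name entity_expansions is made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Same shape as the browser test document: an external DTD reference
# plus one entity declared in the internal subset.
DOC = b'''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd" [
<!ENTITY internal "This is text placed as internal entity">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
    <title>Using Entity in xhtml</title>
  </head>
  <body>
    <p>This is text placed inline</p>
    <p>&internal;</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>
'''


def entity_expansions(document):
    """Parse the document and return (paragraph texts, abbr title).

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(document)
    paragraphs = [p.text for p in root.iter(XHTML_NS + "p")]
    abbr = root.find(".//" + XHTML_NS + "abbr")
    return paragraphs, abbr.get("title")


paragraphs, title = entity_expansions(DOC)
print(paragraphs)  # text of both paragraphs after entity expansion
print(title)       # attribute value after entity expansion
```

The parser never touches the network here; the entity definition comes entirely from the internal subset.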

This came up last week in the IBM developerWorks forum where a user was having problems with internal entities in XHTML. It turns out that he was missing an XHTML namespace (and based on my experimentation was probably serving up XHTML as text/html which is generally a no-no). It should have been a clear case of "Mozilla gets this right, and can we please get other browsers to fix their bugs?" but he found that FAQ entry and we both ended up victims of the red herring for a little while.

I didn't realize that the Mozilla implementation was right until I wrote a careful test case in preparation for my next Firefox/XML article. The following CherryPy code is a test server set-up for browser rendering of XHTML.

import cherrypy

INTENTITYXHTML = '''\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd" [
<!ENTITY internal "This is text placed as internal entity">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
    <title>Using Entity in xhtml</title>
  </head>
  <body>
    <p>This is text placed inline</p>
    <p>&internal;</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>
'''

class root:
    @cherrypy.expose
    def text_html(self):
        cherrypy.response.headerMap['Content-Type'] = "text/html; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def text_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "text/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xhtml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xhtml+xml; charset=utf-8"
        return INTENTITYXHTML

cherrypy.root = root()
cherrypy.config.update({'server.socketPort': 9999})
cherrypy.config.update({'logDebugInfoFilter.on': False})
cherrypy.server.start()

As an example, this code serves up the content type text/html when accessed through a URL such as http://localhost:9999/text_html. You should be able to work out the other URL-to-content-type mappings from the code, even if you're not familiar with CherryPy or Python.
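If you don't have a matching CherryPy version handy, roughly the same set-up can be sketched with the Python 3 standard library alone. This is an illustrative equivalent, not the code used for the results reported here; the names CONTENT_TYPES, EntityTestHandler and make_server are invented for the sketch.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

XHTML = b'''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd" [
<!ENTITY internal "This is text placed as internal entity">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
    <title>Using Entity in xhtml</title>
  </head>
  <body>
    <p>This is text placed inline</p>
    <p>&internal;</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>
'''

# URL path -> Content-Type header, mirroring the four CherryPy methods.
CONTENT_TYPES = {
    "/text_html": "text/html; charset=utf-8",
    "/text_xml": "text/xml; charset=utf-8",
    "/app_xml": "application/xml; charset=utf-8",
    "/app_xhtml": "application/xhtml+xml; charset=utf-8",
}


class EntityTestHandler(BaseHTTPRequestHandler):
    """Serve the same document under a different media type per path."""

    def do_GET(self):
        content_type = CONTENT_TYPES.get(self.path)
        if content_type is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(XHTML)))
        self.end_headers()
        self.wfile.write(XHTML)


def make_server(port=9999):
    """Build (but do not start) the test server on localhost."""
    return HTTPServer(("localhost", port), EntityTestHandler)
```

Calling make_server().serve_forever() starts it on port 9999, after which the same four URLs apply.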

Firefox 1.0.7 handles all this very nicely. For text_xml, app_xml and app_xhtml you get just the XHTML rendering you'd expect, including the correct text in the attribute value with the mouse hovered over "Titpaie".

IE6 (Windows) and Safari 1.3.1 (OS X Panther) both have a lot of trouble with this.

IE6 in the text_xml and app_xml cases complains that it can't find http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd. In the app_xhtml case it treats the page as a download, which is reasonable, if not convenient.

Safari in the text_xml, app_xml and app_xhtml cases complains that the entity internal is undefined (??!!).

IE6, Safari and Mozilla in the text_html case all show the same output (looking, as it should, like busted HTML). That's just what you'd expect for a tag soup mode, and emphasizes that you should leave text_html out of your XHTML vocabulary.

All this confusion and implementation difference illustrates the difficulty for folks trying to deploy XHTML, and why it's probably not yet realistic to deploy XHTML without some sort of browser sniffing (perhaps by checking the Accept header, though it's well known that browsers are sometimes dishonest with this header). I understand that the MSIE7 team hopes to address such problems. I don't know whether to expect the same from Safari. My focus in research and experimentation has been on Firefox.
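For what it's worth, the Accept header check might look something like the sketch below. It is only an illustration under simple assumptions: the function names are invented, wildcards such as */* are deliberately not treated as support for XHTML, and the text/html fallback is assumed to be a genuine HTML rendition of the page rather than the XHTML source. Given the caveat about dishonest headers, treat the result as a hint.

```python
def accepts_xhtml(accept_header):
    """Return True if application/xhtml+xml is listed explicitly with q > 0.

    Hypothetical helper; wildcard entries are ignored on purpose.
    """
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        quality = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unparsable q: assume not acceptable
        return quality > 0
    return False


def choose_content_type(accept_header):
    """Pick the media type to serve, falling back to text/html."""
    if accepts_xhtml(accept_header or ""):
        return "application/xhtml+xml; charset=utf-8"
    return "text/html; charset=utf-8"


# An Accept header that lists the XHTML media type explicitly.
print(choose_content_type(
    "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,*/*;q=0.5"))
# An Accept header that offers only a wildcard.
print(choose_content_type("*/*"))
```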

One final note is that Mozilla does not support external parsed entities. This is legal (and some security experts claim even prudent). The relevant part of the XML 1.0 spec is section 4.4.3:

When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.

I would love Mozilla to adopt the idea in the next spec paragraph:

Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.

That would be very useful. I wonder whether it would be possible through a Firefox plug-in (probably not: I guess it would require very tight Expat integration for plug-ins).
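The "recognized, but did not read" behavior from section 4.4.3 is easy to see with Expat directly. The sketch below uses Python's xml.parsers.expat binding and is in no way Mozilla's code: the application is told about the external parsed entity through ExternalEntityRefHandler, notes its system identifier, and carries on without fetching anything. The file name chapter1.xml and the function parse_without_fetching are made up for the example.

```python
import xml.parsers.expat

# "chapter1.xml" is a hypothetical external parsed entity; it is never read.
DOC = b'''<?xml version="1.0"?>
<!DOCTYPE doc [
<!ENTITY chapter SYSTEM "chapter1.xml">
]>
<doc>before &chapter; after</doc>
'''


def parse_without_fetching(data):
    """Parse data, recording external entity references instead of loading them.

    Returns (list of (system_id, public_id), concatenated character data).
    """
    skipped = []
    text = []
    parser = xml.parsers.expat.ParserCreate()

    def on_external_entity(context, base, system_id, public_id):
        # Expat has recognized the reference; we record it and decline
        # to retrieve the entity. A nonzero return lets parsing continue.
        skipped.append((system_id, public_id))
        return 1

    parser.ExternalEntityRefHandler = on_external_entity
    parser.CharacterDataHandler = text.append
    parser.Parse(data, True)
    return skipped, "".join(text)


skipped, text = parse_without_fetching(DOC)
print(skipped)  # external entities that were seen but not read
print(text)     # character data around the skipped reference
```

A browser could use the recorded identifiers to draw a placeholder and fetch the entity only when the user asks for it.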

[Uche Ogbuji]

via Copia
1 response
I am the maintainer of the FAQ. Did you email me about this or file a bug? Anyway, I have now refined the statement to narrow it to externally defined character entities. The vast majority of authors care only about entities that expand to a single character and do not care about entities in general as meant in the XML spec. Also, the XML spec allows non-validating XML processors not to process external entities, so the entity definitions may never be seen.



Now here's the bogosity: In order to support localization of local XUL files via entity references and in order to support pseudoversions of a few well-known XHTML and MathML DTDs, Mozilla runs Expat in the external entity-resolving mode unless the document is declared standalone. If the document references an external entity that is neither a local localization file nor a DTD for which there is a mapping in the pseudo-DTD catalog, the entity resolver feeds Expat a zero-length stream as the DTD. Now Expat thinks it has seen all the external entities and, therefore, all the entity definitions. When Expat hits a reference to an unknown entity, it reports a fatal error instead of informing the app about a skipped entity.



Anyway, all this can be avoided by staying away from entity references (by parsing and reserializing on the server if necessary).