"Tip: Use the Unicode database to find characters for XML documents"

The Unicode consortium is dedicated to maintaining a character set that allows computers to deal with the vast array of human writing systems. When you think of computers that manage such a large and complex data set, you think databases, and this is precisely what the consortium provides for computer access to versions of the Unicode standard. The Unicode Character Database comprises files that present detailed information for each character and class of character. The strong tie between XML and Unicode means this database is very valuable to XML developers and authors. In this article Uche Ogbuji introduces the Unicode Character Database and shows how XML developers can put it to use.

The summary says it all, really.

[Uche Ogbuji]

via Copia

Confusion over Python storage form for Unicode

I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build will store characters as either UCS-2 or UCS-4. Option one is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which would be required if it were UTF-16. Option two is UCS-4, not UTF-32, although the difference in this case truly is academic and probably would only affect people using certain chunks of Private Use Areas.

You can't neatly categorize Python's Unicode storage format by platform, either. True, Jython is presently limited to UTF-16 storage, but you can compile CPython to use either UCS-2 or UCS-4 on any platform. To do so, pass the option --enable-unicode=ucs4 to CPython's configure script. To check whether your Python is a UCS-4 build, check that `sys.maxunicode > 0xFFFF`.

I would love to say that you don't have to worry whether your Python uses UCS-2 or UCS-4. If you're communicating between Python tools you should be using abstract Unicode objects, which would be seamlessly portable. The problem is that, as I warn at every opportunity, there are serious problems with how Python core libraries handle certain characters in UCS-2 builds, because of the lack of respect for surrogate pairs. It is for this reason that I advise CPython users to always use UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32) is almost always a waste of space, but wasting space is better than munging characters.
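Here is that check as a small sketch (the function name is my own; note that from Python 3.3 on, every CPython build reports the full range, so this mainly matters on older interpreters):

```python
import sys

def unicode_build_width():
    # A UCS-4 ("wide") build reports sys.maxunicode as 0x10FFFF (1114111);
    # a UCS-2 ("narrow") build reports 0xFFFF (65535).
    return "UCS-4" if sys.maxunicode > 0xFFFF else "UCS-2"

print(unicode_build_width())
```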

For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.

Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode, and of its Python implementation, so I bet someone will jump in with some correction, but I think I have the core, practical details right, whereas I think Henri's characterization was confusing.

[Uche Ogbuji]

via Copia

Ouch. I feel your pain, Sam

Seems Sam Ruby's presentation suffered a bit in the spectacle. Unfortunately, I'm no stranger to presentation set-up problems. I've also been lucky enough to have patient audiences. Maybe conference organizers will someday factor Linux A/V support into consideration when choosing venues (I can dream, eh?). I almost always can use projectors and stuff with no problem in usual business scenarios, and I can only guess that conference venues tend to have archaic A/V technology that doesn't like Linux.

As for the presentation itself, based on the slides much of it is an accumulation of issues probably well known to, say, a long-time XML-DEV reader, but useful to collect in one place. It looks like a much-needed presentation, and I hope Sam gets to present it again, with better luck with the facilities. Here follow a few reactions I had to stuff in the slides.

expat only understands utf-8

This hasn't been true for ages. Expat currently understands UTF-8, UTF-16, US-ASCII and ISO-8859-1 out of the box, and the user can add to this list by registering an "unknown encoding" handler.
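For example, through Python's pyexpat binding, Expat parses an ISO-8859-1 document with no help at all (a minimal sketch; the document bytes are my own illustration):

```python
import xml.parsers.expat

# An ISO-8859-1 document: byte 0xE9 is "é" in that encoding
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\xe9</word>'

chars = []
p = xml.parsers.expat.ParserCreate()
p.CharacterDataHandler = chars.append  # handler receives decoded Unicode text
p.Parse(doc, True)

print("".join(chars))  # the character data, correctly decoded
```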

Encoding was routinely ignored by most of the initial RSS parsers and even the initial UserLand RSS validator. “Aggregators” did the equivalent of strcat from various sources and left the results for the browser

Yuck. Unfortunately, I worry that Mark Pilgrim's Universal Feed Parser might not help the situation with its current practice of returning some character data as strings without even guessed encoding information (that I could find, anyway). I found it very hard to build a character-correct aggregator around the Feed Parser 4.0 alpha version. Then again, I understand it's a hard problem with all the character soup ("char soup"?) Web feeds out there.

[Buried] in a non-normative appendix, there is an indication that the encoding specified in an XML document may not be authoritative.

Nope. There is no burial going on. As I thought I'd pointed out on Copia before (but I can't find the entry now), section "4.3.3 Character Encoding in Entities" of XML 1.0 says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

So the normative part of the spec also makes it quite clear that an externally specified encoding can trump what's in the XML or text declaration.

The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.

Very apt. I think that's why XML's attributes work as well as they do (despite the fact that they are so inexplicably maligned in some quarters).

In fact, Microsoft’s new policy is that they will always completely ignore [HTTP Content-Type] for feeds—even when the charset is explicitly present

XML of course doesn't force anyone to conform to RFC 3023, but Microsoft could prove itself a really good Web citizen by adopting it. Maybe they could lead the way to reducing the confusion I mention in this entry.

I consider Ruby's section on the WS-* mess an excellent indictment of the silly idea of universal and strong data typing.

In general, programming XML is hard.

Indeed it is. Some people seem to think this is a product of architecture astronautics. They are laughably mistaken. XML is hard because managing data is hard. Programmers have developed terrible habits through long years of just throwing their data over the wall at a SQL DBMS and hoping all goes OK in the end. The Web is ruthless in punishing such diffidence.

XML is the first technology that has forced mainstream programmers to truly have to think hard about data. This is a boundlessly good thing. Let the annoyances proliferate (that's your cue, Micah).

[Uche Ogbuji]

via Copia


Recently there has been a spate of kvetching about Unicode, and in some cases Python's Unicode implementation. Some people feel that it's all too complex, or too arbitrary. This just boggles my mind. Have people chosen to forget how many billions of people there are on the planet? How many languages, cultures and writing systems are represented in this number? What the heck do you expect besides complex and arbitrary? Or do people think it would be better to exclude huge sections of the populace from information technology just to make their programmer lives that marginal bit easier? Or maybe people think it would be easier to deal separately with each of the hundreds of local text encoding schemes that were incorporated into Unicode? What the hell?

Listen. Unicode is hard. Unicode will melt your brain at times, it's the worst system out there, except for all the alternatives. (Yeah, yeah, the old chestnut). But if you want to produce software that is ready for a global user base (actually, you often can't escape the need for character i18n even if you're just writing for your hamlet), you have little choice but to learn it, and you'll be a much better developer once you've learned it properly. Read it from the well-regarded Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". And remember, as Joel says, "There Ain't No Such Thing As Plain Text".

As for Python, that language has an excellent Unicode implementation. A group of brilliant contributors tackled a barge load of pitfalls and complexities involved in shunting full Unicode support into the workings of a language that was already in heavy use. I think they nailed most of the compromises, and again a lot of the things that people rail at as arbitrary and obstructing are so precisely because other (in some cases superficially attractive) alternatives are even worse. Almost all languages that support Unicode have snags of their own (see above: Unicode is hard). It's instructive to read Norbert Lindenberg's valediction in World Views. Norbert is the technical lead for Java internationalization at Sun and in this entry he summarizes the painful process of i18n in Java, including Unicode support. Considering the vast resources available for Java development you have to give Python's developers a lot of credit for doing so well with so few resources.

The most important thing for Pythoneers to remember is that if you go through the discipline of using Unicode properly in all your APIs, as I urge over and over again in my Python XML column, you will not even notice most of these snags. My biggest gripe with Python's implementation, the confusion between code points and storage units in several of the "string" APIs, is addressed in practice (through theoretical fudge, to be precise) by always using Python compiled for UCS-4 character storage, as I discussed at the beginning of "More Unicode Secrets". See that article's sidebar for more background on this issue.
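To see that gripe concretely, take a character outside the Basic Multilingual Plane (the character choice here is my own). On a UCS-2 build, the "string" APIs count its two surrogate storage units; on a UCS-4 build they count the one code point:

```python
import sys

s = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, a character beyond the BMP

if sys.maxunicode > 0xFFFF:
    # UCS-4 build: one code point, one storage unit
    print(len(s))  # 1
else:
    # UCS-2 build: stored as a surrogate pair, so len() reports 2
    print(len(s))  # 2
```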

Because Python currently has both legacy strings and Unicode objects and because the APIs for these overlap, there is some room for confusion (again, there was little practical choice considering reasonable requirements for backwards compatibility). The solution is discipline. We long ago went through and adopted a coherent Unicode policy for 4Suite's core libraries. It was painful, but eliminated a great number of issues. Martijn Faassen mentions similar experience in one of his projects.

It is worth a word on the most common gripe: ASCII for the default site encoding rather than UTF-8. Marc-Andre has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python. I recommend it overall, but in particular see pages 23 through 26 for discussion of this trade-off. One thing that would be nice is if print could be smarter about the output encoding. Right now, if you're trying to do:

print <unicode>, <int>, <unicode>, <string>

Then to be safe you have to do:

print <unicode>.encode("utf-8"), <int>, <unicode>.encode("utf-8"), <string>

or:

out = codecs.getwriter(mylocale_encoding)(sys.stdout)
print >> out, <unicode>, <int>, <unicode>, <string>
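Here is the writer-wrapping variation as a runnable sketch (in modern print-function syntax, with an in-memory byte buffer standing in for sys.stdout, and "utf-8" standing in for the locale encoding):

```python
import codecs
import io

# Wrap a byte stream so that Unicode text printed to it is encoded on the way out
buf = io.BytesIO()
out = codecs.getwriter("utf-8")(buf)

print("caf\u00e9", 42, "r\u00e9sum\u00e9", file=out)
print(buf.getvalue())  # the UTF-8 encoded bytes
```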

or one of the other variations on this theme. The thing is that in Python, using an encoding assumed from the locale would be a bit of an all-or-nothing affair, which means that the most straightforward solution to this need would unravel all the important compromises of Python's Unicode implementation, and cause portability problems. Just to be clear on what this specious solution is, here is the relevant quote from the Unicode PEP, PEP 100:

Note that the default site.py startup module contains disabled optional code which can set the default encoding according to the encoding defined by the current locale. The locale module is used to extract the encoding from the locale default settings defined by the OS environment (see locale.py). If the encoding cannot be determined, is unknown or unsupported, the code defaults to setting the default encoding to 'ascii'. To enable this code, edit the site.py file or place the appropriate code into the sitecustomize.py module of your Python installation.

Don't do this. I admit the wart, but I don't see it as fundamental. I just see it as a gap in the API. If we had a cleaner way of writing "print-according-to-locale", I think we could close that docket.

But the bottom line is that glibly saying that "Unicode sucks" or that "Python's Unicode sucks" because of these inevitable pitfalls is understandable as vented frustration when tackling programming problems (heck, integers suck, pixels suck, binary logic sucks, the Internet sucks, and so on: you get the point). But it's important for the rant readers not to get the wrong impression. Pay attention for just a minute:

  • Unicode solves a complex and hard problem, trying to make things as simple as possible, but no simpler
  • Python's Unicode implementation reflects the necessary complexity of Unicode, with the addition of compatibility and portability concerns
  • You can avoid almost all the common pitfalls of Python and Unicode by applying a discipline consisting of a handful of rules
  • If you insist on the quickest hack out of a Unicode-related problem, you're probably asking for a dozen other problems to eventually take its place

My recent articles on Unicode include a modest XML bent, but there's more than enough on plain old Python and Unicode for me to recommend them:

At the bottom of the first two are resources and references that cover just about everything you ever wanted or needed to know about Unicode in Python.

[Uche Ogbuji]

via Copia

Python/XML column #35: EaseXML and more on Unicode

"EaseXML: A Python Data-Binding Tool"

In this month's Python and XML column, Uche Ogbuji examines a new XML data-binding tool for Python: EaseXML. [Jul. 27, 2005]

The main focus of this article is EaseXML, another option for XML data binding. I found the package rather rough around the edges. I also included a section with a bit more on Unicode, which was the topic of the last two articles "Unicode Secrets" and "More Unicode Secrets". This time I introduced the unicodedata module, which provides useful information about characters from the Unicode standard database.
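A taste of what unicodedata offers (a sketch; these particular lookups are my own picks):

```python
import unicodedata

# Map characters to their standard Unicode database names, and back
print(unicodedata.name('\u012b'))                              # LATIN SMALL LETTER I WITH MACRON
print(unicodedata.lookup('LATIN SMALL LETTER E WITH MACRON'))  # ē

# General category codes: 'Ll' means lowercase letter
print(unicodedata.category('\u0113'))
```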

[Uche Ogbuji]

via Copia

Python/XML column #34 pubbed

"More Unicode Secrets"

In this month's Python and XML column, Uche Ogbuji continues his discussion of Unicode secrets with regard to XML processing in Python, especially BOMs and stream objects. [Jun. 15, 2005]

In the previous article I discussed Unicode compliance in XML tools, and the Python API for converting strings to Unicode objects, and vice versa. In this one I focus on file and stream APIs, including a bit of Byte Order Mark (BOM) 101.

[Uche Ogbuji]

via Copia

Python/XML column #33 pubbed

"Unicode Secrets"

In his latest Python-XML column, Uche Ogbuji delves broadly and deeply into the world of Unicode, especially with regard to processing XML in Python.

In this one I started out talking about a quick spot check for Unicode compliance in XML tools, then went on to present some tips on Python's Unicode API. The intent was not to be comprehensive. I cherry-picked the particular Unicode facilities I tend to use the most. As one person mentioned in the comments, there are even more means at your disposal than I cover. I'll get to some of them in part 2, in the next column installment.

[Uche Ogbuji]

via Copia

To die for every day

The other day a friend who happened to check out Copia told me "I really liked that quote-to-die bit". It took me a moment to realize she really meant "Quotidie". Her pronunciation had never even occurred to me (I guess I lack imagination).

"Quotidie" is, of course, Latin for "daily". There is meant to be a pun on the fact that a quote heads each entry ("Quote-a-day"), but this is a bit incidental. I certainly pronounce the o as in "ought", the "tid" as in "teed off", and the ending almost to rhyme with "dee-jay" (much less emphasis on the "dee"). I'm too lazy to bust out the proper IPA for it.

Anyway, in hopes that it will prevent any misunderstanding, I'll use the syllable length markers in the title from now on, with macrons over the first i and the e ("Quotīdiē").

I guess as Michael Kaplan would say:

This post brought to you by the letters "ī" (U+012B, a.k.a. LATIN SMALL LETTER I WITH MACRON) and "ē" (U+0113, a.k.a. LATIN SMALL LETTER E WITH MACRON)

[Uche Ogbuji]

via Copia