Recently there has been a spate of kvetching about Unicode, and in some cases Python's Unicode implementation. Some people feel that it's all too complex, or too arbitrary. This just boggles my mind. Have people chosen to forget how many billions of people there are on the planet? How many languages, cultures and writing systems are represented in this number? What the heck do you expect besides complex and arbitrary? Or do people think it would be better to exclude huge sections of the populace from information technology just to make their programmer lives that marginal bit easier? Or maybe people think it would be easier to deal separately with each of the hundreds of local text encoding schemes that were incorporated into Unicode? What the hell?
Listen. Unicode is hard. Unicode will melt your brain at times. It's the worst system out there, except for all the alternatives. (Yeah, yeah, the old chestnut.) But if you want to produce software that is ready for a global user base (actually, you often can't escape the need for character i18n even if you're just writing for your hamlet), you have little choice but to learn it, and you'll be a much better developer once you've learned it properly. Read it from the well-regarded Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". And remember, as Joel says, "There Ain't No Such Thing As Plain Text".
As for Python, that language has an excellent Unicode implementation. A group of brilliant contributors tackled a barge load of pitfalls and complexities involved in shunting full Unicode support into the workings of a language that was already in heavy use. I think they nailed most of the compromises, and again a lot of the things that people rail at as arbitrary and obstructing are so precisely because other (in some cases superficially attractive) alternatives are even worse. Almost all languages that support Unicode have snags of their own (see above: Unicode is hard). It's instructive to read Norbert Lindenberg's valediction in World Views. Norbert is the technical lead for Java internationalization at Sun and in this entry he summarizes the painful process of i18n in Java, including Unicode support. Considering the vast resources available for Java development you have to give Python's developers a lot of credit for doing so well with so few resources.
The most important thing for Pythoneers to remember is that if you go through the discipline of using Unicode properly in all your APIs, as I urge over and over again in my Python XML column, you will not even notice most of these snags. My biggest gripe with Python's implementation is the confusion between code points and storage units in several of the "string" APIs. In practice you can address this (through theoretical fudge, to be precise) by always using Python compiled for UCS4 character storage, as I discussed at the beginning of "More Unicode Secrets". See that article's sidebar for more background on this issue.
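To make the distinction between code points and storage units concrete, here is a minimal sketch. It does not depend on how your interpreter was compiled, because it counts storage units by encoding explicitly. The helper names are mine and purely illustrative; they are not part of any Python API.

```python
# Illustrative helpers (hypothetical names, not a standard API).

def utf16_units(text):
    """Number of 16-bit storage units needed to hold text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def code_points(text):
    """Number of Unicode code points in text."""
    return len(text.encode("utf-32-le")) // 4

clef = u"\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
e_acute = u"\u00e9"    # LATIN SMALL LETTER E WITH ACUTE, inside the BMP

# The clef is one code point but needs two 16-bit units (a surrogate pair);
# the accented e is one code point and one unit.
print(code_points(clef), utf16_units(clef))
print(code_points(e_acute), utf16_units(e_acute))
```

On a build that stores characters in 16-bit units, APIs such as len() count the units rather than the code points, which is exactly the confusion described above.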
Because Python currently has both legacy strings and Unicode objects and because the APIs for these overlap, there is some room for confusion (again, there was little practical choice considering reasonable requirements for backwards compatibility). The solution is discipline. We long ago went through and adopted a coherent Unicode policy for 4Suite's core libraries. It was painful, but eliminated a great number of issues. Martijn Faassen mentions similar experience in one of his projects.
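As a rough sketch of what such a policy looks like in code: decode bytes to Unicode at the boundary where data comes in, pass only Unicode objects around inside, and encode once on the way out. The function names and the choice of UTF-8 are assumptions for illustration, not 4Suite's actual API.

```python
def from_wire(raw, encoding="utf-8"):
    """Input boundary: turn encoded bytes into a Unicode object."""
    return raw.decode(encoding)

def shout(text):
    """Core logic: accepts and returns Unicode only, never encoded bytes."""
    return text.upper() + u"!"

def to_wire(text, encoding="utf-8"):
    """Output boundary: encode exactly once, as late as possible."""
    return text.encode(encoding)

raw = b"caf\xc3\xa9"             # UTF-8 bytes for "cafe" with an acute accent
text = from_wire(raw)            # four characters, decoded from five bytes
result = to_wire(shout(text))    # back to bytes only at the very end
```

Because shout() never sees encoded bytes, it cannot accidentally mix legacy strings and Unicode objects, which is where most of the confusion creeps in.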
It is worth a word on the most common gripe: ASCII for the default site encoding rather than UTF-8. Marc-Andre has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python. I recommend it overall, but in particular see pages 23 through 26 for discussion of this trade-off. One thing that would be nice is if print could be smarter about the output encoding. Right now, if you're trying to do:
print <unicode>, <int>, <unicode>, <string>
Then to be safe you have to do:
print <unicode>.encode("utf-8"), <int>, <unicode>.encode("utf-8"), <string>
or:
import codecs, sys
out = codecs.getwriter(mylocale_encoding)(sys.stdout)
print >> out, <unicode>, <int>, <unicode>, <string>
or one of the other variations on this theme. The thing is that in Python, using an encoding assumed from the locale would be a bit of an all-or-nothing problem, which means that the most straightforward solution to this need would unravel all the important compromises of Python's Unicode implementation, and cause portability problems. Just to be clear on what this specious solution is, here is the relevant quote from the Unicode PEP (PEP 100):
Note that the default site.py startup module contains disabled optional code which can set the <default encoding> according to the encoding defined by the current locale. The locale module is used to extract the encoding from the locale default settings defined by the OS environment (see locale.py). If the encoding cannot be determined, is unknown or unsupported, the code defaults to setting the <default encoding> to 'ascii'. To enable this code, edit the site.py file or place the appropriate code into the sitecustomize.py module of your Python installation.
Don't do this. I admit the wart, but I don't see it as fundamental. I just see it as a gap in the API. If we had a cleaner way of writing "print-according-to-locale", I think we could close that docket.
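For what it's worth, here is a minimal sketch of what a "print-according-to-locale" helper might look like, built on the codecs.getwriter idiom. The name locale_print and its parameters are hypothetical, and the UTF-8 fallback and the "replace" error handling are my assumptions, not anything Python provides.

```python
import codecs
import io
import locale

def locale_print(stream, *items, **options):
    """Write items, space-separated, to a byte stream in the locale's encoding.

    stream must accept bytes. Hypothetical helper, not a standard API.
    """
    # Explicit encoding wins; otherwise ask the locale; fall back to UTF-8.
    encoding = options.get("encoding") or locale.getpreferredencoding() or "utf-8"
    # "replace" substitutes characters the target encoding cannot represent.
    writer = codecs.getwriter(encoding)(stream, errors="replace")
    unicode_type = type(u"")
    parts = []
    for item in items:
        if isinstance(item, bytes):
            # Legacy byte strings: assume they are already in the target encoding.
            item = item.decode(encoding, "replace")
        elif not isinstance(item, unicode_type):
            # Integers and other objects: use their text representation.
            item = unicode_type(item)
        parts.append(item)
    writer.write(u" ".join(parts) + u"\n")

buf = io.BytesIO()               # stands in for standard output in this sketch
locale_print(buf, u"na\u00efve", 42, u"r\u00e9sum\u00e9", encoding="utf-8")
```

In a real program the stream would be the process's byte-oriented standard output rather than a BytesIO, and the encoding argument would usually be left out so that the locale decides. The point is that the locale's encoding is applied at one explicit output call, not switched on globally as the site-wide default.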
But the bottom line is this: glibly saying that "Unicode sucks" or that "Python's Unicode sucks" because of these inevitable pitfalls is understandable as vented frustration when tackling programming problems (heck, integers suck, pixels suck, binary logic sucks, the Internet sucks, and so on: you get the point). But it's important for readers of such rants not to get the wrong impression. Pay attention for just a minute:
- Unicode solves a complex and hard problem, trying to make things as simple as possible, but no simpler
- Python's Unicode implementation reflects the necessary complexity of Unicode, with the addition of compatibility and portability concerns
- You can avoid almost all the common pitfalls of Python and Unicode by applying a discipline consisting of a handful of rules
- If you insist on the quickest hack out of a Unicode-related problem, you're probably asking for a dozen other problems to eventually take its place
My recent articles on Unicode include a modest XML bent, but there's more than enough on plain old Python and Unicode for me to recommend them:
At the bottom of the first two are resources and references that cover just about everything you ever wanted or needed to know about Unicode in Python.