Recently there has been a spate of kvetching about Unicode, and in some
cases Python's Unicode implementation. Some people feel that it's all
too complex, or too arbitrary. This just boggles my mind. Have people
chosen to forget how many billions of people there are on the planet?
How many languages, cultures and writing systems are represented in this
number? What the heck do you expect besides complex and arbitrary? Or
do people think it would be better to exclude huge sections of the
populace from information technology just to make their programmer lives
that marginal bit easier? Or maybe people think it would be easier to
deal separately with each of the hundreds of local text encoding schemes
that were incorporated into Unicode? What the hell?
Listen. Unicode is hard. Unicode will melt your brain at times. It's
the worst system out there, except for all the alternatives. (Yeah,
yeah, the old chestnut.) But if you want to produce software that is
ready for a global user base (actually, you often can't escape the need
for character i18n even if you're just writing for your hamlet), you
have little choice but to learn it, and you'll be a much better
developer once you've learned it properly. Read it from the
well-regarded Joel Spolsky: "The Absolute
Minimum Every Software Developer Absolutely, Positively Must Know About
Unicode and Character Sets (No Excuses!)". And remember, as Joel
says, "There Ain't No Such Thing As Plain
Text".
As for Python, that language has an excellent Unicode
implementation. A group of brilliant contributors tackled a barge load
of pitfalls and complexities involved in shunting full Unicode support
into the workings of a language that was already in heavy use. I think
they nailed most of the compromises, and again a lot of the things that
people rail at as arbitrary and obstructing are so precisely because
other (in some cases superficially attractive) alternatives are even
worse. Almost all languages that support Unicode have snags of their
own (see above: Unicode is hard). It's instructive to read Norbert
Lindenberg's valediction in World
Views.
Norbert is the technical lead for Java internationalization at Sun and
in this entry he summarizes the painful process of i18n in Java,
including Unicode support. Considering the vast resources available for
Java development you have to give Python's developers a lot of credit
for doing so well with so few resources.
The most important thing for Pythoneers to remember is that if you go
through the discipline of using Unicode properly in all your APIs, as I
urge over and over again in my Python XML
column, you will not even notice most of
these snags. My biggest gripe with Python's implementation, the confusion
between code points and storage units in several of the "string" APIs,
is addressed in practice (through theoretical fudge, to be precise) by
always using Python compiled for UCS4 character storage, as I discussed
at the beginning of "More Unicode
Secrets". See that
article's sidebar for more background on this issue.
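To make the code point versus storage unit distinction concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter (3.3 or later), where len() on a string counts code points regardless of how Python was built, so the UTF-16 count is computed by hand to mimic what a narrow (UCS2) build would have reported. The helper name utf16_units is made up for this illustration.

```python
def utf16_units(text):
    """Return the number of 16-bit storage units text needs in UTF-16."""
    # Each UTF-16 code unit is two bytes; the "-le" variant adds no BOM.
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

for ch in (bmp_char, astral_char):
    print("U+%04X: %d code point(s), %d UTF-16 unit(s)"
          % (ord(ch), len(ch), utf16_units(ch)))
```

The second character is a single code point that needs two storage units (a surrogate pair) in UTF-16, which is exactly where an API that counts storage units and one that counts characters start to disagree.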
Because Python currently has both legacy strings and Unicode objects and
because the APIs for these overlap, there is some room for confusion
(again, there was little practical choice considering reasonable
requirements for backwards compatibility). The solution is discipline.
We long ago went through and adopted a coherent Unicode policy for
4Suite's core libraries. It was painful, but eliminated a great number
of issues. Martijn Faassen
mentions
similar experience in one of his projects.
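As a rough illustration of what such a policy tends to look like, the sketch below follows one common convention: decode bytes to Unicode at the input boundary, work only with Unicode inside, and encode once at the output boundary. This is my assumption about the general shape of such a discipline, not 4Suite's actual code. The function names are hypothetical, and it is written for Python 3, where str is the Unicode type and bytes plays the role of the legacy string.

```python
def read_text(raw, encoding="utf-8"):
    """Input boundary: turn bytes into Unicode as early as possible."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already Unicode

def shout(text):
    """Core logic: accepts and returns Unicode only."""
    if not isinstance(text, str):
        raise TypeError("expected Unicode text, got %r" % type(text))
    return text.upper()

def write_text(text, encoding="utf-8"):
    """Output boundary: encode exactly once, as late as possible."""
    return text.encode(encoding)

incoming = "na\u00efve caf\u00e9".encode("utf-8")   # bytes, as from a file or socket
outgoing = write_text(shout(read_text(incoming)))
print(repr(outgoing))   # repr keeps the demo output ASCII-safe on any terminal
```

Having the core function refuse bytes outright is a deliberate choice in this sketch: a loud TypeError at the API boundary is much easier to track down than a silent implicit conversion that fails later on the first non-ASCII input.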
It is worth a word on the most common gripe: ASCII for the default site
encoding rather than UTF-8. Marc-Andre has a tutorial PDF slide set that
serves as a great beginner's guide to Unicode in Python. I recommend it
overall, but in particular see pages 23 through 26 for discussion of this
trade-off. One thing that would be nice is if print could be smarter
about the output encoding. Right now, if you're trying to do:
print <unicode>, <int>, <unicode>, <string>
Then to be safe you have to do:
print <unicode>.encode("utf-8"), <int>, <unicode>.encode("utf-8"), <string>
or:
out = codecs.getwriter(mylocale_encoding)(sys.stdout)
print >> out, <unicode>, <int>, <unicode>, <string>
or one of the other variations on this theme. The thing is that in
Python, assuming the encoding from the locale would be a bit of an
all-or-nothing problem, which means that the most straightforward
solution to this need would unravel all the important compromises of
Python's Unicode implementation, and cause portability problems. Just
to be clear on what this specious solution is, here is the relevant
quote from the Unicode PEP, #100:
Note that the default site.py startup module contains disabled
optional code which can set the <default encoding> according to
the encoding defined by the current locale. The locale module is
used to extract the encoding from the locale default settings
defined by the OS environment (see locale.py). If the encoding
cannot be determined, is unknown or unsupported, the code defaults
to setting the <default encoding> to 'ascii'. To enable this
code, edit the site.py file or place the appropriate code into the
sitecustomize.py module of your Python installation.
Don't do this. I admit the wart, but I don't see it as fundamental.
I just see it as a gap in the API. If we had a cleaner way of writing
"print-according-to-locale", I think we could close that docket.
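For what it's worth, a helper along these lines is easy to sketch. The version below is only an illustration of the "print-according-to-locale" idea, not an existing API: the name locale_print and its parameters are invented here, it targets Python 3, and it assumes that locale.getpreferredencoding() is an acceptable guess at what the terminal expects.

```python
import codecs
import io
import locale
import sys

def locale_print(*items, stream=None, encoding=None, errors="replace"):
    """Write items, space-separated, encoded for the user's locale."""
    if stream is None:
        stream = sys.stdout.buffer   # the underlying byte stream
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    out = codecs.getwriter(encoding)(stream, errors=errors)
    # str() each item so Unicode text and integers can be mixed freely.
    out.write(" ".join(str(item) for item in items) + "\n")
    out.flush()

# Demonstrate against an in-memory byte stream with an explicit encoding.
buf = io.BytesIO()
locale_print("caf\u00e9", 42, "\u00fcber", stream=buf, encoding="latin-1")
print(repr(buf.getvalue()))
```

The point of the sketch is that the encoding decision lives in one explicit call at the output boundary, rather than in a process-wide default that every implicit conversion would silently pick up.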
The bottom line is that glibly saying "Unicode sucks" or "Python's
Unicode sucks" because of these inevitable pitfalls is understandable
as vented frustration when tackling programming problems (heck,
integers suck, pixels suck, binary logic sucks, the Internet sucks, and
so on: you get the point). But it's important that readers of such
rants not get the wrong impression. Pay attention for just a
minute:
- Unicode solves a complex and hard problem, trying to make things as
simple as possible, but no simpler
- Python's Unicode implementation reflects the necessary complexity of
Unicode, with the addition of compatibility and portability concerns
- You can avoid almost all the common pitfalls of Python and Unicode by
applying a discipline consisting of a handful of rules
- If you insist on the quickest hack out of a Unicode-related problem,
you're probably asking for a dozen other problems to eventually take its
place
My recent articles on Unicode include a modest XML bent, but there's
more than enough on plain old Python and Unicode for me to recommend
them:
At the bottom of the first two are resources and references that cover
just about everything you ever wanted or needed to know about Unicode in
Python.
[Uche Ogbuji]