alt.unicode.kvetch.kvetch.kvetch

Recently there has been a spate of kvetching about Unicode, and in some cases Python's Unicode implementation. Some people feel that it's all too complex, or too arbitrary. This just boggles my mind. Have people chosen to forget how many billions of people there are on the planet? How many languages, cultures and writing systems are represented in this number? What the heck do you expect besides complex and arbitrary? Or do people think it would be better to exclude huge sections of the populace from information technology just to make their programmer lives that marginal bit easier? Or maybe people think it would be easier to deal separately with each of the hundreds of local text encoding schemes that were incorporated into Unicode? What the hell?

Listen. Unicode is hard. Unicode will melt your brain at times. It's the worst system out there, except for all the alternatives. (Yeah, yeah, the old chestnut). But if you want to produce software that is ready for a global user base (actually, you often can't escape the need for character i18n even if you're just writing for your hamlet), you have little choice but to learn it, and you'll be a much better developer once you've learned it properly. Read it from the well-regarded Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". And remember, as Joel says, "There Ain't No Such Thing As Plain Text".

As for Python, that language has an excellent Unicode implementation. A group of brilliant contributors tackled a barge load of pitfalls and complexities involved in shunting full Unicode support into the workings of a language that was already in heavy use. I think they nailed most of the compromises, and again a lot of the things that people rail at as arbitrary and obstructing are so precisely because other (in some cases superficially attractive) alternatives are even worse. Almost all languages that support Unicode have snags of their own (see above: Unicode is hard). It's instructive to read Norbert Lindenberg's valediction in World Views. Norbert is the technical lead for Java internationalization at Sun and in this entry he summarizes the painful process of i18n in Java, including Unicode support. Considering the vast resources available for Java development you have to give Python's developers a lot of credit for doing so well with so few resources.

The most important thing for Pythoneers to remember is that if you go through the discipline of using Unicode properly in all your APIs, as I urge over and over again in my Python XML column, you will not even notice most of these snags. My biggest gripe with Python's implementation, the confusion between code points and storage units in several of the "string" APIs, is addressed in practice (through theoretical fudge, to be precise) by always using Python compiled for UCS4 character storage, as I discussed at the beginning of "More Unicode Secrets". See that article's sidebar for more background on this issue.
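To make the code point versus storage unit distinction concrete, here is a minimal sketch. It assumes a current Python 3 interpreter, where `len()` on a string counts code points (the behavior a UCS4 build gives you). The helper names `code_points` and `utf16_units` are mine, purely for illustration.

```python
# Sketch only: helper names are hypothetical, Python 3 is assumed.

def code_points(text):
    """Number of Unicode code points in text (what len() reports on Python 3.3+)."""
    return len(text)

def utf16_units(text):
    """Number of 16-bit storage units needed to hold text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

for label, s in (("bmp", bmp), ("astral", astral)):
    # The astral character is one code point but two UTF-16 storage units.
    print(label, code_points(s), utf16_units(s))
```

A string API that counts storage units would report the second number as the "length" of a one-character string, which is the mix-up I'm griping about.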

Because Python currently has both legacy strings and Unicode objects and because the APIs for these overlap, there is some room for confusion (again, there was little practical choice considering reasonable requirements for backwards compatibility). The solution is discipline. We long ago went through and adopted a coherent Unicode policy for 4Suite's core libraries. It was painful, but eliminated a great number of issues. Martijn Faassen mentions similar experience in one of his projects.
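One common shape for such a discipline is to decode bytes at the edges of your program, handle only Unicode in between, and encode again on the way out. The following is a rough sketch of that idea, not 4Suite's actual policy. It assumes Python 3 (where `bytes` plays the role of the legacy string), and the function names are hypothetical.

```python
# Sketch only: hypothetical function names, Python 3 assumed.

def read_text(raw, encoding="utf-8"):
    """Inbound boundary: turn encoded bytes into Unicode text."""
    return raw.decode(encoding)

def shout(text):
    """Core logic: accepts and returns Unicode text, never encoded bytes."""
    if isinstance(text, bytes):
        raise TypeError("expected Unicode text, got encoded bytes")
    return text.upper()

def write_text(text, encoding="utf-8"):
    """Outbound boundary: encode Unicode text for the outside world."""
    return text.encode(encoding)

incoming = b"caf\xc3\xa9"  # UTF-8 bytes for "café"
outgoing = write_text(shout(read_text(incoming)))
```

Because the core function refuses bytes outright, a stray legacy string fails loudly at the boundary instead of surfacing later as a confusing encoding error.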

It is worth a word on the most common gripe: ASCII for the default site encoding rather than UTF-8. Marc-Andre has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python. I recommend it overall, but in particular see pages 23 through 26 for discussion of this trade-off. One thing that would be nice is if print could be smarter about the output encoding. Right now, if you're trying to do:

print <unicode>, <int>, <unicode>, <string>

Then to be safe you have to do:

print <unicode>.encode("utf-8"), <int>, <unicode>.encode("utf-8"), <string>

or:

import codecs, sys
out = codecs.getwriter(mylocale_encoding)(sys.stdout)
print >> out, <unicode>, <int>, <unicode>, <string>

or one of the other variations on this theme. The thing is that in Python, using an encoding assumed from the locale would be a bit of an all-or-nothing problem, which means that the most straightforward solution to this need would unravel all the important compromises of Python's Unicode implementation, and cause portability problems. Just to be clear on what this specious solution is, here is the relevant quote from the Unicode PEP (PEP 100):

Note that the default site.py startup module contains disabled optional code which can set the <default encoding> according to the encoding defined by the current locale. The locale module is used to extract the encoding from the locale default settings defined by the OS environment (see locale.py). If the encoding cannot be determined, is unknown or unsupported, the code defaults to setting the <default encoding> to 'ascii'. To enable this code, edit the site.py file or place the appropriate code into the sitecustomize.py module of your Python installation.

Don't do this. I admit the wart, but I don't see it as fundamental. I just see it as a gap in the API. If we had a cleaner way of writing "print-according-to-locale", I think we could close that docket.
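For illustration, a "print-according-to-locale" helper might look something like the sketch below. This is not an existing Python API: `print_locale` is a hypothetical name, the code assumes Python 3, and falling back to "ascii" with `errors="replace"` is my own choice for the example. The point is that the encoding decision stays local to the call rather than changing the interpreter-wide default.

```python
import codecs
import io
import locale
import sys

def print_locale(*values, stream=None, encoding=None, errors="replace"):
    """Hypothetical helper: write values encoded per the locale's preferred encoding."""
    if stream is None:
        stream = sys.stdout.buffer  # binary layer beneath sys.stdout
    if encoding is None:
        # Ask the locale module; fall back to ASCII if it reports nothing.
        encoding = locale.getpreferredencoding(False) or "ascii"
    writer = codecs.getwriter(encoding)(stream, errors=errors)
    writer.write(" ".join(str(v) for v in values) + "\n")

# Demonstration against an in-memory byte stream with an explicit encoding.
demo = io.BytesIO()
print_locale("na\u00efve", 42, stream=demo, encoding="utf-8")
```

Here `demo.getvalue()` holds the UTF-8 bytes for the line. Leaving out `stream` and `encoding` would send the same text to standard output in whatever encoding the locale reports.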

But the bottom line is that glibly saying that "Unicode sucks" or that "Python's Unicode sucks" because of these inevitable pitfalls is understandable as vented frustration when tackling programming problems (heck, integers suck, pixels suck, binary logic sucks, the Internet sucks, and so on: you get the point). But it's important for the rant readers not to get the wrong impression. Pay attention for just a minute:

Unicode solves a complex and hard problem, trying to make things as simple as possible, but no simpler
Python's Unicode implementation reflects the necessary complexity of Unicode, with the addition of compatibility and portability concerns
You can avoid almost all the common pitfalls of Python and Unicode by applying a discipline consisting of a handful of rules
If you insist on the quickest hack out of a Unicode-related problem, you're probably asking for a dozen other problems to eventually take its place

My recent articles on Unicode include a modest XML bent, but there's more than enough on plain old Python and Unicode for me to recommend them:

At the bottom of the first two are resources and references that cover just about everything you ever wanted or needed to know about Unicode in Python.

[Uche Ogbuji]

via Copia

11 responses

Good points, I get tired of blog authors jumping on some negative bandwagon and that's the only thing you see for weeks on end.

— saluk

It is hard. One of the things that makes it hard is that coherent encoding spans systems and can't easily be dealt with in one place - not unlike security issues. "Unicode firewalling" is the meme running round my head at the moment :)

The other thing that makes it hard is explaining the potential complexity costs to customers without melting their brains with text arcana. Thankfully, Glyph Lefkowitz has come up with a great analogy to image codecs: http://www.livejournal.com/users/glyf/39250.html

— Bill de hOra

Of course there is such a thing as plain text. It's a sequence of Unicode characters without markup. :-)

— John Cowan

"Recently there has been a spate of kvetching about Unicode, and in some cases Python's Unicode implementation. Some people feel that it's all too complex, or too arbitrary. This just boggles my mind."

"The solution is discipline. We long ago went through and adopted a coherent Unicode policy for 4Suite's core libraries. It was painful, but eliminated a great number of issues. Martijn Faassen mentions similar experience in one of his projects."

Hang on. You admit that the Python Unicode shortcomings cause significant pain to developers, but when people complain about it, it "boggles your mind"? What's so mind-boggling? It causes pain. It doesn't have to. So people complain. That seems perfectly reasonable to me.

— Jim

Jim,

No. I'm not sure where you read that. It was not in my Weblog entry.

First of all, I only admitted to two nits in Python's Unicode: (1) the code point / storage unit mix-up in UCS2 builds, and (2) the lack of a print-to-locale facility for the sorts of quick & dirty work that print is usually used for.

(1) has an easy fix: use UCS4 builds. Even if you can't, few developers will run into that problem. (2) is yet another minor "wouldn't it be nice if..." API proposal. Neither of these adds up to what could be called "shortcomings that cause significant pain", and neither is any worse than what you'll find in other languages.

And I also address the "it doesn't have to" bit. I say that Unicode is hard over and over again. And even so, as I also say over and over again, with a modest amount of discipline and experience you can avoid having to deal with much of the complexity.

The kvetching does not seem the least bit reasonable to me, especially since a lot of the kvetching betrays poor knowledge of Unicode and the decision points that led to Python's design choices. They also propose solutions that are more broken than the status quo. So yes, it boggles my mind.

— Uche


In addition to the resources in the articles I mentioned, AMK just posted a new Unicode HOWTO:

http://www.amk.ca/python/howto/unicode

It retreads a lot of territory from other documents, and I don't know whether the idea is to consolidate and make it the One Blessed Document for Python Unicode.

IMO, Unicode strings changing nature depending on how the interpreter was compiled is a design bug.

Using UTF-32 builds is not an easy fix. Python does not exist in a vacuum. Jython, IronPython and PyObjC need to bridge strings with environments where UTF-16 is pretty much cast in concrete.

Even if using UTF-16 as the programmer-visible abstraction is kind of uncool compared to UTF-32, I think having the abstraction locked to UTF-16 is better than having the abstraction change underneath your code.

— Henri Sivonen

Henri,

I could easily agree with you, as long as Python's API is "fixed" to be useful in a UCS2 build. I think it's crazy to ask users to have to check surrogates themselves just to get an accurate string length. As I admitted, UCS4 is no more than a hacky workaround, but it does make the very APIs work in a sane manner.
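To show what "checking surrogates themselves" means, here is a rough sketch of counting code points when all you have is a sequence of UTF-16 storage units. It assumes Python 3, and both function names are hypothetical; it is an illustration of the chore, not code from any Python build.

```python
import struct

def utf16_code_units(text):
    """Return text as a list of 16-bit UTF-16 storage units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_code_points(units):
    """Count code points, treating each high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A well-formed surrogate pair spans two units; anything else is one.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

units = utf16_code_units("a\U0001D11Eb")  # 'a', MUSICAL SYMBOL G CLEF, 'b'
print(len(units), count_code_points(units))
```

The string has three characters but four storage units, so a length API that counts units is off by one for every character outside the BMP.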

Unfortunately, Python core developers do not want to fix the storage-unit/code-point issue, last I checked.
