Confusion over Python storage form for Unicode

I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build stores characters as either UCS-2 or UCS-4. Option one is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which would be required if it were UTF-16. Option two is UCS-4, not UTF-32, although the difference in this case truly is academic and would probably only affect people using certain chunks of the Private Use Areas.
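For readers who haven't met surrogate pairs, here is a minimal sketch of the arithmetic UTF-16 uses to split a code point above U+FFFF into two 16-bit code units. The function name `to_surrogate_pair` is my own, made up for illustration; it is not part of any Python library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10800)
print(hex(high), hex(low))  # 0xd802 0xdc00
```

A store that treats those two units as unrelated 16-bit characters is behaving like UCS-2; one that knows they belong together is behaving like UTF-16.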

You can't neatly categorize Python Unicode storage format by platform, either. True, Jython is presently limited to UTF-16 storage, but you can compile CPython to use either UCS-2 or UCS-4 on any platform. To get a UCS-4 build, pass the option `--enable-unicode=ucs4` to Python's `configure` script. To check whether your Python is a UCS-4 build, check that `sys.maxunicode > 0xFFFF`.

I would love to say that you don't have to worry whether your Python uses UCS-2 or UCS-4. If you're communicating between Python tools you should be using abstract Unicode objects, which would be seamlessly portable. The problem is that, as I warn at every opportunity, there are serious problems with how Python core libraries handle certain characters in UCS-2 builds, because of the lack of respect for surrogate pairs. It is for this reason that I advise CPython users to always use UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32) is almost always a waste of space, but wasting space is better than munging characters.
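The build check can be wrapped up as a small sketch like the one below. The function name `unicode_build` is hypothetical. Note also that, as far as I know, interpreters from Python 3.3 onward always report `sys.maxunicode` as 0x10FFFF regardless of how they were compiled, so the check is only informative on older CPython builds.

```python
import sys


def unicode_build():
    """Return 'UCS-4' for a wide build and 'UCS-2' for a narrow one.

    Hypothetical helper: narrow builds report sys.maxunicode == 0xFFFF,
    wide builds report 0x10FFFF.
    """
    return "UCS-4" if sys.maxunicode > 0xFFFF else "UCS-2"


print(unicode_build(), hex(sys.maxunicode))
```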

For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.

Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode, and of its Python implementation, so I bet someone will jump in with some correction, but I think I have the core, practical details right, whereas I think Henri's characterization was confusing.

[Uche Ogbuji]

via Copia

2 responses

No. CPython's 16-bit Unicode support is UTF-16; the Language Reference specifically states:

"Surrogate pairs may be present in the Unicode object, and will be reported as two separate items."

which is UTF-16 behaviour, not UCS-2.

What's more, if you're arguing that (for instance) len(u'\U00010800') should always be 1, you're wrong. It should be 2 on "UCS2" Python because that is the number of code units in the string, and more code depends on the number of code units than depends on the number of Unicode code points.
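To make the code unit versus code point distinction concrete, here is a rough sketch that counts both in a way that should not depend on how the interpreter stores strings. The helper names are invented for this example, and going through the codecs is just one convenient way to count, not the only one.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16 (hypothetical helper)."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Number of Unicode code points in text (hypothetical helper)."""
    return len(text.encode("utf-32-le")) // 4


s = "\U00010800"  # a single character outside the Basic Multilingual Plane
print(utf16_code_units(s))  # 2
print(code_points(s))       # 1
```

On a 16-bit build `len(s)` matches the first number; on a UCS-4 build it matches the second.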

Also, using UCS4 does not gain you anything here because of combining characters. You can still come up with a combining character sequence that displays to the user as a single logical character (in Unicode, a "grapheme cluster"), but that consists of more than one code point. You already have to handle this if you're writing code to process Unicode strings, so handling surrogate pairs as well is no additional work.
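A small sketch of the combining-character point: the string below displays as one accented letter but holds two code points on any build. The helper `count_base_characters` is a made-up name and only a crude stand-in for real grapheme cluster segmentation, which has more rules than simply skipping combining marks.

```python
import unicodedata


def count_base_characters(text):
    """Count code points that are not combining marks.

    Hypothetical helper; a rough approximation of user-visible characters,
    not a full grapheme cluster algorithm.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(s))                    # 2
print(count_base_characters(s))  # 1
```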

— Alastair Houghton
