I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a
Bozo When Producing XML". For the most part it's an emphatic "wot he
said", with some clarification in order on certain points. One of those
is Python-specific. In the section "Use unescaped Unicode strings in
memory" he says:
"Moreover, the chances for mistakes are minimized when in-memory
strings use the encoding of the built-in Unicode string type of the
programming language if your language (or framework) has one. For
example, in Java you’d use `java.lang.String` and `char[]` and,
therefore, UTF-16. Python has the complication that the Unicode string
type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on
how the interpreter was compiled. With C it makes sense to choose one
UTF and stick to it."
A Jython build does use Java's internal Unicode data type, and thus
UTF-16, but a CPython build will store characters as either UCS-2 or
UCS-4. Option one is UCS-2, not UTF-16. The two are so close that
one might think the distinction pedantic, except that I've seen multiple
users tripped up by the fact that CPython's internal format under the
first option does not respect surrogate pairs, which would be required
if it were UTF-16. Option two is UCS-4, not UTF-32, although the
difference in this case truly is academic and probably would only affect
people using certain chunks of Private Use Areas.
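To make the surrogate-pair point concrete, here is a minimal sketch in Python 3 syntax. The helper names `to_surrogate_pair` and `utf16_code_units` are hypothetical, chosen only for illustration. It shows how a character outside the Basic Multilingual Plane turns into two 16-bit code units in UTF-16. A UCS-2 build stores the same two units but treats them as two unrelated characters, which is where `len()` and slicing start to surprise people.

```python
import struct

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses to represent text."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
pair = to_surrogate_pair(ord(clef))
units = utf16_code_units(clef)
print([hex(u) for u in units])  # ['0xd834', '0xdd1e']
```

One logical character, two code units: a UTF-16 implementation is obliged to treat the pair as a unit, while a UCS-2 one is not.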
You can't neatly categorize Python Unicode storage format by platform,
either. True, Jython is presently limited to UTF-16 storage, but you can
compile CPython to use either UCS-2 or UCS-4 on any platform. To build
for UCS-4, pass the option `--enable-unicode=ucs4` to Python's
`configure` script. To check whether your Python is a UCS-4 build, check
that `sys.maxunicode > 65535`. I would love to say that you don't have to worry whether your
Python uses UCS-2 or UCS-4. If you're communicating between Python tools,
you should be using abstract Unicode objects, which would be seamlessly
portable. The problem is that, as I warn at every opportunity, there are
serious problems with how Python core libraries handle characters outside
the Basic Multilingual Plane in UCS-2 builds, because of the lack of
respect for surrogate pairs. It is for this reason that I advise CPython
users to always use
UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32)
is almost always a waste of space, but wasting space is better than
munging characters.
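The build check described above can be wrapped up as a small sketch. The function name `unicode_build` is hypothetical. One caveat, as far as I understand it: on CPython 3.3 and later the narrow/wide build distinction no longer exists and `sys.maxunicode` is always `0x10FFFF`, so there this will always report a wide build. The check is only informative on older interpreters.

```python
import sys

def unicode_build():
    """Classify the interpreter's Unicode storage from sys.maxunicode."""
    # A UCS-2 (narrow) build reports 0xFFFF;
    # a UCS-4 (wide) build reports 0x10FFFF.
    if sys.maxunicode > 0xFFFF:
        return "wide"
    return "narrow"

print(unicode_build(), hex(sys.maxunicode))
```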
For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch",
and especially see the articles I link to on Python/Unicode.
Overall disclaimer: I certainly don't claim to be without my own
limitations in understanding and remembering the vagaries of Unicode,
and of its Python implementation, so I bet someone will jump in with
some correction, but I think I have the core, practical details right,
whereas I think Henri's characterization was confusing.
[Uche Ogbuji]