I'm writing up some notes on Henri Sivonen's article, "HOWTO Avoid Being Called a Bozo When Producing XML". For the most part it's an emphatic "wot he said", with some clarification in order on certain points. One of those is Python-specific. In the section "Use unescaped Unicode strings in memory" he says:
Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.
A Jython build does use Java's internal Unicode data type, and thus UTF-16, but a CPython build stores characters as either UCS-2 or UCS-4. The first option is UCS-2, not UTF-16. The two are so close that one might think the distinction pedantic, except that I've seen multiple users tripped up by the fact that CPython's internal format under the first option does not respect surrogate pairs, which it would have to do if it were UTF-16. The second option is UCS-4, not UTF-32, although the difference in this case truly is academic and probably would only affect people using certain chunks of the Private Use Areas.
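To make the surrogate-pair point concrete, here is a rough sketch of how a character outside the Basic Multilingual Plane breaks down into two 16-bit code units under UTF-16. The helper names `utf16_code_units` and `is_surrogate` are made up for illustration, and the sketch assumes it is run on Python 3. A UCS-2 build holds the same two units in memory but treats them as two unrelated characters, which is where the trouble starts.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units UTF-16 uses to encode text.

    Hypothetical helper for illustration; not part of any library.
    """
    raw = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

def is_surrogate(unit):
    """True if a 16-bit unit falls in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF

bmp_char = "\u00e9"         # U+00E9, inside the BMP: one code unit
astral_char = "\U0001D11E"  # U+1D11E, outside the BMP: a surrogate pair

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0xe9']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd834', '0xdd1e']
```

Code that slices, reverses, or counts such a string one 16-bit unit at a time can split the pair and leave a lone surrogate behind.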
You can't neatly categorize Python Unicode storage format by platform,
either. True, Jython is presently limited to UTF-16 storage, but you can
compile CPython to use either UCS-2 or UCS-4 on any platform. To get a
UCS-4 build, pass the option `--enable-unicode=ucs4` to Python's
configure script. To check whether your Python is a UCS-4 build, check
that `sys.maxunicode > 65536`. I would love to say that you don't have to worry whether your
Python uses UCS-2 or UCS-4. If you're communicating between Python tools
you should be using abstract Unicode objects which would be seamlessly
portable. The problem is that, as I warn at every opportunity, there are
serious problems with how Python core libraries handle certain
characters in UCS-2 builds, because of the lack of respect for surrogate
pairs. It is for this reason that I advise CPython users to always use
UCS-4 builds, if possible. It's unfortunate that UCS-4 (and even UTF-32)
is almost always a waste of space, but wasting space is better than
munging characters.
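The build check described above can be wrapped up as a small sketch. The function name `unicode_build_width` is my own invention for illustration. It relies only on `sys.maxunicode`, which is 65535 on a UCS-2 build and 1114111 on a UCS-4 build, so comparing against 0xFFFF gives the same answer as comparing against 65536. Note the assumption that you are on an interpreter where the build option still matters: CPython 3.3 and later always report 1114111.

```python
import sys

def unicode_build_width():
    """Report which storage width this interpreter advertises.

    Hypothetical helper for illustration; it only inspects sys.maxunicode.
    """
    # UCS-2 builds top out at U+FFFF; UCS-4 builds reach U+10FFFF.
    return "UCS-4" if sys.maxunicode > 0xFFFF else "UCS-2"

print(hex(sys.maxunicode), unicode_build_width())
```

A library that cares about characters beyond the BMP could run a check like this at import time and warn, rather than silently mishandling them.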
For more on all this, see my post "alt.unicode.kvetch.kvetch.kvetch", and especially see the articles I link to on Python/Unicode.
Overall disclaimer: I certainly don't claim to be without my own limitations in understanding and remembering the vagaries of Unicode, and of its Python implementation, so I bet someone will jump in with some correction, but I think I have the core, practical details right, whereas I think Henri's characterization was confusing.