
Why Character Encoding Sucks In Your Language Part 1: Python

Character encoding is one of those topics many software engineers would prefer to have only a minimal understanding of. It's perceived as tricky and uninteresting, and it's pretty much an afterthought for most projects that aren't internationalized. This lack of attention leads to tricky bugs that find a way around traditional tests (this function works for the string "hello world", so I guess it works!). In fact, if you love breaking things, you should definitely learn about character encoding, because it gives you the power to find bugs in lots of production software.

Sometimes for historical reasons, and sometimes under the guise of "simplifying" a difficult topic, our programming toolbox does anything but help the situation. In this blog series, I'm going to assert that the APIs around character encoding in modern languages are completely broken, starting with Python.

Python 2.x

The overloaded str object

Probably the biggest problem with Python 2's text encoding boils down to a naming problem. In Python 2, you have a datatype called str and one called unicode. These names are simply wrong. A str object is basically an array of bytes (numbers from 0 to 255). Any str object in your code could be either the bytes of an encoded string, or decoded text that happens to be ASCII.

If you don't know the difference between these two uses, consider a character encoding that isn't ASCII compatible. If you have a string you'd like to encode into it, calling mystring.encode("chadcoding") on your str returns another str. It's up to you to keep track of which str objects are being used as byte arrays and which are being used as text.
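To make the ambiguity concrete, here's a sketch using UTF-16 (a real encoding that isn't ASCII compatible) in place of the made-up "chadcoding". The encoded result and a plain str used as text have exactly the same type:

>>> encoded = u"hello".encode("utf-16")   # a str full of encoded bytes
>>> type(encoded)
<type 'str'>
>>> type("hello")                         # indistinguishable from a str used as text
<type 'str'>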

So, the solution is that str should really be called bytes and unicode should really be called str, and the fact that internally it uses Unicode should be nothing more than an implementation detail. Of course, this is one of the several changes made in the Python 3.x series. It's just that "well it's fixed in a later version" isn't the greatest consolation given that most Python programmers spend their time in 2.x.

Broken Encoding and Decoding APIs

As a result of the ambiguity of the str object, there are some really really bad APIs around it. In particular, str and unicode both have encode and decode functions. This is what gets people in trouble. If you've ever witnessed an error like this:

 >>> u"\u00ff".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python-2.7.2/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)

then you've been a victim of this bad design decision. str should not have an encode function, and unicode definitely shouldn't have a decode function. The above stack trace is especially sinister. What's bad in particular is that this "works":

>>> u"hello world".decode("utf-8")
u'hello world'

What's especially interesting about the first stack trace is that the error happens while trying to encode, even though your code never called encode! Well, Python is doing something a little presumptuous here. It says, "I see you're calling decode on a unicode object. That makes no sense, so I'll implicitly encode your unicode object into a str first, using ASCII as the encoding, and then use that str to decode". If the unicode object happens to be all ASCII characters, encoding to ASCII works fine; otherwise you get the confusing exception.
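Spelled out as the explicit two-step equivalent, here's a sketch of what Python is effectively doing behind the scenes:

>>> u"hello world".encode("ascii").decode("utf-8")   # the implicit round trip; happens to work
u'hello world'
>>> u"\u00ff".encode("ascii").decode("utf-8")        # the implicit encode is what blows up
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)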

Basically, if you use unicode.decode in your code, there's a 100% chance of a bug when dealing with international characters, but the code will seem to work fine until then. If that isn't a broken API, I don't know what is.

Counting Characters

For historical reasons, it was once believed that 16 bits were enough to represent all "code points" in Unicode. That's no longer the case, but one side effect that remains with us is that several language implementations store strings as 2-byte "code units". This means that characters which can't be represented in 2 bytes (in practice, any non-BMP character) are instead represented with 4 bytes, using a "surrogate pair".

In other words, internally, not all characters are the same length. Calling len on a Python 2 unicode object, or on a str in Python 3.0-3.2, won't count the number of characters, but rather the number of code units in the string. This behavior was changed in Python 3.3 as part of PEP 393. Consider:

Python 2.7.2 (default, Aug  1 2011, 14:45:00)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u"\U0001D49E"
>>> print s
𝒞
>>> len(s)
2
>>> s[0]
u'\ud835'
>>> s[1]
u'\udc9e'
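For comparison, here's a sketch of the same session on a wide (UCS-4) build of Python 2; Python 3.3+ behaves the same way thanks to PEP 393:

>>> s = u"\U0001D49E"
>>> len(s)
1
>>> s[0]
u'\U0001d49e'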

There's one more "gotcha". The above paragraphs are only true when Python was compiled with the 2-byte character representation. The compile option --enable-unicode can be set to ucs2 or ucs4. Compiling with ucs4 means len will return what you expect, but even that oversimplifies string length: combining characters and diacritical marks make the notion of a string's "length" non-obvious, but that's a subject for another blog post. You can find out the compile option at runtime with:

import sys
# True on a "narrow" (UCS-2) build, False on a "wide" (UCS-4) build
PYTHON_UCS2 = sys.maxunicode == 65535

Related to this, there's no built-in way in Python 2.x-3.2 (when compiled with UCS2) to iterate over the code points of a string. A for loop will instead iterate over the code units, which isn't likely to be what you want.
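For illustration, here's a minimal sketch of walking code points yourself on a narrow build by pairing up surrogates; iter_code_points is a made-up helper, not something from the standard library:

def iter_code_points(u):
    # yield one code point at a time, joining surrogate pairs back together
    i = 0
    while i < len(u):
        ch = u[i]
        if u'\ud800' <= ch <= u'\udbff' and i + 1 < len(u):
            yield ch + u[i + 1]   # high surrogate + low surrogate = one code point
            i += 2
        else:
            yield ch
            i += 1

len(list(iter_code_points(u"a\U0001D49Eb")))  # 3 code points, even though len() says 4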

Things Python Did Right

Adding a new encoding to Python is super easy. Just use the codecs module and register a "search" function. What also works great is the second argument to encode:

u"mystring \u00ff".encode("ascii", "strict")  # throws UnicodeEncodeError
u"mystring \u00ff".encode("ascii", "replace") # "mystring ?"
u"mystring \u00ff".encode("ascii", "ignore")  # "mystring "

This makes it easy to determine whether a string is encodable in a given encoding; simply catch the UnicodeEncodeError.
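As a sketch of the codec registration mentioned above: here I'm just aliasing the built-in rot_13 codec under the made-up name "chadcoding", but a real custom encoding would return its own CodecInfo with real encode/decode functions:

import codecs

def search(name):
    # return a CodecInfo for names we recognize, None otherwise
    if name == "chadcoding":
        return codecs.lookup("rot_13")   # stand-in for a real custom codec
    return None

codecs.register(search)

u"hello".encode("chadcoding")   # 'uryyb'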

Finally, one thing Python did right was to have ASCII as the lowest common denominator, instead of Latin-1, like some other languages. This works out because it means if any byte is greater than 127, Python can throw an error and require an explicit encoding instead of guessing one. This shows up in source file encodings and in byte strings in Python 2. For example:

>>> "\xff".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

If Python had decided Latin-1 would be the lowest common denominator, this snippet would have happily encoded "ÿ" into UTF-8. Since it didn't, if you want that behavior you're forced to ask for it explicitly, which in my opinion is more correct:

>>> "\xff".decode("latin-1").encode("utf-8")
'\xc3\xbf'

which is the UTF-8 for ÿ.

Python 3

Python 3 takes the bullet of backwards incompatibility to fix things like character encoding. So, how fixed is it? It's actually pretty good. The ambiguity of str is fixed, and decode doesn't exist on it. Similarly, bytes do not have an encode method. Pasting unicode characters into the REPL works much better, although some regressions exist for this in 3.3.
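A quick sketch of what that looks like in a Python 3 shell:

>>> "ÿ".encode("utf-8")          # str encodes to bytes
b'\xc3\xbf'
>>> b'\xc3\xbf'.decode("utf-8")  # bytes decode to str
'ÿ'
>>> "ÿ".decode("utf-8")          # the nonsensical direction simply doesn't exist
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'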

Final Score

Python 2: C

Python 3: A-
