Unicode

More than you ever wanted to know about Unicode.

Python

Everything described here applies to Unicode in Python versions up to 2.6. In Python 3, a lot of good changes have been made (namely all strings are Unicode, and the default encoding Python assumes is utf-8). But before the world updates to Python 3, it's simply necessary to be a bit careful.

I have always found the Unicode methods confusing. The confusion, for me, lay in mixing up the sense of encoding and decoding: initially I thought of "encoding" as meaning "making" Unicode, and "decoding" as going back out of Unicode. In fact, it is exactly the opposite. A "Unicode" string in Python could better be thought of as "un-coded" or at least "coding-neutral".

Encoded text is text that has already been translated into actual bytes of data with a particular encoding scheme, like "latin-1" or "utf-8". Note that "utf-8" is a particular encoding defined as part of the Unicode standard, and thus not a "Unicode" string in the Python sense. Encoded text is "dumb" in the sense that the raw bytes carry no record of how they were encoded. Working with them is "dangerous" because you need to keep track of the encoding yourself: mixing different schemes, or using functions that assume a different encoding than the one you have in mind, can produce wrong results. For this reason Python dutifully, though at the most inconvenient of times, raises Unicode exceptions when something less than crystal clear has been attempted.

So you decode raw bytes of data into a proper Unicode object in Python. The resulting Unicode object is "smart" in the sense that it holds the characters themselves rather than bytes in some particular format. Because of this, functions that work with Unicode objects can, say, splice together parts of texts that originally arrived in different encodings. Once you have Unicode objects, you don't have to "think" about what format things are in; Python should handle the tricky parts.
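To make the direction of the two operations concrete, here is a minimal sketch. The sample bytes and variable names are made up for illustration, and the snippet is written so that it should run on Python 2.6+ as well as Python 3.3+.

```python
# Raw bytes as they might arrive from a latin-1 source:
# the word "Chere" with an e-grave stored as the single byte 0xE8.
raw_latin1 = b"Ch\xe8re"

# Decode: bytes + the encoding you know they use -> Unicode object.
text = raw_latin1.decode("latin-1")

# Encode: Unicode object -> bytes in whatever encoding the output needs.
raw_utf8 = text.encode("utf-8")

print(len(text))      # 5 characters
print(len(raw_utf8))  # 6 bytes, because e-grave takes two bytes in utf-8
```

The same five characters end up as different byte sequences depending on the encoding, which is why the bytes alone are not enough to know what text they represent.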

Comparison to image formats

Imagine that Unicode is like a GIMP (or "Photoshop") document in terms of being a format that actually holds different things together (think layers, settings, etc.) and is "smart" about keeping everything working. UTF-8, latin-1 (aka iso-8859-1, aka Western) and ASCII are like PNG, JPEG, and GIF: particular "encodings" that you actually save in a file or display in a web browser. Thus you "decode" from a particular format (JPEG, PNG) into the umbrella Unicode to do your work, then finally "encode" (or "save as...") using a particular format. In many cases you may choose to work with UTF-8 as both the input and output encoding, but it's important to understand the difference between UTF-8 (a particular encoding) and Unicode (the overarching system that holds everything together). In Python, Unicode strings are objects, while any actual input and output (reading from or writing to a file, or writing to stdout) is always done using a particular encoding.

When Python complains by throwing a Unicode exception, it's like the warning your image editor would give when trying to save an image that has transparency or layers as a JPEG (which doesn't support these features). Python forces you to make a decision and refuses to do anything that might produce incorrect or missing data.
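As a small illustration of such a complaint (a sketch with made-up variable names, not taken from the original text): the euro sign has no byte in latin-1, so encoding it fails unless you explicitly say what should happen instead.

```python
euro = u"\u20ac"  # EURO SIGN, which is not part of latin-1

try:
    euro.encode("latin-1")
    failed = False
except UnicodeEncodeError:
    # Python refuses to silently drop or mangle the character.
    failed = True

# Making the decision explicit: accept a lossy "?" substitute...
replaced = euro.encode("latin-1", "replace")

# ...or pick an encoding that can represent the character.
as_utf8 = euro.encode("utf-8")

print(failed)
```

The "replace" error handler is the equivalent of clicking "flatten image" in the save dialog: you get output, but you have agreed to lose information.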

The golden rules

The "Unicode Lifecycle" has been usefully summarized by Kumar McMillan in a talk at PyCon 2008 [1] as follows:

Decode early
Unicode everywhere
Encode late

Decode Early

Turn those tricky raw bytes of text into Unicode as soon as you get them. Use the string's decode method, along with the encoding you know the bytes were written in, based on the source.

# "raw" rather than "str", so the built-in str type is not shadowed
raw = get_from_latin1_encoded_database("name")
ustr = raw.decode("latin-1")

Many Python libraries, like BeautifulSoup or Feedparser, already deliver text as Unicode objects to you, so you don't need to decode anything again.

Unicode everywhere

As long as everything you are using is Unicode, Python should be able to handle everything without a single dreaded Unicode exception.

letter = u"Chère Madame %s ..." % ustr

Encode late

Turn Unicode back into raw bytes for output. Use the encode method of the Unicode object, and give the format desired/required by the output (a Terminal, a database, a webpage).

# output as part of a utf-8 encoded webpage...
print "Content-type: text/html; charset=utf-8"
print
print letter.encode("utf-8")
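The three steps can also be put together in one place. The following is only a sketch: build_letter and its parameters are hypothetical names invented here, and the sample name is made up. It is written so that it should run on Python 2.6+ and Python 3.3+.

```python
def build_letter(raw_name, source_encoding="latin-1", output_encoding="utf-8"):
    """Hypothetical helper showing the decode / Unicode / encode lifecycle."""
    # Decode early: bytes from the outside world become Unicode at once.
    name = raw_name.decode(source_encoding)

    # Unicode everywhere: all string work happens on Unicode objects.
    letter = u"Ch\xe8re Madame %s ..." % name

    # Encode late: convert to bytes only at the point of output.
    return letter.encode(output_encoding)


# A latin-1 encoded name goes in, utf-8 encoded bytes come out.
page_bytes = build_letter(b"Lef\xe8vre")
print(repr(page_bytes))
```

Keeping the two encodings as parameters at the edges means the middle of the program never needs to know which encodings are in play.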

Reading from a file

In Python, the codecs module has a file open function that takes an encoding option to indicate the format of the file. This option only tells Python how to interpret the file's bytes as they are read; it does not detect the encoding or convert the file itself. It is up to you to know/ensure the format of the file you are opening.

import codecs
f = codecs.open("myfile.txt", encoding='utf-8')
for line in f:
    print repr(line)

Note that in the above example, calling the repr function means that the Unicode text gets displayed with special characters escaped (and thus will display with no problems on any kind of terminal, as the output is pure ASCII).
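In Python 3, repr no longer escapes non-ASCII characters (the built-in ascii function does that instead). A way to get ASCII-safe output that should behave the same on Python 2.6+ and Python 3.3+ is to encode with the "backslashreplace" error handler. This is a swapped-in technique rather than what the example above does, and the sample line is made up.

```python
line = u"c'est tr\xe8s bon\n"

# Characters that ASCII cannot represent are written as backslash escapes
# instead of raising UnicodeEncodeError.
escaped = line.encode("ascii", "backslashreplace")

print(repr(escaped))
```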

To show the actual contents of the file, you would then encode the text to match the encoding scheme of your Terminal, so in the (likely) case that your terminal is (also) set to utf-8:

import codecs
f = codecs.open("myfile.txt", encoding='utf-8')
for line in f:
    print line.encode("utf-8")

Or if your terminal was "latin-1":

import codecs
f = codecs.open("myfile.txt", encoding='utf-8')
for line in f:
    print line.encode("latin-1")

Writing to a file

Just like reading from a file, writing special characters to a file uses the open command in the codecs module.

import codecs
txt = codecs.open("data.txt", "w", encoding='utf-8')

Now you can safely print special characters (make sure you have decoded early to Unicode objects), as in:

print >> txt, u"c'est très bon, unicode!"

or using file.write (where data is a Unicode object):

txt.write(data + u"\n")
txt.close()
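To see what actually lands on disk, here is a round-trip sketch. The temporary path and variable names are assumptions made for the example; codecs.open is used as in the text above, and the built-in open is used in binary mode only to peek at the raw bytes.

```python
import codecs
import os
import tempfile

# A throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "data.txt")

# Writing: Unicode goes in, utf-8 bytes are stored.
out = codecs.open(path, "w", encoding="utf-8")
out.write(u"c'est tr\xe8s bon, unicode!\n")
out.close()

# The raw bytes on disk: e-grave has become the two bytes 0xC3 0xA8.
with open(path, "rb") as f:
    raw = f.read()

# Reading: utf-8 bytes are decoded back into Unicode objects.
inp = codecs.open(path, encoding="utf-8")
lines = [line for line in inp]
inp.close()

print(repr(raw))
```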

Checking the encoding of a file

On Unix-like systems, you can use the file command in a shell to display the encoding of a file, if it can be determined.

file ducasse_poésies.txt

ducasse_poésies.txt: ISO-8859 English text, with CRLF line terminators
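A rough equivalent can be sketched in Python by simply trying candidate encodings in turn. guess_encoding is a hypothetical helper written for this example, not a standard library function, and like the file command it can only make an educated guess.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate encoding that decodes data without error.

    Order matters: latin-1 accepts every possible byte sequence, so it only
    makes sense as a last resort and proves nothing about the real encoding.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"po\xc3\xa9sies"))  # valid utf-8
print(guess_encoding(b"po\xe9sies"))      # not valid utf-8, falls through to latin-1
```

A successful decode only shows that the bytes are valid in that encoding, not that the author used it, so knowledge of the source remains the better guide.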

Including non-ASCII characters in your code, specifying the encoding of your code files

When you use special (non-ASCII) characters in your source code, it is necessary to tell the Python interpreter what encoding you are using to save the file. This is done with a special comment line, similar to the "she-bang" used in a CGI script. In fact you may often see the two used together, with the she-bang first (since it must be the first line) and the encoding line second.

http://www.python.org/dev/peps/pep-0263/

#!/usr/bin/python
#-*- coding:utf-8 -*-

print u"c'est très bon, unicode!"

Making Regular Expressions work well with international characters

Using Python's locale module allows you to make the regular expression module smarter about, for instance, which characters count as part of a "word" (patterns compiled with the re.LOCALE flag follow the current locale for \w and \b).
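When the text has already been decoded to Unicode, an alternative that does not depend on the machine's locale settings is the re.UNICODE flag. This is a different technique from the locale approach, shown here as a sketch with a made-up sample string; it should run on Python 2.6+ and Python 3.3+ (where the flag is already the default for Unicode patterns).

```python
import re

text = u"c'est tr\xe8s bon, unicode!"

# An ASCII-only character class splits the word at the accented letter.
ascii_words = re.findall(r"[a-zA-Z]+", text)

# With re.UNICODE, \w follows the Unicode character database,
# so the e-grave counts as a word character.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

print(len(ascii_words))    # 6 fragments
print(len(unicode_words))  # 5 words
```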