Unicode

From XPUB & Lens-Based wiki

Python

For one reason or another, I have always found the Unicode methods confusing.

My confusion lay in mixing up the sense of encoding and decoding: initially I thought of "encoding" as meaning "making" Unicode, and "decoding" as going back out of Unicode. In fact, it is exactly the opposite.

Encoded text is text that has already been translated into actual bytes of data with a particular encoding scheme, like "latin-1" or "utf-8". In Python, encoded text is "dumb" in the sense that the raw bytes have no inherent sense of how they have been encoded. Working with them is "dangerous" in the sense that you need to know which encoding has been used: mixing different schemes, or passing the bytes to functions that expect a different one, can produce wrong results. For this reason Python dutifully, though infuriatingly for a beginner, often complains with Unicode errors when something ambiguous has been attempted.
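A small sketch of why raw bytes are "dumb": the same six bytes give different text depending on which encoding you assume. The byte values are chosen purely for illustration, and the b"..." and u"..." literals are used so that the snippet should run unchanged on Python 2.7 and Python 3.

```python
# Six raw bytes: the word "Chere" with an e-grave, as written by a UTF-8 source.
raw = b"Ch\xc3\xa8re"

# Nothing in the bytes says "I am UTF-8"; decoding succeeds under both guesses.
as_utf8 = raw.decode("utf-8")      # 5 characters, e-grave intact
as_latin1 = raw.decode("latin-1")  # 6 characters, e-grave split into two junk characters

print(len(raw))              # 6
print(len(as_utf8))          # 5
print(len(as_latin1))        # 6
print(as_utf8 == as_latin1)  # False
```

The wrong guess does not raise an error here; it silently produces the wrong text, which is the "dangerous" part.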

Decoded text is text that has been wrapped into a proper Python Unicode string. When turning "raw bytes" of text into a Unicode object, using a byte string's decode method, you tell Python what encoding the bytes are in. The resulting Unicode object is "smart" in the sense that it holds the characters themselves rather than bytes in some particular encoding, so the question of format no longer arises. In this way, text that arrived in different encodings can be handled together once it has been decoded, for example to combine the contents of two strings.
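To make this concrete, here is a minimal sketch with two made-up byte strings that carry the same word in two different encodings. As bytes they do not match; once decoded they are equal and can be combined safely. The variable names are illustrative only.

```python
# The same word (with an e-grave) as it would arrive from two different sources.
latin1_bytes = b"Ch\xe8re"      # encoded as latin-1
utf8_bytes = b"Ch\xc3\xa8re"    # encoded as utf-8

# As raw bytes the two do not compare equal, although the text is the same.
same_as_bytes = (latin1_bytes == utf8_bytes)

# Decode each with the encoding of its own source.
a = latin1_bytes.decode("latin-1")
b = utf8_bytes.decode("utf-8")

# As Unicode they are the same text and can be joined without surprises.
same_as_text = (a == b)
combined = a + u" / " + b

print(same_as_bytes)  # False
print(same_as_text)   # True
```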

The "Unicode Lifecycle" is thus:

Decode early

Turn raw bytes into Unicode as soon as you get them. Use the byte string's decode method, along with the encoding you know the bytes were written in, based on the source.

# "raw" rather than "str", to avoid shadowing Python's built-in str type
raw = get_from_latin1_encoded_database("name")
ustr = raw.decode("latin-1")
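A runnable version of the step above might look like the following. The function get_from_latin1_encoded_database is a hypothetical stand-in (here it just returns hard-coded bytes), and the name in it is invented for the example. The sketch also shows what happens when the encoding is guessed wrongly.

```python
def get_from_latin1_encoded_database(field):
    # Hypothetical stand-in for a real database call: returns raw latin-1 bytes.
    fake_row = {"name": b"Ren\xe9e"}
    return fake_row[field]

raw = get_from_latin1_encoded_database("name")

# Decode immediately, using the encoding the source is known to use.
ustr = raw.decode("latin-1")

# A wrong guess fails loudly here, because these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(wrong_guess_failed)  # True
```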

Unicode everywhere

As long as everything you are working with is Unicode, Python should be able to handle it all without a single dreaded Unicode exception (UnicodeDecodeError or UnicodeEncodeError).

label = u"Chère Madame %s" % ustr
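As a small illustration of what staying in Unicode buys you, the sketch below formats, measures and upper-cases the label purely as text. The name is a made-up value standing in for something decoded earlier, and accents are written as escapes so the snippet does not depend on the encoding of the source file.

```python
# A value that was decoded earlier (made up for this example).
ustr = u"Ren\xe9e"

# Formatting a Unicode template with a Unicode value gives Unicode.
label = u"Ch\xe8re Madame %s" % ustr

# Operations work on characters, not bytes, so accents count as one character
# and case conversion knows about them.
length = len(label)
shouted = label.upper()

print(length)  # 18
```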

Encode late

Turn Unicode back into raw bytes for display and other "final" or fixed uses. Use the encode method of the Unicode object, and give the encoding required by the output medium (the charset used by the terminal, a database, or a web page).

# output as part of a utf-8 encoded webpage...
print "Content-type: text/html; charset=utf-8"
print
print label.encode("utf-8")
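The choice of output encoding matters, as the following sketch suggests: the same Unicode label becomes different bytes for different media, and an encoding that cannot represent a character raises an error unless you ask for a fallback. The label text is invented for the example.

```python
label = u"Ch\xe8re Madame Ren\xe9e"

# Same text, different bytes depending on what the output medium expects.
as_utf8 = label.encode("utf-8")      # accents become two bytes each
as_latin1 = label.encode("latin-1")  # accents become one byte each

# ASCII has no accented letters, so a strict encode fails.
try:
    label.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The "replace" error handler substitutes "?" for anything unrepresentable.
as_ascii = label.encode("ascii", "replace")

print(len(as_utf8))    # 20
print(len(as_latin1))  # 18
print(ascii_ok)        # False
```

Because encoding can lose information in this way, it is best left until the very last moment, when the target format is actually known.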