Unicode: Difference between revisions
Revision as of 13:28, 15 March 2009
Python
I have always found Python's Unicode methods confusing. For me, the confusion lay in mixing up the sense of encoding and decoding: initially I thought of "encoding" as "making" Unicode, and "decoding" as going back out of Unicode. In fact, it is exactly the opposite. A "Unicode" string in Python is better thought of as "un-coded", or at least "coding-neutral".
Encoded text is text that has already been translated into actual bytes of data with a particular encoding scheme, like "latin-1" or "utf-8". Note that "utf-8" is a particular encoding defined as part of the Unicode standard, so utf-8 encoded bytes are not a "Unicode" string. In Python, encoded text is "dumb" in the sense that the raw bytes carry no record of how they were encoded. Working with them is "dangerous" because you have to keep track of the encoding yourself: mixing different schemes, or calling functions that assume a different encoding from the one you have in mind, can produce wrong results. For this reason Python dutifully, though infuriatingly and seemingly always at inconvenient times, raises Unicode exceptions when something ambiguous has been attempted.
In contrast, one decodes to turn raw bytes of data into a proper Unicode object in Python. The resulting Unicode object is "smart" in the sense that Python knows exactly which characters it holds, independent of whatever encoding the bytes originally used. In this way, functions that work with Unicode objects can combine text from differently encoded sources, for example to splice together parts of texts, without any translation problems.
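To make the direction of the two operations concrete, here is a minimal sketch. It assumes Python 3, where `str` is the Unicode type and `bytes` is the encoded type (in the Python 2 of the rest of this page those roles are played by `unicode` and `str`). The sample word and variable names are only for illustration.

```python
# Python 3: str holds decoded text, bytes holds encoded data.
text = "Chère"

# Encoding goes from text to bytes; the result depends on the codec.
as_latin1 = text.encode("latin-1")   # b'Ch\xe8re'
as_utf8 = text.encode("utf-8")       # b'Ch\xc3\xa8re'

# Decoding goes from bytes back to text, and is only reliable
# when the codec matches the one used to encode.
round_trip = as_utf8.decode("utf-8")

# Decoding with the wrong codec either yields the wrong characters...
garbled = as_utf8.decode("latin-1")  # 'ChÃ¨re'

# ...or fails outright.
try:
    as_latin1.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

The same five characters become two different byte sequences, which is why the bytes alone cannot tell you what text they represent.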
The "Unicode Lifecycle" is thus:
Decode early
Turn raw bytes into Unicode as soon as you get them. Use the byte string's decode method, passing the encoding you know the bytes were written in, based on their source.
# raw bytes; we know they are latin-1 because of where they come from
raw = get_from_latin1_encoded_database("name")
ustr = raw.decode("latin-1")
Unicode everywhere
As long as everything you are working with is Unicode, Python should be able to handle it all without a single dreaded Unicode exception.
letter = u"Chère Madame %s ..." % ustr
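Many of the dreaded exceptions come from skipping the decode step. When Python 2 meets a byte string and a Unicode string in the same expression, it tries to decode the bytes implicitly with the ASCII codec, which fails on any accented character. The sketch below reproduces that failure explicitly. It assumes Python 3, which goes one step further and refuses to mix the two types at all; `raw_name` is a made-up stand-in for bytes read from a database.

```python
raw_name = "Hélène".encode("latin-1")   # stand-in for bytes from a database

# What Python 2 attempts implicitly when bytes meet Unicode text:
try:
    raw_name.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# Python 3 refuses to concatenate the two types at all.
try:
    "Chère Madame " + raw_name
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding first keeps everything in Unicode, and formatting just works.
name = raw_name.decode("latin-1")
letter = "Chère Madame %s ..." % name
```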
Encode late
Turn Unicode back into raw bytes only for output. Use the encode method of the Unicode object, passing the encoding required by the destination (a terminal, a database, a web page).
# output as part of a utf-8 encoded webpage...
print "Content-type: text/html; charset=utf-8"
print
print letter.encode("utf-8")
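The three steps can also be put together in one place. This is only a sketch under stated assumptions: it is written for Python 3, `render_letter` is a hypothetical helper that does not appear in any library, and the default encodings are simply the ones used in the examples on this page.

```python
def render_letter(raw_name, source_encoding="latin-1", output_encoding="utf-8"):
    """Hypothetical helper: decode early, work in Unicode, encode late."""
    name = raw_name.decode(source_encoding)   # decode early
    letter = "Chère Madame %s ..." % name     # Unicode everywhere
    return letter.encode(output_encoding)     # encode late


# latin-1 bytes for "Hélène", as they might arrive from the database
body = render_letter(b"H\xe9l\xe8ne")
```

Keeping the decode and encode calls at the edges means the code in the middle never has to know which encodings are in play.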