Encodings and Unicode#

All data, whether numerical, text, images, or some other type, must be stored as ones and zeroes; encodings are the conventions that make sure everyone agrees on how to store, and how to interpret, strings of bytes.

Unicode is the most broadly applicable encoding for texts in all human languages. Even if you don’t use text as data, metadata usually includes text, and you may need to debug issues with getting text data into Python.

Note

A bit is the smallest unit of information storage: a single device that can be in one of two states, conventionally called 0 and 1.

A byte is a collection of 8 bits that can be interpreted as a number between 0 and 255.

Hexadecimal notation is a way of writing numbers stored in bytes using the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F to stand for the numbers 0-15. Bytes are written as two-character symbols, where the first character counts 16s and the final character counts ones:

100 =  6*16 +  4 = 0x64
127 =  7*16 + 15 = 0x7F
128 =  8*16 +  0 = 0x80
250 = 15*16 + 10 = 0xFA

The “0x” is a convention to remind you that 0x64 is written in hexadecimal.
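
As a quick check of these conversions, Python's built-in hex() function writes an integer in hexadecimal, and 0x literals (or int(..., 16)) go the other way:

print(hex(100), hex(127), hex(250))
print(0x64, 0x7F, int("FA", 16))
0x64 0x7f 0xfa
100 127 250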

ASCII - the first 128 code points#

ASCII is a long-standing standard (published by the American Standards Association in 1963) that assigns bytes with number values between 0 and 127 to letters in the basic Latin alphabet, punctuation marks, and a handful of control characters, including TAB (escaped in Python as \t), carriage return CR (escaped in Python as \r) and line feed LF (escaped in Python as \n).

These 128 symbols have the widest support and will usually cause no problems.
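
Python's built-in ord() and chr() functions translate between a character and its numerical value, which makes it easy to poke around in the ASCII table:

print(ord("A"), ord("a"), ord("\t"))
print(chr(65), chr(126))
65 97 9
A ~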

The symbols come in a conventional order:

* control characters like newline and tab,
* some punctuation marks, including " ' +  and - 
* Arabic numerals,
* more punctuation, including < = ? and @,
* the upper-case Latin letters,
* more punctuation, including [ \ and _,                              
* lower-case Latin letters, and finally, 
* the four punctuation marks { | } and ~, and a non-printing character, DEL.

This order controls string comparisons and string sorting operations by default. If you have ever sorted a list (such as a list of filenames) and found all the files beginning with capital letters appear before all of the files beginning with lower case letters, this sorting order is the reason.
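
For example, sorting a short list of words places every capitalized word before every lower-case word, because the upper-case letters have smaller values:

print(sorted(["banana", "Apple", "cherry", "Banana"]))
['Apple', 'Banana', 'banana', 'cherry']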

This gives us about a hundred printable characters, corresponding to the keys on an English-language keyboard. This is somewhat limiting. Bytes have 256 possible values, so why not use values 128-255 for something?

Latin-1 encoding - the rest of the characters?#

There are a handful of (now obsolete) encodings that assign the remaining byte values (128-255) to characters. The Latin-1 character encoding, also known as ISO-8859-1, is one such encoding, developed for European languages that use Latin alphabets. It provides another hundred or so characters.

Can I just make a string with four characters, with values \x43, \x61, \x66, and \xE9 (corresponding to C, a, f, and é)? (\xNN is a Python escaping syntax for bytes by hexadecimal value NN).

Yes. This works, and gives me a four-character Python bytestring that I need to remember is encoded using Latin-1 encoding.

cafe_string_latin1 = b'\x43\x61\x66\xe9'
print(type(cafe_string_latin1))
print(len(cafe_string_latin1))
<class 'bytes'>
4
print(cafe_string_latin1)
b'Caf\xe9'

This string doesn’t print nicely; I have to decode it from Latin-1 to get a printable string:

cafe_string = cafe_string_latin1.decode("latin-1")
print(cafe_string)
Café

This has converted a “bytes” datatype into a (unicode) “string” datatype.

print(type(cafe_string))
print(len(cafe_string))
<class 'str'>
4

Unicode!#

Unicode is a system for encoding every human language, unifying computer representations of disparate languages.
The unicode standard defines code points, or characters, that include every letter or symbol you could want in essentially every alphabet. Unicode version 15 contains definitions for roughly 149,000 characters, including about 11,000 Hangul syllables, 98,000 Chinese characters, and a few thousand emojis. Unicode code points are written by convention as U+ followed by a hexadecimal number, usually four digits (more for code points beyond U+FFFF).

Modern operating systems, keyboards, and browsers can render and use unicode in a wide variety of contexts; in the 21st century, you can create a file called 论文最终版.tex and expect it not to kill your thesis-building tools or your hard drive.

Python uses unicode to store strings, and the word “string” without qualification means a string that is permitted to contain any unicode sequence.

The word Café has four letters, and can be written as the sequence of four unicode symbols:

C U+0043  LATIN CAPITAL LETTER C
a U+0061  LATIN SMALL LETTER A
f U+0066  LATIN SMALL LETTER F
é U+00E9  LATIN SMALL LETTER E WITH ACUTE

Can I just make a string with four characters, with values \u0043, \u0061, \u0066, and \u00E9? Yes. This works, and gives me a four-character Python (unicode) string:

cafe_string = '\u0043\u0061\u0066\u00E9'
print(cafe_string)
Café

But I can just write the accented character in my source code and it creates the same four-character string, so why am I worrying?

print(cafe_string == "Café")
True
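
The code point table above can be recovered directly from the string, using the built-in ord() and the standard-library unicodedata module:

import unicodedata
for character in cafe_string:
    print(character, hex(ord(character)), unicodedata.name(character))
C 0x43 LATIN CAPITAL LETTER C
a 0x61 LATIN SMALL LETTER A
f 0x66 LATIN SMALL LETTER F
é 0xe9 LATIN SMALL LETTER E WITH ACUTE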

Encoding/decoding#

When it is time to save this string in a datafile (or to save my source code), the unicode symbols are stored as bytes.

Although any content can be written as a sequence of unicode symbols, the way that unicode symbols are encoded in digital storage is not entirely trivial. The unicode symbols, called codepoints, are the truth; the sequence of bytes that indicates a particular unicode symbol is the encoding. The most popular encoding is UTF-8, which uses between 1 and 4 bytes per codepoint, depending on the codepoint.

Python enforces a difference in data types between data interpreted as a sequence of bytes without an encoding (the bytes type) and sequences of characters where the encoding has been specified and decoded (the str type, called strings).

This shows us the five bytes that would be written in the utf-8 encoding:

print(cafe_string.encode("utf8"))
b'Caf\xc3\xa9'

UTF-8 encodes the unicode codepoint \u00E9 as the two bytes \xC3 and \xA9. This makes our encoded bytestring 5 bytes long. (If you ever discover that the Google Translate API fails for submissions that are only 4950 unicode characters but more than 5000 bytes, you will find that len() has different behavior on string datatypes, where it counts characters, and on bytes datatypes, where it counts bytes.)
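
The difference is easy to demonstrate: len() on the string counts four characters, while len() on its utf-8 encoding counts five bytes:

print(len(cafe_string))
print(len(cafe_string.encode("utf-8")))
4
5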

“UTF-8 uses the following rules:

  • If the code point is < 128, it’s represented by the corresponding byte value.

  • If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.” Python Unicode howto

“The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to FF hex (240 to 255): first byte of a four-byte sequence.”

(https://www.fileformat.info/info/unicode/utf8.htm)
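
A short sketch of these length classes, using one character from each (the emoji is code point U+1F600, which needs four bytes):

for character in ["A", "é", "中", "😀"]:
    encoded = character.encode("utf-8")
    print(character, len(encoded), encoded)
A 1 b'A'
é 2 b'\xc3\xa9'
中 3 b'\xe4\xb8\xad'
😀 4 b'\xf0\x9f\x98\x80'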

A consequence of this variable-length encoding is that there are sequences of bytes that are invalid unicode encodings: sequences of random bytes will often have bytes that are not valid as the first byte of a multi-byte sequence, or have continuing bytes that do not have the right values and cannot be interpreted as valid code points. If Python is asked to decode bytes into unicode (common when reading files or downloading content from the internet) and finds a problem, you get a UnicodeDecodeError. This is the most common kind of difficulty you will have with unicode: converting to and from utf8 when needed, and occasionally converting from formats other than utf-8.

The Latin-1 encoding has the virtue that it will interpret every byte (and every sequence of bytes) as a valid string; there are no “rules” that can be violated when every byte is a valid symbol. If you encounter a UnicodeDecodeError on some data, as a first attempt at debugging, try decoding the data using latin-1. If the data is not obviously corrupted, it is likely the input data was not encoded in utf-8 to start with.
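
A minimal sketch of that debugging pattern, assuming the variable raw_bytes (a hypothetical name) holds data whose encoding is unknown:

raw_bytes = b'Caf\xe9'   # pretend this arrived from a file of unknown encoding
try:
    text = raw_bytes.decode("utf-8")
except UnicodeDecodeError:
    text = raw_bytes.decode("latin-1")   # never raises; inspect the result
print(text)
Café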

If we try to decode latin-1 data as utf-8, we get a UnicodeDecodeError:

cafe_string_latin1.decode("utf8")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[9], line 1
----> 1 cafe_string_latin1.decode("utf8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data

Decoding latin-1 bytestrings correctly as latin-1 turns them back into Python strings:

cafe_string_latin1.decode("latin-1")
'Café'

And this matches our earlier Python string that was born in unicode.

cafe_string_latin1.decode("latin-1") == cafe_string
True

Making the converse mistake, decoding UTF-8 strings with the latin-1 decoder does not generate an error, but instead corrupts the data:

print(cafe_string.encode("utf-8").decode("latin-1"))
CafÃ©

This is not what we want. The symbols Ã© are a technical detail of how the unicode symbol U+00E9 is stored; they are not intended to be displayed without being decoded.

If you have ever seen content where accented Latin characters or punctuation marks are replaced with multiple, nonsensical accented characters, the cause is usually failing to decode utf-8-encoded text as utf-8.

Another place where you are likely to encounter unicode decoding is ingesting data from the web. The results from HTTP queries will often require explicit .decode("utf-8").
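
For example, the standard-library urllib.request returns raw bytes from .read(); the URL below is just a placeholder, and real servers may declare an encoding other than utf-8 in their response headers:

import urllib.request

with urllib.request.urlopen("https://example.com") as response:
    raw = response.read()
print(type(raw))
print(type(raw.decode("utf-8")))
<class 'bytes'>
<class 'str'>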

The Python documentation on when encoding and decoding are required, should you need it, is the Python Unicode HOWTO.

# Chinese characters usually require three bytes each in utf-8:
content = "十六进制"   # "hexadecimal"
content.encode("utf-8")
b'\xe5\x8d\x81\xe5\x85\xad\xe8\xbf\x9b\xe5\x88\xb6'
# But there is no reason to worry too much.
print(content, "让我头疼")   # "... gives me a headache"
十六进制 让我头疼