A few basic questions about encoding, unicode and stdout

Question

Say I do this:

>>> 'é'        #1
'\xc3\xa9'
>>> u'é'       #2
u'\xe9'
>>> print u'é' #3
é

This is my understanding:

1. When I pasted 'é' into my Python session, a sequence of 2 bytes somehow landed in stdin, which Python read from. The same bytes are sent to stdout and displayed in hexadecimal form.
2. This time Python has to decode the bytes: it reads sys.stdin.encoding, finds utf-8, and decodes the 2 bytes into unicode. Then I am not sure what happens. Can we send a unicode string to stdout? Or maybe Python takes the hexadecimal representation of the unicode code point, encodes it in utf-8 and sends it to stdout?
3. Python decodes the 2 bytes into unicode. Then print encodes it again in utf-8 and sends the result to stdout.

Is my understanding correct?

Answer 1

The Python interactive interpreter echoes the result of any expression unless that result is None. Echoing always uses the repr() function to create a usable representation. Under the hood, objects have a __repr__ special method that does all the hard work here.

For strings, the value printed can be used directly in Python to recreate the string; any non-printable or non-ASCII bytes are represented with an escape sequence. Newlines become \n, for example, and the two UTF-8 bytes for é are represented with the \xhh hex escape.

Thus, for point 1, Python indeed received two bytes from the terminal and stored them in a string; the representation of that string consists of the characters ', \, x, c, 3, etc. If you pasted that back into Python, you'd get the same string value again.
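A minimal sketch of point 1, written so it runs on both Python 2 and Python 3. It assumes the terminal uses UTF-8 and builds the two bytes by encoding U+00E9 explicitly instead of reading them from stdin. The names utf8_bytes, byte_values and round_tripped are illustrative only.

```python
import ast

# The two bytes a UTF-8 terminal would send for e-acute (U+00E9).
utf8_bytes = u'\xe9'.encode('utf-8')

# Inspect the raw byte values; iterating a bytearray yields integers.
byte_values = [hex(b) for b in bytearray(utf8_bytes)]
print(len(utf8_bytes))   # 2
print(byte_values)       # ['0xc3', '0xa9']

# repr() produces a literal made of plain ASCII characters.
# Python 2 shows '\xc3\xa9'; Python 3 shows b'\xc3\xa9'.
text = repr(utf8_bytes)
print(text)

# Evaluating that literal again recreates the same byte string.
round_tripped = ast.literal_eval(text)
print(round_tripped == utf8_bytes)   # True
```

The only difference between the two Python versions here is the b prefix in the representation; the bytes themselves are identical.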

For 2., you created a Unicode string object. The terminal sent the same two UTF-8 bytes, but this time you told Python to parse a u'..' string literal, which is indeed decoded using sys.stdin.encoding.
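The decoding step for point 2 can be sketched as follows. This is an approximation of what the interpreter does, not its actual implementation: the bytes are hard-coded, and the encoding is assumed to be UTF-8 because sys.stdin.encoding may be None or something else when stdin is not a terminal.

```python
import sys

# Bytes as they would arrive from a UTF-8 terminal (hard-coded assumption).
terminal_bytes = b'\xc3\xa9'

# What this process reports for stdin; may be None when stdin is piped.
reported = getattr(sys.stdin, 'encoding', None)
print('sys.stdin.encoding reports: %r' % (reported,))

# Assumed encoding for this sketch, matching the session in the question.
assumed_encoding = 'utf-8'
decoded = terminal_bytes.decode(assumed_encoding)

print(len(decoded))        # 1: two bytes became a single code point
print(hex(ord(decoded)))   # 0xe9
```

Two bytes go in and one code point comes out, which is why the representation in #2 shows a single \xe9 escape.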

The representation for a Unicode string object is another string literal, prefixed with u to show it is a Unicode string, not a regular string. Unicode codepoints in the range U+0080 through to U+00FF (the Latin-1 range) are represented by the \xhh escape code. é is Unicode codepoint U+00E9, so it is represented by \xe9. Codepoints from U+0100 up to U+FFFF use the \uhhhh representation; for higher codepoints, \Uhhhhhhhh is used.

Again, you can copy this representation, paste it back into Python and get the exact same value again.
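The three escape forms can be seen side by side in the sketch below. Note the substitution: it uses the unicode_escape codec as a stand-in for repr(), because Python 3's repr() leaves printable non-ASCII characters unescaped, while the codec produces the same \xhh, \uhhhh and \Uhhhhhhhh escapes on both Python 2 and Python 3. The sample characters are chosen for illustration.

```python
# One sample code point from each range described above.
samples = [
    u'\xe9',        # U+00E9, Latin-1 range
    u'\u20ac',      # U+20AC, within U+0100..U+FFFF
    u'\U0001f600',  # U+1F600, above U+FFFF
]

# Escape each character, then view the escape as ASCII text.
escaped = [s.encode('unicode_escape').decode('ascii') for s in samples]
for item in escaped:
    print(item)
# \xe9
# \u20ac
# \U0001f600

# Decoding the escapes again gives back the original characters.
restored = [e.encode('ascii').decode('unicode_escape') for e in escaped]
print(restored == samples)   # True
```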

print writes directly to sys.stdout; if you give print a Unicode string object, it first uses sys.stdout.encoding to encode the Unicode string value to a bytestring before writing it to sys.stdout.
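The encode-before-write step can be simulated without a real terminal. In this sketch an io.TextIOWrapper around an io.BytesIO stands in for sys.stdout, with UTF-8 assumed as the stream encoding; fake_stdout and sink are hypothetical names, and this is not how print is implemented internally.

```python
import io

# A byte sink wrapped in a text layer, standing in for sys.stdout.
sink = io.BytesIO()
fake_stdout = io.TextIOWrapper(sink, encoding='utf-8')

# Writing Unicode text makes the wrapper encode it before it hits the sink.
fake_stdout.write(u'\xe9')
fake_stdout.flush()

written = sink.getvalue()
print(written == b'\xc3\xa9')                # True
print(written == u'\xe9'.encode('utf-8'))    # True

# With an encoding that cannot represent the character, encoding fails.
try:
    u'\xe9'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)                          # True
```

The last part hints at why printing Unicode can raise UnicodeEncodeError when the output stream's encoding is ASCII.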

1 answer

solution1
4 ACCEPTED 2014-03-19 14:27:47