Say I do this:
>>> 'é' #1
'\xc3\xa9'
>>> u'é' #2
u'\xe9'
>>> print u'é' #3
é
This is my understanding:
1. When I type 'é' into my Python session, a byte array containing 2 bytes somehow lands in stdin, which Python reads from. The same bytes are sent to stdout and displayed in hexadecimal form.
2. Python looks at sys.stdin.encoding, finds utf-8, and decodes the 2 bytes into unicode. Then I am not sure what happens. Can we send a unicode string to stdout? Or maybe Python takes the hexadecimal representation of the unicode code point, encodes it in utf-8 and sends it to stdout?
3. print encodes it again in utf-8 and sends the result to stdout.
Is my understanding correct?
The Python interactive interpreter echoes the result of any expression, unless that result is None. Echoing always uses the repr() function to create a usable representation. Under the hood, objects have a __repr__ special method that does all the hard work here.
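
As a rough sketch of that echo rule: the helper name echo below is made up for illustration, and the real interpreter routes this through sys.displayhook, which does a little more than shown here.

```python
def echo(value):
    """Return the text the prompt would show, or None if nothing is echoed."""
    if value is None:
        return None          # None results are silently skipped
    return repr(value)       # everything else is shown via repr()

print(echo(1 + 1))           # an int is echoed through its repr
print(echo('a\nb'))          # the newline shows up as a backslash escape
```

Calling echo(None) returns None, which mirrors the prompt printing nothing at all for such a result.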
For strings, the value printed can be used directly in Python again to recreate the string, and any non-printable or non-ASCII bytes are represented with an escape sequence. Newlines become \n, for example, and the two UTF-8 bytes for é are represented with the \xhh hex escape.
Thus, for point 1, Python indeed received two bytes from the terminal and stored them in a string, and the representation of that string consists of the characters ', \, x, c, 3, etc. If you pasted that back into Python, you'd get the same string value again.
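
A small sketch of that round trip, written so that it also runs on Python 3. There, a byte string literal carries a b prefix, so the printed representation differs slightly from the Python 2 session in the question. The variable names are only illustrative.

```python
# The two UTF-8 bytes the terminal sends for é, spelled with escapes so the
# example does not depend on the encoding of this source file.
data = b'\xc3\xa9'

# Python 2 shows this without a prefix; Python 3 adds a leading b.
representation = repr(data)
print(representation)

# Feeding the representation back to Python recreates the same two bytes.
round_tripped = eval(representation)
print(round_tripped == data)
```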
For point 2, you created a Unicode string object. The terminal still sent two UTF-8 bytes, but this time you told Python to parse a u'..' string literal, and its contents are indeed decoded using sys.stdin.encoding.
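
The decoding step can be sketched as follows. The encoding is hard-coded to utf-8 here as an assumption, standing in for whatever sys.stdin.encoding reports on your terminal.

```python
raw = b'\xc3\xa9'            # bytes received from the terminal for é
stdin_encoding = 'utf-8'     # assumption: the value sys.stdin.encoding reports

# Decoding turns the two bytes into a single Unicode code point.
text = raw.decode(stdin_encoding)
print('%d bytes -> %d code point' % (len(raw), len(text)))
print(text == u'\xe9')
```

With a different terminal encoding the same two bytes would decode to something else, or fail to decode at all, which is why the reported encoding matters.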
The representation of a Unicode string object is another string literal, prefixed with u to show it is a Unicode string, not a regular string. Unicode codepoints in the range U+0080 through U+00FF (the Latin-1 range) are represented by the \xhh escape code. é is Unicode codepoint U+00E9, so it is represented by \xe9. Codepoints from U+0100 up to U+FFFF use the \uhhhh representation; for higher codepoints, \Uhhhhhhhh is used.
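
To see the three escape forms side by side, here is a sketch that uses the standard unicode_escape codec rather than repr(). It is used because Python 3's repr() shows printable non-ASCII characters as-is, while the codec produces these escape forms on both versions. The sample characters are arbitrary picks from each range.

```python
# One code point from each range: U+00E9, U+20AC and U+1F600.
samples = [u'\xe9', u'\u20ac', u'\U0001f600']

# unicode_escape yields ASCII-only bytes; decode them to get printable text.
escapes = [char.encode('unicode_escape').decode('ascii') for char in samples]

for escaped in escapes:
    print(escaped)
```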
Again, you can copy this representation, paste it back into Python and get the exact same value again.
print writes directly to sys.stdout. If you give print a Unicode string object, it first encodes the Unicode string value to a bytestring using sys.stdout.encoding, then writes the result to sys.stdout.
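
A minimal sketch of that encode-then-write step. The function emulate_print is hypothetical, an in-memory io.BytesIO stands in for the real sys.stdout, and utf-8 is assumed as the value of sys.stdout.encoding.

```python
import io

def emulate_print(text, stream, encoding='utf-8'):
    """Encode text to bytes and write it, followed by a newline, to stream."""
    stream.write(text.encode(encoding) + b'\n')

fake_stdout = io.BytesIO()           # stand-in for sys.stdout
emulate_print(u'\xe9', fake_stdout)

written = fake_stdout.getvalue()
print(repr(written))                 # the bytes a terminal would receive
```

The terminal then interprets those bytes with its own encoding, which is how the two UTF-8 bytes end up displayed as a single é.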