
Encoding Unicode using UTF-8

In Python, if I type

euro = u'\u20AC'
euroUTF8 = euro.encode('utf-8')
print(euroUTF8, type(euroUTF8), len(euroUTF8))

the output is

('\xe2\x82\xac', <type 'str'>, 3)

I have two questions: 1. It looks like euroUTF8 is encoded in 3 bytes, but how do I get its binary representation to see how many bits it contains? 2. What does the 'x' in '\xe2\x82\xac' mean? I don't think 'x' is a hex digit. And why are there three '\'?

In Python 2, print is a statement, not a function, so you are printing a tuple here. To print the individual elements, remove the parentheses:

>>> euro = u'\u20AC'
>>> euroUTF8 = euro.encode('utf-8')
>>> print euroUTF8, type(euroUTF8), len(euroUTF8)
€ <type 'str'> 3

Now you get the 3 individual objects written as strings to stdout; my terminal just happens to be configured to interpret anything written to it as UTF-8, so the bytes correctly result in the Euro symbol being displayed.
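What the terminal does with those bytes can be mimicked by decoding them explicitly. The following is a minimal sketch, assuming Python 3 (where encoded data is a separate bytes type rather than str); the variable names are illustrative only and not part of the original code:

```python
# Python 3 sketch: decode the UTF-8 bytes the way a terminal would.
euro = '\u20ac'
euro_utf8 = euro.encode('utf-8')          # b'\xe2\x82\xac'

# A UTF-8 terminal reads the three bytes as one character.
as_utf8 = euro_utf8.decode('utf-8')

# A Latin-1 terminal would read the same bytes as three separate characters.
as_latin1 = euro_utf8.decode('latin-1')

print(as_utf8 == euro)   # True
print(len(as_latin1))    # 3
```

The bytes are the same in both cases; only the interpretation applied to them differs.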

The \x<hh> sequences are Python string literal escape sequences (see the reference documentation); they are what repr() produces for a string that contains non-ASCII or non-printable bytes. You'll see the same thing when echoing the value in an interactive interpreter:

>>> euroUTF8
'\xe2\x82\xac'
>>> euroUTF8[0]
'\xe2'
>>> euroUTF8[1]
'\x82'
>>> euroUTF8[2]
'\xac'

They give you ASCII-safe debugging output. Python's built-in containers, including lists, tuples and dictionaries, display their contents using repr(), which is why printing the tuple produced this form.
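The container behaviour can be checked directly. This is a small sketch assuming Python 3, where the repr() of encoded data additionally carries a b prefix; the names in_list and in_dict are made up for the example:

```python
# Python 3 sketch: containers show their elements via repr().
euro_utf8 = '\u20ac'.encode('utf-8')

in_list = repr([euro_utf8])
in_dict = repr({'euro': euro_utf8})

print(in_list)  # [b'\xe2\x82\xac']
print(in_dict)  # {'euro': b'\xe2\x82\xac'}
```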

To see the bits that make up these values, convert each byte to an integer with the ord() function, then format the integer as binary:

>>> ' '.join([format(ord(b), '08b') for b in euroUTF8])
'11100010 10000010 10101100'
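For comparison, here is a sketch of the same idea assuming Python 3. There, iterating over a bytes object already yields integers, so ord() is not needed; total_bits is an illustrative name added to answer the bit-count question:

```python
# Python 3 sketch: iterating over bytes yields ints, so no ord() call.
euro_utf8 = '\u20ac'.encode('utf-8')

# Format each byte as 8 binary digits, zero-padded on the left.
bits = ' '.join(format(b, '08b') for b in euro_utf8)

# Every byte is 8 bits wide, so the total is simply 8 * number of bytes.
total_bits = 8 * len(euro_utf8)

print(bits)        # 11100010 10000010 10101100
print(total_bits)  # 24
```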
  1. UTF-8 is a variable-width encoding built from 8-bit code units: each character takes between 1 and 4 bytes. euroUTF8 is 3 bytes long, so it occupies 3 × 8 = 24 bits, and you don't need a binary representation to count them. (If you still want to see the bits, refer to Martijn's answer.)

  2. \x introduces an escape sequence: the two characters that follow it are hexadecimal digits giving the value of one byte. So the x itself is not a hex digit; it only marks what follows, which is the value you are interested in. The backslash begins the escape sequence and is not part of the value. There are three backslashes because there are three bytes, each written as its own escape.
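Both points can be verified with a short sketch, assuming Python 3 (bytes.hex() needs 3.5 or later). The sample characters other than the euro sign are arbitrary choices for illustration:

```python
# Python 3 sketch: \x escapes are hex byte values; UTF-8 width varies.
euro_utf8 = '\u20ac'.encode('utf-8')

# '\xe2' is just another spelling of the byte with value 0xe2 (226).
same_byte = b'\xe2' == bytes([0xe2]) == bytes([226])

# The same three bytes written as plain hex digits.
hex_form = euro_utf8.hex()

# Encoded length in bytes for characters from different Unicode ranges.
samples = ('A', '\u00e9', '\u20ac', '\U0001f600')
widths = [len(ch.encode('utf-8')) for ch in samples]

print(same_byte)  # True
print(hex_form)   # e282ac
print(widths)     # [1, 2, 3, 4]
```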
