
UTF-8 encoding and decoding in Python

I have a UTF-8 string piped from Java to Python.

The end result is

'\xe0\xb8\x9a\xe0\xb8\x99'

Hence, for example:

a = '\xe0\xb8\x9a\xe0\xb8\x99'

a.decode('utf-8') 

gives me the result

u'\u0e1a\u0e19'

However, what I am curious about is: since the bytes are piped in as UTF-8, why would the result be

'\xe0\xb8\x9a\xe0\xb8\x99'

instead of u'\u0e1a\u0e19'?

If I were to encode u'\u0e1a\u0e19', I would get back '\xe0\xb8\x9a\xe0\xb8\x99'.

So what is the inherent difference between these two, and how do I actually know when to use decode and when to use encode?

"UTF-8 string" is insufficient to describe what '\xe0\xb8\x9a\xe0\xb8\x99' is; it really should be called the UTF-8 encoding of a Unicode string.

Python 2's unicode type and Python 3's str type represent a string of Unicode code points, so u'\u0e1a\u0e19' is the Python representation of the two code points U+0E1A and U+0E19, and in human terms it will be rendered as บน.

As for explaining the encode and decode calls, we will use your example. What you got back from Java is a stream of raw bytes, so to make it useful as human text you need to decode '\xe0\xb8\x9a\xe0\xb8\x99' as UTF-8 encoded input in order to turn it back into the Unicode code points it represents (which is u'\u0e1a\u0e19'). Calling encode on that string of Unicode code points turns it back into a series of bytes (which in Python 2 will be of type str and in Python 3 will be of type bytes), giving back the series of bytes '\xe0\xb8\x9a\xe0\xb8\x99'.
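In Python 3 syntax (where raw bytes are written with a b'' literal), the round trip above can be sketched as:

```python
# Raw bytes as received from Java: the UTF-8 encoding of two Thai characters
raw = b'\xe0\xb8\x9a\xe0\xb8\x99'

# decode: bytes -> Unicode code points
text = raw.decode('utf-8')
assert text == '\u0e1a\u0e19'        # the two code points U+0E1A U+0E19, i.e. บน

# encode: Unicode code points -> bytes, recovering the original stream
assert text.encode('utf-8') == raw
```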

Of course, you can encode those Unicode code points into other encodings, such as UTF-16, which on little-endian platforms will result in the bytes '\xff\xfe\x1a\x0e\x19\x0e', or encode those code points into a non-Unicode encoding. As this looks like Thai, we can use the iso8859-11 encoding, which will encode them into the bytes '\xba\xb9' - but this is not cross-platform, as it will only be shown as Thai on systems configured for this particular encoding. This is one of the reasons why Unicode was invented: the bytes '\xba\xb9' could be decoded using the iso8859-1 encoding, which would be rendered as º¹, or using iso8859-11, as บน.
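A sketch of those alternative encodings, in Python 3 (using 'utf-16-le' explicitly so the result does not depend on the platform's byte order; the plain 'utf-16' codec additionally prepends a byte-order mark):

```python
text = '\u0e1a\u0e19'   # บน

# UTF-16, little-endian code units (no BOM with the -le variant)
assert text.encode('utf-16-le') == b'\x1a\x0e\x19\x0e'

# ISO 8859-11 (Thai): one byte per character, but only readable as Thai
# on systems configured for that encoding
thai = text.encode('iso8859-11')
assert thai == b'\xba\xb9'

# The same two bytes mean different things under different legacy encodings
assert thai.decode('iso8859-1') == '\xba\xb9'    # renders as º¹
assert thai.decode('iso8859-11') == text         # renders as บน
```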

In short, '\xe0\xb8\x9a\xe0\xb8\x99' is the UTF-8 encoding of the Unicode code points u'\u0e1a\u0e19' in Python syntax. Raw bytes (coming through the wire, or read from a file) are generally not in the form of Unicode code points, and they must be decoded into Unicode code points. Unicode code points are not an encoding, and when sent across the wire (or written to a file) they must be encoded into some kind of byte representation, which in many cases is UTF-8, as it has the greatest portability.

Lastly, you should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

'\xe0\xb8\x9a\xe0\xb8\x99' is simply a series of bytes. You have chosen to interpret that as UTF-8, and when you do, you can decode it into a series of Unicode characters, U+0E1A and U+0E19.

The sequence U+0E1A, U+0E19 can be represented as u'\u0e1a\u0e19', but in some sense that representation is as arbitrary as '\xe0\xb8\x9a\xe0\xb8\x99'. It is "natural", which is why Python prints them that way, but it is inefficient, which is why there are various other encoding schemes, including UTF-8.

In fact, it's slightly misleading for me to say "'\xe0\xb8\x9a\xe0\xb8\x99' is a series of bytes." It is the default representation of a series of bytes: 224 (0xE0), followed by 184 (0xB8), and so on.

Python has a notion of a series of bytes, and it has a separate notion of a series of Unicode characters. encode and decode represent one way of mapping between those two notions.
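The two notions are visible directly in Python 3, where indexing a bytes object yields the raw numeric byte values while decoding yields characters; a minimal sketch:

```python
raw = b'\xe0\xb8\x9a\xe0\xb8\x99'

# A bytes object is just numbers: 224, 184, 154, 224, 184, 153
assert raw[0] == 224 and raw[1] == 184

# decode maps bytes -> characters; ord() shows the code points behind them
text = raw.decode('utf-8')
assert [ord(c) for c in text] == [0x0E1A, 0x0E19]

# encode maps characters -> bytes, the inverse direction
assert text.encode('utf-8') == raw
```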

Does that help?
