
Decoding and re-encoding UTF-8 doesn't give back the original Unicode

When I try to separate two Unicode characters by decoding a UTF-8 string and encoding each character again, I do not get the original bytes back; I get different ones.

Here is the output when I try:

>>> s ='\xf0\x9f\x93\xb1\xf0\x9f\x9a\xac'
>>> u = s.decode("utf-8")
>>> u
u'\U0001f4f1\U0001f6ac'
>>> u[0].encode("utf-8")
'\xed\xa0\xbd'
>>> u[1].encode("utf-8")
'\xed\xb3\xb1'
>>> u[0]
u'\ud83d'
>>> u[1]
u'\udcf1'

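Your version of Python is a narrow build that uses UCS-2 (16 bits per code unit), but these particular Unicode characters lie above U+FFFF and need more than 16 bits. Each one is therefore stored as a pair of 16-bit surrogates, so each element of u represents only "half" of a character: u[0] and u[1] are the two halves of the first character. Calling u.encode('utf-8') on the whole string works properly because the codec understands the encoding and recombines each surrogate pair.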
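To make the "half of a character" point concrete, here is a minimal sketch written for Python 3, where strings are indexed by code point. It recovers the 16-bit units a narrow build would store and joins them back into code points. The helper names utf16_units and join_surrogates are made up for this illustration and are not part of any library.

```python
import struct

s = b'\xf0\x9f\x93\xb1\xf0\x9f\x9a\xac'
u = s.decode('utf-8')

def utf16_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    data = text.encode('utf-16-be')
    return list(struct.unpack('>%dH' % (len(data) // 2), data))

def join_surrogates(units):
    """Group UTF-16 code units into code points, joining surrogate pairs."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Each surrogate carries 10 bits of (code point - 0x10000).
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points

units = utf16_units(u)
code_points = join_surrogates(units)
print([hex(x) for x in units])        # ['0xd83d', '0xdcf1', '0xd83d', '0xdeac']
print([hex(x) for x in code_points])  # ['0x1f4f1', '0x1f6ac']
```

The first two units match the u[0] and u[1] values in the question, which is why encoding them one at a time yields two unrelated 3-byte sequences instead of the original 4-byte one.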

Your utf-8 string encodes these two characters:

U+1F4F1 MOBILE PHONE character (📱)

U+1F6AC SMOKING SYMBOL character (🚬)

(via this decoder: http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder )
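The same check can be done locally instead of with an online decoder. The following is a small sketch, assuming Python 3 (or any build where strings are indexed by full code points), that looks up each character's name with the standard unicodedata module and confirms that per-character encoding round-trips:

```python
import unicodedata

s = b'\xf0\x9f\x93\xb1\xf0\x9f\x9a\xac'
u = s.decode('utf-8')

# On a code-point-indexed build, iterating yields whole characters.
chars = list(u)
names = [unicodedata.name(ch) for ch in chars]
round_trip = [ch.encode('utf-8') for ch in chars]

for ch, name, raw in zip(chars, names, round_trip):
    print('U+%04X %s %r' % (ord(ch), name, raw))
```

Here each character encodes back to its original 4-byte UTF-8 sequence, and joining the pieces reproduces s.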
