Why does Python 2 think these bytes are the microphone emoji but Python 3 doesn't?
I have some data in a database that a user entered as "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, it displays the emojis correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.
The bytes in Python 2:
>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
Decoding these as UTF-8 outputs this unicode string:
>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'
Printing that unicode string on my Mac displays the baseball and microphone emojis:
>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤
However, in Python 3, decoding the same bytes as UTF-8 gives me an error:
>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte
In particular, it seems to take issue with the last 6 bytes (the microphone emoji):
>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
Furthermore, other tools, like this online hex to Unicode converter, tell me these bytes are not a valid Unicode character:
https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4
Why do Python 2 and whatever program encoded the user's input think these bytes are the microphone emoji, but Python 3 and other tools do not?
It looks like there are a couple of web pages that will help answer your question:
If I decode the bytes you got from Python 2 using Python 3's "surrogatepass" error handler, that is:
b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is a surrogate pair that's supposed to stand in for the microphone emoji.
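To see where that surrogate pair comes from, it may help to unpack the six bytes by hand. The sketch below is only an illustration (the helper names decode_three_byte_sequence and combine_surrogates are mine, not part of any library): it reads each three-byte group using the ordinary UTF-8 bit layout, then joins the two resulting code points with the standard UTF-16 surrogate formula.

```python
def decode_three_byte_sequence(chunk: bytes) -> int:
    """Unpack one 3-byte sequence laid out as 1110xxxx 10xxxxxx 10xxxxxx."""
    b0, b1, b2 = chunk
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


raw = b'\xed\xa0\xbc\xed\xbe\xa4'
high = decode_three_byte_sequence(raw[:3])
low = decode_three_byte_sequence(raw[3:])
code_point = combine_surrogates(high, low)

print(hex(high), hex(low))  # 0xd83c 0xdfa4
print(hex(code_point))      # 0x1f3a4
```

Code points in the range U+D800 to U+DFFF are reserved for UTF-16 surrogates and are not valid in standard UTF-8, which is why Python 3's strict decoder rejects these bytes. Encoding each surrogate separately like this is the scheme usually called CESU-8.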
You can get back to the microphone in Python 3 by encoding the string containing the surrogate pair as UTF-16 with "surrogatepass" and then decoding it as UTF-16:
>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤
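The same round trip can be wrapped in a small function if you need to apply it to many values. This is a minimal sketch, assuming Python 3.4 or later (where the UTF-16 codec supports "surrogatepass"); the name repair_surrogate_utf8 is hypothetical.

```python
def repair_surrogate_utf8(raw: bytes) -> str:
    """Decode bytes that may contain UTF-8-encoded surrogate pairs."""
    # surrogatepass lets the surrogate code points through the UTF-8 decoder.
    text = raw.decode('utf-8', errors='surrogatepass')
    # Round-tripping through UTF-16 joins each high/low pair into one character.
    return text.encode('utf-16', errors='surrogatepass').decode('utf-16')


data = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
fixed = repair_surrogate_utf8(data)
print(fixed.encode('utf-8'))
# b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
```

Input that is already standard UTF-8 passes through unchanged, so the function is safe to run over mixed data. An unpaired surrogate would still raise UnicodeDecodeError at the final decode step.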
Try encoding the string u'BTS\u26be\ufe0f>BTS\U0001f3a4' as UTF-8 in Python 3:
text = u'BTS\u26be\ufe0f>BTS\U0001f3a4'
result = text.encode('utf_8')
print(result)
result.decode('utf_8')
result contains these bytes:
b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
These differ from the bytes you have in Python 2:
b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
But if you decode result, b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4', as UTF-8 in Python 3, you get the string you want.
In short, Python 2 and Python 3 handle these bytes differently, so store the text in the database as standard UTF-8 (the four-byte \xf0\x9f\x8e\xa4 form for the microphone), which both versions decode to the same string.
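If you want to find out which stored values are affected, one option is to test each value against Python 3's strict decoder. The following is a rough sketch; is_strict_utf8 is a hypothetical helper, and how you fetch the rows depends on your database.

```python
def is_strict_utf8(raw: bytes) -> bool:
    """Return True if raw decodes under Python 3's strict UTF-8 rules."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


standard = 'BTS\u26be\ufe0f>BTS\U0001f3a4'.encode('utf-8')
legacy = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

print(is_strict_utf8(standard))    # True
print(is_strict_utf8(legacy))      # False
print(len(standard), len(legacy))  # 17 19
```

The two-byte length difference comes from the microphone alone: four bytes in standard UTF-8 versus six when each surrogate is encoded separately.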