Why does Python 2 think these bytes are the microphone emoji but Python 3 doesn't?

I have some data in a database which was entered by a user as "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, it displays the emojis correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.

The bytes in Python 2:

>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

Decoding these as UTF-8 outputs this unicode string:

>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'

Printing that unicode string on my Mac displays the baseball and microphone emojis:

>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤

However, in Python 3, decoding the same bytes as UTF-8 gives me an error:

>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte

In particular, it seems to take issue with the last 6 bytes (the microphone emoji):

>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Furthermore, other tools, like this online hex to Unicode converter, tell me these bytes are not a valid Unicode character:

https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4

Why do Python 2 and whatever program encoded the user's input think these bytes are the microphone emoji, but Python 3 and other tools do not?

It looks like there are a couple of web pages that will help answer your question:

If I decode the bytes you got from Python 2 using Python 3's "surrogatepass" error handler, that is:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8',
    errors = 'surrogatepass')

then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is a surrogate pair that's supposed to stand in for the microphone emoji.
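To see where the surrogate pair comes from, here is a minimal sketch that applies the 3-byte UTF-8 bit layout to each half of the last six bytes by hand. The helper name `decode_three_byte_sequence` is made up for illustration. Encoding each UTF-16 surrogate as its own 3-byte sequence is the scheme usually called CESU-8; standard UTF-8 forbids it, which is why Python 3's strict decoder rejects these bytes.

```python
def decode_three_byte_sequence(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) with no validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# Each half of the six trailing bytes yields one UTF-16 surrogate code point.
high = decode_three_byte_sequence(b'\xed\xa0\xbc')
low = decode_three_byte_sequence(b'\xed\xbe\xa4')
print(hex(high), hex(low))  # 0xd83c 0xdfa4

# Combine the high and low surrogates the way UTF-16 does.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f3a4, the microphone emoji
```

Both intermediate values fall in the surrogate range U+D800 to U+DFFF, which is reserved for UTF-16 and is not allowed to appear in well-formed UTF-8.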

You can get back to the microphone emoji in Python 3 by encoding the string containing the surrogate pair as UTF-16 with the "surrogatepass" error handler and then decoding the result as UTF-16:

>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤

Try encoding the string u'BTS\u26be\ufe0f>BTS\U0001f3a4' as UTF-8 in Python 3:

text = u'BTS\u26be\ufe0f>BTS\U0001f3a4'
result = text.encode('utf_8')
print(result)
result.decode('utf_8')

The result contains these bytes:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'

These differ from the bytes you have in Python 2:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

But if you decode result (b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4') as UTF-8 in Python 3, you get the string you want.

In short, Python 2 and Python 3 handle these bytes differently, so you should store the standard UTF-8 encoding in the database, since it is unambiguous.
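As a rough sketch of how existing rows could be normalized before being written back, the function below leaves valid UTF-8 untouched and re-encodes surrogate-encoded bytes into standard UTF-8. The name `to_standard_utf8` is hypothetical, and the sketch assumes the only defect in the stored data is surrogate pairs encoded as 3-byte sequences. Any other invalid input still raises an error.

```python
def to_standard_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else repair encoded surrogate pairs."""
    try:
        raw.decode('utf-8')  # strict check: already standard UTF-8?
        return raw
    except UnicodeDecodeError:
        # Let the encoded surrogates through as lone code points...
        text = raw.decode('utf-8', errors='surrogatepass')
        # ...then round-trip through UTF-16 so each pair merges into one character.
        text = text.encode('utf-16', errors='surrogatepass').decode('utf-16')
        return text.encode('utf-8')

legacy = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
standard = to_standard_utf8(legacy)
print(standard)  # b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
print(to_standard_utf8(standard) == standard)  # True: already valid, left as-is
```

Trying the strict decode first keeps the function safe to run more than once over the same data. An unpaired surrogate would make the UTF-16 decode step fail, which is probably preferable to silently storing a broken string.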
