Why does Python 2 think these bytes are the microphone emoji, but Python 3 does not?

Question

I have some data in a database that a user entered as "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, the emojis display correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.

The bytes in Python 2:

>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

Decoding these as UTF-8 outputs this unicode string:

>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'

Printing that unicode string on my Mac displays the baseball and microphone emojis:

>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤

However, in Python 3, decoding the same bytes as UTF-8 gives me an error:

>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte

In particular, it seems to take issue with the last 6 bytes (the microphone emoji):

>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Furthermore, other tools, like this online hex to Unicode converter, tell me these bytes are not a valid Unicode character:

https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4

Why do Python 2 and whatever program encoded the user's input think these bytes are the microphone emoji, but Python 3 and other tools do not?

Answer 1

It looks like there are a couple of web pages that will help answer your question:

https://bugs.python.org/issue9133 (Relates to Python 2's overly permissive UTF-8 handling)
How to work with surrogate pairs in Python? (Relates to dealing with that permissiveness)

If I decode the bytes you got from Python 2 using Python 3's "surrogatepass" error handler, that is:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8',
    errors = 'surrogatepass')

then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is a surrogate pair that's supposed to stand in for the microphone emoji.
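To see why those six bytes stand for the microphone, the surrogate arithmetic can be worked by hand. Below is a minimal Python 3 sketch; the variable names are illustrative and not from the question. Each 3-byte group is a UTF-16 surrogate written out as if it were an ordinary code point (a layout often called CESU-8), which strict UTF-8 does not allow.

```python
# The six trailing bytes from the question.
raw = b'\xed\xa0\xbc\xed\xbe\xa4'

# 'surrogatepass' lets the UTF-8 codec hand back surrogate code points
# instead of raising UnicodeDecodeError.
high, low = raw.decode('utf_8', errors='surrogatepass')
assert 0xD800 <= ord(high) <= 0xDBFF  # high (lead) surrogate
assert 0xDC00 <= ord(low) <= 0xDFFF   # low (trail) surrogate

# Standard UTF-16 formula for combining a surrogate pair.
code_point = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
print(hex(ord(high)), hex(ord(low)), hex(code_point))
# prints: 0xd83c 0xdfa4 0x1f3a4
```

U+1F3A4 is the microphone emoji, which matches the \U0001f3a4 that Python 2 printed in the question.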

You can get back to the microphone in Python 3 by encoding the string with surrogate pairs as UTF-16 with the "surrogatepass" error handler and then decoding it as UTF-16:

>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤
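The round trip above could be wrapped in a small helper. This is only a sketch: `decode_lenient_utf8` is a hypothetical name, and trying strict UTF-8 first is my assumption about the desired behaviour, not something the answer prescribes.

```python
def decode_lenient_utf8(data: bytes) -> str:
    """Decode UTF-8, tolerating surrogate pairs stored as 3-byte sequences.

    Hypothetical helper for illustration only.
    """
    try:
        # Well-formed UTF-8 needs no special handling.
        return data.decode('utf_8')
    except UnicodeDecodeError:
        # Let the surrogates through as individual code points...
        text = data.decode('utf_8', errors='surrogatepass')
        # ...then have the UTF-16 codec join each pair into one character.
        return text.encode('utf_16', errors='surrogatepass').decode('utf_16')


data = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
print(decode_lenient_utf8(data) == 'BTS\u26be\ufe0f>BTS\U0001f3a4')  # True
```

A lone, unpaired surrogate would still make the final UTF-16 decode raise UnicodeDecodeError, so input that is genuinely malformed is not silently accepted.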

Answer 2

Try encoding the string u'BTS\u26be\ufe0f>BTS\U0001f3a4' as UTF-8 again in Python 3:

text = u'BTS\u26be\ufe0f>BTS\U0001f3a4'
result = text.encode('utf_8')
print(result)
result.decode('utf_8')

The result contains these bytes:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'

These differ from the bytes you have in Python 2:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

But if you decode the result, b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4', as UTF-8 in Python 3, you will get the result you want.

In short, Python 2 and Python 3 handle these bytes differently, so you should store the re-encoded bytes in the database, since that form is unambiguous.
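If the rows are already stored in the six-byte form, one way to rewrite them as standard UTF-8 is sketched below. The function name `to_standard_utf8` is hypothetical, and how the bytes are read from and written back to the database is left out because the question does not say.

```python
def to_standard_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that hold surrogate pairs as standard UTF-8.

    Hypothetical helper for illustration only.
    """
    # Accept the 3-byte surrogate sequences rather than failing on them.
    text = raw.decode('utf_8', errors='surrogatepass')
    # A UTF-16 round trip merges each surrogate pair into one code point.
    text = text.encode('utf_16', errors='surrogatepass').decode('utf_16')
    # The strict encoder now emits the 4-byte form for the emoji.
    return text.encode('utf_8')


legacy = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
clean = to_standard_utf8(legacy)
print(clean)
# prints: b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
```

The output matches the bytes Python 3 produced above, and they decode with a plain `.decode('utf_8')` without any special error handler.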

Answer 1: score 5, accepted, 2019-08-16 18:48:23

Answer 2: score 1, 2019-08-16 17:49:49
