Python: UnicodeDecodeError with ASCII default encoding

Question

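I'm doing some text processing in Python 2.7 with the default encoding of ASCII. I get a UnicodeDecodeError when I try to encode some strings as utf-8. Specifically, for each word in the document, I do this: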

word = word.encode('utf-8')

This works fine when all of my characters are ASCII, but when they aren't, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 5: ordinal not in range(128)

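I'm confused because I thought calling encode would convert ASCII to utf-8. Since utf-8 is a superset of ASCII, I shouldn't have any problems... but I do.

Also, I'm not sure why it says ASCII can't decode, when what I'm asking for is to encode my word as utf-8.

Any help would be great!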

Answer 1

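You encode to byte strings and decode to Unicode strings. So to encode to a UTF-8 byte string, start with a Unicode string. If you start with a byte string instead, Python 2.7 first implicitly decodes it to Unicode using the default ASCII codec, and only then encodes the result. If your byte string contains non-ASCII bytes, that hidden decode step fails with UnicodeDecodeError, which is why the message talks about decoding even though you called encode.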
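
The implicit step can be spelled out by hand. The sketch below is only an illustration: the sample bytes are a hypothetical input, and it assumes the original data is UTF-8 encoded, which the question does not state. It should behave the same way on Python 2.7 and Python 3.

```python
# Hypothetical sample input: "café" stored as UTF-8 bytes.
word = b'caf\xc3\xa9'

# Roughly what Python 2.7 does for word.encode('utf-8') on a byte string:
# decode with the ASCII codec first, then encode the result.
error = None
try:
    word.decode('ascii').encode('utf-8')
except UnicodeDecodeError as exc:
    error = str(exc)

# Explicit alternative: decode with the real source encoding (assumed
# to be UTF-8 here), then encode the Unicode string.
text = word.decode('utf-8')
encoded = text.encode('utf-8')

print(error)
print(repr(encoded))
```

The decode call needs the encoding the bytes were actually written in; if the source data is not UTF-8, substitute the correct codec name.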

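Python 3 removes the implicit decode to Unicode when you start with a byte string. In fact, .encode() isn't available on byte strings and .decode() isn't available on Unicode strings. Python 3 also changes the default encoding to UTF-8.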
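
A quick way to check these points on a Python 3 interpreter is sketched below. It assumes Python 3 only; the variable names are made up for illustration.

```python
import sys

# In Python 3, bytes has no .encode() and str has no .decode().
bytes_has_encode = hasattr(b'caf\xc3\xa9', 'encode')
str_has_decode = hasattr('caf\u00e9', 'decode')

# The default string encoding is UTF-8 rather than ASCII.
default_encoding = sys.getdefaultencoding()

# Mixing str and bytes raises TypeError instead of decoding implicitly.
mixing_error = None
try:
    'caf' + b'\xc3\xa9'
except TypeError as exc:
    mixing_error = exc

print(bytes_has_encode, str_has_decode, default_encoding)
```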

Examples:

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café'.encode('utf8')  # Started with a byte string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 3: ordinal not in range(128)
>>> u'café'.encode('utf8')  # Started with Unicode string
'caf\xc3\xa9'

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café'.encode()  # Starting with a Unicode string, default UTF-8.
b'caf\xc3\xa9'
>>> 'café'.decode()  # You can only *encode* Unicode strings.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
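
For the per-word loop in the question, one possible approach is a small helper that decodes byte strings explicitly before encoding. This is a sketch, not a drop-in fix: to_utf8_bytes and source_encoding are hypothetical names, and the source encoding of the document has to be known or guessed by the caller.

```python
def to_utf8_bytes(word, source_encoding='utf-8'):
    """Return word as UTF-8 bytes.

    source_encoding is an assumption about how byte-string input was
    encoded; it is ignored when word is already a Unicode string.
    """
    if isinstance(word, bytes):
        # Explicit decode replaces Python 2's implicit ASCII decode.
        word = word.decode(source_encoding)
    return word.encode('utf-8')


# Hypothetical inputs: a Unicode string and Latin-1 encoded bytes.
from_unicode = to_utf8_bytes(u'caf\xe9')
from_latin1 = to_utf8_bytes(b'caf\xe9', source_encoding='latin-1')

print(repr(from_unicode))
print(repr(from_latin1))
```

On Python 2.7, bytes is an alias for str, so the isinstance check also catches ordinary byte strings there.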

Further reading: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Solution 1
2 2018-07-03 03:19:07