
Python - 'ascii' codec can't decode byte

I'm really confused. I tried to encode, but the error said "can't decode".

>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

I know how to avoid the error with the "u" prefix on the string. I'm just wondering why the error is "can't decode" when encode was called. What is Python doing under the hood?

"你好".encode('utf-8')

encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So Python has to convert the string to a unicode object first. So it does the equivalent of

"你好".decode().encode('utf-8')

But the decode fails because the string isn't valid ASCII. That's why you get a complaint about not being able to decode.
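To make the two hidden steps visible, here is a minimal sketch written for Python 3, where nothing is converted implicitly and the decode has to be spelled out. The helper name `implicit_encode` is made up for illustration; it only imitates what the answer above describes Python 2 doing, and it assumes the default codec is 'ascii'.

```python
# Python 3 sketch: perform by hand the decode-then-encode that Python 2 does implicitly.

# The bytes a Python 2 str literal "你好" would hold in a UTF-8 source file.
raw = "你好".encode("utf-8")


def implicit_encode(byte_string, target="utf-8", default="ascii"):
    """Hypothetical helper: decode with the default codec, then encode to target."""
    return byte_string.decode(default).encode(target)


try:
    implicit_encode(raw)
    error_message = None
except UnicodeDecodeError as exc:
    # The failure comes from the decode step, even though the caller asked to encode.
    error_message = str(exc)

print(error_message)
```

Pure-ASCII input such as `implicit_encode(b"hello")` passes through unchanged, which is why this problem often goes unnoticed until non-ASCII data shows up.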

Always encode from unicode to bytes.
In this direction, you get to choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.

>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好
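The asymmetry between the two directions can be sketched as follows. This is Python 3 code (the sessions above are Python 2), and the choice of GBK and Latin-1 as the "other" encodings is only an assumption for illustration.

```python
# Python 3 sketch: you choose the encoding when encoding,
# but you must already know it when decoding.
text = "你好"

# Encoding: the same text yields different bytes depending on the codec you pick.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decoding with the matching codec recovers the original text.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_gbk = gbk_bytes.decode("gbk")

# Decoding with the wrong codec may not raise at all; Latin-1 maps every byte
# to some character, so the result is silently garbled.
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes != gbk_bytes)
print(round_trip_utf8 == text, round_trip_gbk == text)
print(garbled == text)
```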

This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:

  • A unicode object is decoded already; you never want to call decode on it.
  • A bytestring object is encoded already; you never want to call encode on it.

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get a UnicodeDecodeError when you've called encode. It's because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there's an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default 'ascii' decoder, giving you the decoding error inside an encoder.

In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a [controversial] attempt to avoid this common confusion.
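A quick way to check this on a Python 3 interpreter is sketched below; the variable names are only illustrative.

```python
# Python 3 sketch: text (str) can only be encoded, bytes can only be decoded.
str_can_encode = hasattr(str, "encode")      # text -> bytes exists
str_can_decode = hasattr(str, "decode")      # removed in Python 3
bytes_can_decode = hasattr(bytes, "decode")  # bytes -> text exists
bytes_can_encode = hasattr(bytes, "encode")  # removed in Python 3

try:
    b"\xe4\xbd\xa0".encode("utf-8")
    failure = None
except AttributeError as exc:
    # The mistake now fails immediately, instead of triggering a hidden ASCII decode.
    failure = type(exc).__name__

print(str_can_encode, str_can_decode, bytes_can_decode, bytes_can_encode, failure)
```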

...or whatever coding sys.getdefaultencoding() mentions; usually this is 'ascii'

You can try this:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Or

You can also try the following:

Add the following line at the top of your .py file.

# -*- coding: utf-8 -*- 

If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:

Python 2.7.2 (default, Jan 14 2012, 23:14:09) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

Further reading: Unicode HOWTO.

You use u"你好".encode('utf8') to encode a unicode string. But if you want to represent "你好", you should decode it. Just like:

"你好".decode("utf8")

You will get what you want. Maybe you should learn more about encode & decode.

In case you're dealing with Unicode, sometimes instead of encode('utf-8'), you can also try to ignore the special characters, e.g.

u"你好".encode('ascii','ignore')

or something.decode('unicode_escape').encode('ascii','ignore'), as suggested here.

Not particularly useful in this example, but it can work better in other scenarios when it's not possible to convert some special characters.

Alternatively, you can consider replacing particular characters using replace().
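For comparison, here is a small Python 3 sketch of what the standard error handlers do when the target codec cannot represent a character. The sample string and the "hello" substitution are arbitrary choices for illustration.

```python
# Python 3 sketch: encoding text that ASCII cannot represent, with different error handlers.
text = "你好, world"

ignored = text.encode("ascii", "ignore")            # drops unencodable characters
replaced = text.encode("ascii", "replace")          # substitutes "?" for each one
escaped = text.encode("ascii", "backslashreplace")  # keeps them as escape sequences

# str.replace() swaps specific characters before encoding.
substituted = text.replace("你好", "hello").encode("ascii")

print(ignored)      # b', world'
print(replaced)     # b'??, world'
print(escaped)      # b'\\u4f60\\u597d, world'
print(substituted)  # b'hello, world'
```

Note that 'ignore' and 'replace' lose information, so they are only appropriate when the dropped characters really do not matter.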

If you are starting the Python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.

Call locale charmap from the shell (not the Python interpreter) and you should see

[user@host dir] $ locale charmap
UTF-8
[user@host dir] $ 

If this is not the case, and you see something else, e.g.

[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $ 

Python will (at least in some cases such as in mine) inherit the shell's encoding and will not be able to print (some? all?) unicode characters. Python's own default encoding that you see and control via sys.getdefaultencoding() and sys.setdefaultencoding() is in this case ignored.

If you find that you have this problem, you can fix it with
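To see from inside the interpreter which encodings are in play, something like the following Python 3 sketch may help. The function name `describe_encodings` is invented for this example, and the values it reports depend entirely on your environment.

```python
# Python 3 sketch: report the encodings the interpreter picked up from its environment.
import locale
import sys


def describe_encodings():
    """Hypothetical helper: collect the encodings relevant to printing text."""
    return {
        # Codec used for implicit str/bytes conversions ('utf-8' on Python 3).
        "default": sys.getdefaultencoding(),
        # Encoding of standard output; may be None if stdout was replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings inherited from the shell.
        "locale": locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print(name, "=", value)
```

If the stdout or locale entry shows an ASCII-only encoding such as ANSI_X3.4-1968, printing non-ASCII text is likely to fail regardless of what the default encoding says.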

[user@host dir] $ export LC_CTYPE="en_EN.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $ 

(Or choose whichever locale you want instead of en_EN.) You can also edit /etc/locale.conf (or whichever file governs the locale definition in your system) to correct this.
