Python ASCII and Unicode decode error

I got this very frustrating error when inserting a certain string into my database. It said something like:

"Python cannot decode byte characters, expecting unicode"

After a lot of searching, I saw that I could overcome this error by encoding my string into Unicode. I tried to do this by decoding the string first and then encoding it in UTF-8 format, like:

string = string.encode("utf8")

And I get the following error:

'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)

I have been dying with this error! How do I fix it?

You need to take a disciplined approach. "Pragmatic Unicode, or How Do I Stop The Pain?" has everything you need.

If you get that error on that line of code, then the problem is that string is a byte string, and Python 2 is implicitly trying to decode it to Unicode for you. But it isn't pure ASCII. You need to know what the encoding is, and decode it properly.

The encode method should be used on unicode objects to convert them to a str object with a given encoding. The decode method should be used on str objects of a given encoding to convert them to unicode objects.

I suppose that your database stores strings in UTF-8. So when you get strings from the database, convert them to unicode objects by doing str.decode('utf-8'). Then only use unicode objects in your Python program (literals are defined with u'unicode string'). And just before storing them in your database, convert them to str objects with uni.encode('utf-8').
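The failure can be reproduced without a database. The sketch below is an illustration, not the asker's code: the sample bytes are an assumption, chosen so that a 0xe3 byte sits at position 6, and the explicit decode('ascii') stands in for the decode that Python 2 performs implicitly when .encode() is called on a byte string. It is written to run on both Python 2.7 and Python 3.

```python
# Assumed sample data: "Hello " followed by the UTF-8 bytes of one Japanese character.
raw = b'Hello \xe3\x81\x82'

# Python 2 performs the equivalent of this ASCII decode behind the scenes
# when raw.encode('utf8') is called on a byte string.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes are really in succeeds,
# and the resulting text can then be encoded safely.
text = raw.decode('utf-8')
utf8_again = text.encode('utf-8')
```

Here error_message holds the same wording as the error in the question, while the explicit UTF-8 decode goes through cleanly.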
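As a quick illustration of which method goes in which direction, here is a minimal sketch. The sample word is an assumption. The b'' and u'' prefixes are used so that the snippet behaves the same on Python 2.7 (where b'' is a str) and on Python 3 (where the byte type is called bytes and the text type is called str).

```python
encoded = b'caf\xc3\xa9'              # 5 bytes: UTF-8 encoding of a 4-character word

decoded = encoded.decode('utf-8')     # bytes -> text
assert decoded == u'caf\xe9'
assert len(decoded) == 4              # length counted in characters

round_trip = decoded.encode('utf-8')  # text -> bytes
assert round_trip == encoded
assert len(round_trip) == 5           # length counted in bytes
```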
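The decode-on-input, encode-on-output pattern described above could look roughly like this. Everything named here (fake_db, fetch_name, store_name) is hypothetical and stands in for real database calls. Whether a real driver returns bytes or already-decoded text depends on the driver and its configuration, so check that before adding a decode step.

```python
# Hypothetical stand-in for a database column that holds UTF-8 bytes.
fake_db = [b'Jos\xc3\xa9']


def fetch_name(row_index):
    """Input boundary: decode bytes from storage into text right away."""
    return fake_db[row_index].decode('utf-8')


def store_name(name):
    """Output boundary: encode text to UTF-8 bytes just before storing."""
    fake_db.append(name.encode('utf-8'))


# Inside the program, only text objects are handled.
name = fetch_name(0)
greeting = u'Ol\xe1, ' + name
store_name(greeting)
```

Because bytes and text only meet at the two boundary functions, no implicit ASCII conversion is ever triggered in the middle of the program.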

EDIT: As you can see from the downvotes, this is NOT THE BEST WAY TO DO IT. An excellent and highly recommended answer is immediately after this one, so if you are looking for a good solution, please use that. This is a hackish solution that will not be kind to you later on.

I feel your pain; I've had a lot of problems with the same error. The simplest way I solved it (this might not be the best way, and it depends on your application) was to convert things to unicode and ignore errors. Here's an example from Unicode HOWTO - Python v2.7.3 documentation:

>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                    ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'

While this might not be the most expedient method, it is one that has worked for me.

EDIT:

A couple of people in the comments have mentioned that this is a bad idea, even though the asker accepted the answer. It is NOT a great idea: it will screw things up if you are dealing with European and accented characters. However, it is something you can use if it is NOT production-level code, if it is a personal project you are working on and you need a quick fix to get things rolling. You will eventually need to fix it with the right methods, which are mentioned in the answers below.
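To make the damage concrete, here is a small sketch with an assumed sample name containing two accented letters. It uses the decode method with a positional error handler, which behaves the same on Python 2.7 and Python 3.

```python
raw = b'Jos\xc3\xa9 Ni\xc3\xb1o'   # assumed sample: UTF-8 bytes with two accented letters

ignored = raw.decode('ascii', 'ignore')    # silently drops every non-ASCII byte
replaced = raw.decode('ascii', 'replace')  # substitutes U+FFFD for each undecodable byte
correct = raw.decode('utf-8')              # the right codec keeps everything

assert ignored == u'Jos Nio'
assert replaced == u'Jos\ufffd\ufffd Ni\ufffd\ufffdo'
assert correct == u'Jos\xe9 Ni\xf1o'
```

With 'ignore' the accented letters vanish without any warning, and with 'replace' they turn into placeholder characters. In both cases the original text cannot be recovered afterwards.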

The 0xE3 code point is an 'a' with a tilde (ã) in Unicode. Your original string is most likely already in UTF-8, so you can't decode it using the default ASCII codec.

A string (str) in Python 2.7 is an encoded byte string (mostly ASCII in practice), not a character string (unicode).

So when you do string.encode('some encoding'), you are actually trying to encode a string that is already encoded in some encoding.

Python first has to decode that string using the default encoding (ASCII in Python 2.7), and only then can it encode the result. Your string is not encoded in ASCII but in some other encoding (UTF-8, Latin-1, ...), so when Python tries to decode it as ASCII, it throws an error because the ASCII codec cannot decode the bytes in your string that fall outside the ASCII range (0 - 127).
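One caveat about that byte, offered as a sketch rather than a diagnosis of the asker's data: the code point U+00E3 is ã, but which bytes represent it depends on the encoding. The checks below run on both Python 2.7 and Python 3.

```python
a_tilde = u'\xe3'   # code point U+00E3

# In Latin-1 the character is the single byte 0xE3.
assert a_tilde.encode('latin-1') == b'\xe3'

# In UTF-8 it takes two bytes, and neither of them is 0xE3.
assert a_tilde.encode('utf-8') == b'\xc3\xa3'

# In UTF-8, a 0xE3 byte starts a three-byte sequence instead.
three_byte_char = b'\xe3\x81\x82'.decode('utf-8')
assert three_byte_char == u'\u3042'
```

So a 0xe3 byte in the error message is consistent with either Latin-1 text containing ã or UTF-8 text containing a character encoded in three bytes. Either way it is not ASCII, which is why the implicit decode fails.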

# To re-encode the byte string above, first decode it using its actual encoding
decoded_string = string.decode('utf8')
# Now encode the resulting unicode object
encoded_string = decoded_string.encode('utf8')
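If the same code path may receive either byte strings or unicode objects, one option is a small guard that only decodes when it has to. The helper name to_utf8_bytes and its source_encoding parameter are hypothetical, and the default of UTF-8 for incoming bytes is an assumption you would need to verify against your own data.

```python
def to_utf8_bytes(value, source_encoding='utf-8'):
    """Return UTF-8 bytes for either text or already-encoded bytes.

    Hypothetical helper: source_encoding is the encoding that byte
    input is believed to be in, which the caller must verify.
    """
    if isinstance(value, bytes):
        # Explicit decode with a known codec, never the ASCII default.
        value = value.decode(source_encoding)
    return value.encode('utf-8')


from_text = to_utf8_bytes(u'caf\xe9')
from_utf8 = to_utf8_bytes(b'caf\xc3\xa9')
from_latin1 = to_utf8_bytes(b'caf\xe9', 'latin-1')
```

All three calls produce the same UTF-8 bytes, and the implicit ASCII decode never gets a chance to run.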
