Python: failing to encode bad unicode to ascii

Question

I have some Python code that's receiving a string with bad unicode in it. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it:

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It throws:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?

Answer 1

In Python 2.x, converting a byte string to a unicode instance is done with str.decode(), not str.encode():

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'
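
For context on why an encode call raises a *decode* error: in Python 2, calling encode() on a byte str first decodes it implicitly with the default ASCII codec, and that hidden step is what fails on 0xc2. The sketch below reproduces that step explicitly. It uses a b'' literal so the same lines run on Python 2.6+ and Python 3; the names failed_at and cleaned are illustrative only.

```python
# Byte string from the question; the b'' prefix is accepted by Python 2.6+ and 3.
s = b'ad\xc2-ven\xc2-ture'

# A strict ASCII decode fails on the first byte outside the 0-127 range.
try:
    s.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the offending byte

# Decoding explicitly with an error handler drops the non-ASCII bytes instead.
cleaned = s.decode('ascii', 'ignore')

print(failed_at)
print(cleaned)
```

The error position matches the "position 2" reported in the traceback from the question.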

Answer 2

You are confusing "unicode" and "utf-8". Your string s is not unicode; it's a bytestring in a particular encoding (not UTF-8; more likely iso-8859-1 or similar). Going from a bytestring to unicode is done by decoding the data, not encoding it. Going from unicode to a bytestring is encoding. Perhaps you meant to make s a unicode string:

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'
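
If the data really is iso-8859-1, as guessed above, the same UTF-8 bytes can be reached by decoding first and encoding second. This is a minimal sketch under that assumption; the encoding of the real input is not known from the question, and the variable names are made up for illustration.

```python
raw = b'ad\xc2-ven\xc2-ture'

# Assumption: the bytes are iso-8859-1 (Latin-1), where every byte maps
# to the code point with the same number, so decoding can never fail.
text = raw.decode('iso-8859-1')

# Encoding the resulting text as UTF-8 turns U+00C2 into two bytes.
utf8_bytes = text.encode('utf-8')

# Decoding with the matching codec gives the original text back.
round_trip = utf8_bytes.decode('utf-8')

print(repr(utf8_bytes))
print(round_trip == text)
```

Decoding with the right codec keeps the character instead of discarding it, which is usually preferable to 'ignore' when the source encoding is known.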

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'
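
To make the choice of error handler more concrete, here is a small comparison of the strict default, 'ignore', and 'replace' on the same bytes. It is only a sketch. Note that, as far as I can tell, the transcript above reflects the 2.6-era UTF-8 decoder, which also swallowed the byte following the stray 0xc2; current Python 3 interpreters drop only the invalid byte, so the hyphens survive there.

```python
raw = b'ad\xc2-ven\xc2-ture'

# Strict decoding (the default) rejects the stray 0xc2 lead byte.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' silently drops whatever cannot be decoded.
ignored = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD, which keeps the damage visible.
replaced = raw.decode('utf-8', 'replace')

print(strict_ok)
print(repr(ignored))
print(replaced.count(u'\ufffd'))
```

'replace' can be the safer choice while debugging, since it shows where bytes were lost rather than hiding them.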

Answer 1
10, accepted, 2011-05-25 13:09:40

Answer 2
8, 2011-05-25 13:09:54
