简体   繁体   English

Python 未能将错误的 unicode 编码为 ascii

[英]Python failing to encode bad unicode to ascii

I have some Python code that's receiving a string with bad unicode in it.我有一些 Python 代码正在接收包含错误 unicode 的字符串。 When I try to ignore the bad characters, Python still chokes (version 2.6.1).当我试图忽略坏字符时,Python 仍然窒息(版本 2.6.1)。 Here's how to reproduce it:以下是如何重现它:

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It throws它抛出

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?我究竟做错了什么?

Converting a string to a unicode instance is str.decode() in Python 2.x:将字符串转换为 unicode 实例是 Python 2.x 中的str.decode()

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'

You are confusing "unicode" and "utf-8".您混淆了“unicode”和“utf-8”。 Your string s is not unicode;您的字符串s不是 unicode; it's a bytestring in a particular encoding (but not UTF-8, more likely iso-8859-1 or such.) Going from a bytestring to unicode is done by decoding the data, not encoding .它是特定编码的字节串(但不是 UTF-8,更可能是 iso-8859-1 等。)从字节串到unicode是通过解码数据而不是编码来完成的。 Going from unicode to bytestring is encoding.从 unicode 到 bytestring 是编码。 Perhaps you meant to make s a unicode string:也许您打算制作s一个 unicode 字符串:

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:或者您可能希望将字节串视为 UTF-8 但忽略无效序列,在这种情况下,您将使用“忽略”作为错误处理程序来解码字节串:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM