Python throws UnicodeEncodeError although I am doing str.decode(). Why?
Consider this function:
def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)
It should escape all non-ASCII characters with the corresponding htmlentitydefs entities. Unfortunately, Python throws
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.
But I don't use str.encode(). I only use str.decode(). Am I missing something?
It's a misleading error report that comes from the way Python handles the decoding/encoding process. You tried to decode an already decoded string a second time, and that confuses the Python function, which retaliates by confusing you in turn! ;-) As far as I know, the encoding and decoding are carried out by the codecs module, and somewhere in there lies the origin of this misleading exception message.
You can check for yourself: either
u'\x80'.encode('ascii')
or
u'\x80'.decode('ascii')
will throw a UnicodeEncodeError, whereas
u'\x80'.encode('utf8')
will not, but
u'\x80'.decode('utf8')
again will!
I guess you are confused by the meaning of encoding and decoding. To put it simply:
                    decode             encode
ByteString (ascii) --------> UNICODE ---------> ByteString (utf8)
                    codec              codec
But why is there a codec argument for the decode method? Well, the underlying function cannot guess which codec the ByteString was encoded with, so it takes codec as an argument to serve as a hint. If it is not provided, it assumes you mean sys.getdefaultencoding() to be used implicitly.
So when you use c.decode('ascii'), you a) have an (encoded) ByteString (that's why you use decode), b) want to get a Unicode object (that's what you use decode for), and c) state that the codec in which the ByteString is encoded is ASCII.
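The diagram and the three points above can be sketched in code. This is only an illustration and not part of the original answer: the sample bytes and the choice of Latin-1 as the source codec are assumptions made for the example, and the b''/u'' literals are used so that it should run the same way on Python 2.7 and Python 3.3+.

```python
# Hypothetical sample data: "Tamás" encoded as Latin-1 bytes.
raw = b'Tam\xe1s'

# decode: byte string -> Unicode. The codec says how the bytes were encoded.
text = raw.decode('latin-1')

# encode: Unicode -> byte string, here with a different codec.
utf8_bytes = text.encode('utf-8')

# A wrong codec hint fails: 0xE1 is not a valid ASCII byte.
try:
    raw.decode('ascii')
    wrong_hint_failed = False
except UnicodeDecodeError:
    wrong_hint_failed = True
```

Note that the failure in the last step is a UnicodeDecodeError, which is the error you would expect from decode when the input really is a byte string.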
See also:
https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it, and it does so by default using the ASCII encoding.
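The implicit step described above can be spelled out explicitly. The sketch below is an illustration under stated assumptions: the helper names py2_unicode_decode and error_name are made up for this example, the default encoding is taken to be ASCII (the usual Python 2 default), and the code is written so that it also runs on Python 3, where str no longer has a decode method.

```python
def py2_unicode_decode(text, codec, default_encoding='ascii'):
    """Mimic what Python 2 does for unicode.decode(codec) (hypothetical helper)."""
    # Step 1: implicit encode to a byte string with the default encoding.
    # This is the step that raises UnicodeEncodeError for non-ASCII text.
    raw = text.encode(default_encoding)
    # Step 2: the decode that was actually requested.
    return raw.decode(codec)


def error_name(text):
    """Return the name of the exception raised for text, or None."""
    try:
        py2_unicode_decode(text, 'ascii')
    except UnicodeError as exc:
        return type(exc).__name__
    return None
```

With this helper, error_name(u'\xe1') reports 'UnicodeEncodeError' even though the caller only asked for a decode, which matches the confusing message in the question. Pure ASCII input such as u'a' goes through both steps without error.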
Edit to add: It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').
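As a quick illustration of that call, using the string from the question as sample input (the variable name escaped is only for the example):

```python
text = u'Tam\xe1s Horv\xe1th'

# Characters that ASCII cannot represent are replaced by numeric
# character references (&#NNN;). The result is a byte string.
escaped = text.encode('ascii', 'xmlcharrefreplace')
```

Note that this yields numeric references such as &#225; rather than named entities such as &aacute;. Browsers render both the same way, so this matters only if the named form is specifically required.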
This answer always works for me when I have this problem:
def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key): byteify(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
From: How to get string objects instead of Unicode ones from JSON in Python?
Python has two types of strings: character strings (the unicode type) and byte strings (the str type). The code you have pasted operates on byte strings. You need a similar function to handle character strings.
Maybe this:
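import htmlentitydefs  # Python 2 module

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        # Escape everything outside the printable ASCII range.
        if (ord(c) < 32) or (ord(c) > 126):
            name = htmlentitydefs.codepoint2name.get(ord(c))
            # Use a numeric reference when no named entity exists,
            # instead of failing with a KeyError.
            c = '&{};'.format(name) if name else '&#{};'.format(ord(c))
        escaped_chars.append(c)
    return ''.join(escaped_chars)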
I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character strings directly to the API and have it figure out how to set the encoding.
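A minimal sketch of that approach, assuming the page is assembled from Unicode fragments. The names render_page and fragments are hypothetical and not tied to any particular web framework:

```python
def render_page(fragments):
    """Join Unicode fragments and encode once, at the output boundary."""
    content = u''.join(fragments)   # all processing stays in Unicode
    return content.encode('UTF-8')  # bytes are produced only for delivery


body = render_page([u'<p>', u'Tam\xe1s Horv\xe1th', u'</p>'])
```

For this to display correctly, the response would also have to declare UTF-8 as its character encoding, for example in the Content-Type header. How that is done depends on the framework.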
Calling decode on a string that is already Unicode makes no sense. I think you can check ord(c) > 127 instead.
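One way the ord(c) > 127 check could look is sketched below. This is an illustration rather than the answerer's code: the function name escape_non_ascii is made up, the numeric fallback for characters without a named entity is an added assumption, and the import is written to work on both Python 2 (htmlentitydefs) and Python 3 (html.entities).

```python
try:
    from html.entities import codepoint2name  # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def escape_non_ascii(text):
    """Replace non-ASCII characters in a Unicode string with HTML entities."""
    escaped_chars = []
    for c in text:
        code = ord(c)
        if code > 127:
            name = codepoint2name.get(code)
            # Named entity if one exists, numeric reference otherwise.
            c = u'&{0};'.format(name) if name else u'&#{0};'.format(code)
        escaped_chars.append(c)
    return u''.join(escaped_chars)
```

Because the function never calls decode, it does not trigger the implicit ASCII encode, so the input from the question, u'Tam\xe1s Horv\xe1th', is escaped without raising an error.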