Python throws UnicodeEncodeError although I am doing str.decode(). Why?

Consider this function:

import htmlentitydefs

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

It should escape all non-ASCII characters with the corresponding htmlentitydefs names. Unfortunately, Python throws

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.

But I don't use str.encode(). I only use str.decode(). Am I missing something?

It's a misleading error report that comes from the way Python handles the encoding/decoding process. You tried to decode an already decoded string a second time, and that confuses the Python function, which retaliates by confusing you in turn! ;-) As far as I know, the encoding/decoding process is handled by the codecs module, and somewhere in there lies the origin of this misleading exception message.

You can check for yourself. Either

u'\x80'.encode('ascii')

or

u'\x80'.decode('ascii')

will throw a UnicodeEncodeError, whereas

u'\x80'.encode('utf8')

will not, but

u'\x80'.decode('utf8')

will again!

I guess you are confused by the meaning of encoding and decoding. To put it simply:

                     decode             encode    
ByteString (ascii)  --------> UNICODE  --------->  ByteString (utf8)
            codec                                              codec

But why is there a codec argument for the decode method? Well, the underlying function cannot guess which codec the ByteString was encoded with, so it takes the codec as an argument as a hint. If none is provided, it assumes you mean sys.getdefaultencoding() to be used implicitly.

So when you use c.decode('ascii'), you (a) have an (encoded) ByteString (that's why you use decode), (b) want to get a unicode object (that's what you use decode for), and (c) state that the codec in which the ByteString is encoded is ASCII.
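The diagram above can be traced in a few lines. This is only a minimal sketch, written for Python 3, where the byte-string type is bytes and the unicode type is str (in Python 2 the same two methods live on str and unicode). The sample bytes and the Latin-1 source encoding are assumptions chosen for illustration.

```python
# Bytes as they might arrive from a file or socket, encoded as Latin-1 (assumed).
raw = b'Tam\xe1s'

# decode: bytes -> text. The codec names the encoding the bytes are already in.
text = raw.decode('latin-1')

# encode: text -> bytes. The codec names the encoding you want to produce.
utf8_bytes = text.encode('utf-8')

print(repr(text))        # 'Tamás'
print(repr(utf8_bytes))  # b'Tam\xc3\xa1s'
```

The codec passed to decode describes the input, while the codec passed to encode describes the output, which is why the two ends of the diagram can use different encodings.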

See also:
https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it, and it does so by default using the ASCII encoding.
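The hidden step can be spelled out explicitly. The sketch below is written for Python 3, where str has no decode method at all, so the implicit encode has to be written by hand. The helper name py2_style_decode is hypothetical, and hard-coding ASCII as the implicit encoding assumes the usual Python 2 default.

```python
def py2_style_decode(text, codec):
    """Emulate Python 2's unicode.decode(): encode implicitly, then decode."""
    # Hidden step: Python 2 first turns the unicode object into bytes using
    # the default encoding (assumed here to be ASCII).
    raw = text.encode('ascii')
    # Only then does the decode that the caller asked for actually run.
    return raw.decode(codec)

try:
    py2_style_decode('\xe1', 'ascii')
except UnicodeEncodeError as exc:
    # The failure comes from the hidden encode, not from the decode.
    error_name = type(exc).__name__
    print(error_name, exc.reason)
```

Seen this way, the error message is accurate: an encode really did fail, it just was not one the caller wrote.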

Edit to add: It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').
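As a quick illustration of that one call, here is a minimal sketch using the name from the question. It is written for Python 3, where the result is a bytes object (in Python 2 it would be a str). Note that this error handler emits numeric character references, not the named entities from htmlentitydefs.

```python
text = 'Tam\xe1s Horv\xe1th'  # the name from the question

# Characters outside ASCII become numeric character references (&#NNN;).
escaped = text.encode('ascii', 'xmlcharrefreplace')

print(escaped)  # b'Tam&#225;s Horv&#225;th'
```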

This answer always works for me when I have this problem:

def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key):byteify(value) for key,value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

(From the question "How to get string objects instead of Unicode ones from JSON in Python?")

Python has two types of strings: character-strings (the unicode type) and byte-strings (the str type). The code you have pasted operates on byte-strings. You need a similar function to handle character-strings.

Maybe this:

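import htmlentitydefs

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        if (ord(c) < 32) or (ord(c) > 126):
            # Fall back to a numeric reference when no named entity exists
            # (e.g. control characters), instead of raising KeyError.
            name = htmlentitydefs.codepoint2name.get(ord(c))
            c = '&{};'.format(name) if name else '&#{};'.format(ord(c))
        escaped_chars.append(c)
    return ''.join(escaped_chars)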
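For readers on Python 3, a counterpart of the function above might look like the sketch below. It assumes the html.entities module (the Python 3 home of htmlentitydefs) and adds a numeric fallback for code points that have no named entity. Neither detail comes from the original answer.

```python
from html.entities import codepoint2name

def uescape(text):
    """Replace characters outside printable ASCII with HTML entities."""
    escaped_chars = []
    for c in text:
        code = ord(c)
        if code < 32 or code > 126:
            name = codepoint2name.get(code)
            # Prefer a named entity; fall back to a numeric reference when
            # HTML defines no name for this code point.
            c = '&{};'.format(name) if name else '&#{};'.format(code)
        escaped_chars.append(c)
    return ''.join(escaped_chars)

print(uescape('Tam\xe1s Horv\xe1th'))  # Tam&aacute;s Horv&aacute;th
```

Like the function it mirrors, this leaves &, < and > untouched, because they fall inside the printable ASCII range.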

I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character-strings directly to the API and have it figure out how to set the encoding.
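A minimal sketch of that approach, written for Python 3: the function render_page and the headers dictionary are hypothetical stand-ins for whatever your framework provides, and the only point is that encoding happens once, at the very end.

```python
def render_page(name):
    """Build the document as text; non-ASCII characters stay as they are."""
    return '<p>Hello, {}!</p>'.format(name)

content = render_page('Tam\xe1s Horv\xe1th')

# Encode exactly once, at the boundary where the response leaves the program.
body = content.encode('UTF-8')

# Tell the client which encoding the bytes are in (hypothetical header dict).
headers = {'Content-Type': 'text/html; charset=utf-8'}

print(body)  # b'<p>Hello, Tam\xc3\xa1s Horv\xc3\xa1th!</p>'
```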

I found a solution on this site:

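import sys

# Python 2 only: reload(sys) restores setdefaultencoding(), which site.py removes at startup.
reload(sys)
sys.setdefaultencoding("latin-1")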

a = u'\xe1'
print str(a) # no exception

Decoding a str makes no sense.

I think you can check ord(c) > 127 instead.

