
Handle wrongly encoded character in Python unicode string

I am dealing with unicode strings returned by the python-lastfm library.

I assume that somewhere along the way the library gets the encoding wrong and returns a unicode string that may contain invalid characters.

For example, the original string I am expecting in the variable a is "Glück":

>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

\xfc is the escaped value 252, which corresponds to the latin1 encoding of "ü". Somehow this gets embedded in the unicode string in a way Python can't handle on its own.
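As a small check of the claim about the value 252, the sketch below inspects the character directly. Inside a unicode literal, \xfc denotes the code point U+00FC rather than a raw latin1 byte; the two only look alike because latin1 maps code points 0 to 255 onto the same byte values. The variable names are illustrative only.

```python
ch = u"\xfc"

# Code point of the character: 0xfc == 252, i.e. U+00FC.
code_point = ord(ch)

# The same character as bytes in two different encodings.
latin1_bytes = ch.encode("latin-1")  # one byte, same value as the code point
utf8_bytes = ch.encode("utf-8")      # two bytes

print(code_point, repr(latin1_bytes), repr(utf8_bytes))
```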

How do I convert this back to a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError or a string containing the sequence \xfc.

You have to convert your unicode string into a standard string using some encoding, e.g. utf-8:

some_unicode_string.encode('utf-8')
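A minimal sketch of that call applied to the string from the question, assuming the value really is u'Gl\xfcck'. It also shows the round trip back to unicode and the optional "replace" error handler for targets that can only take ASCII.

```python
some_unicode_string = u"Gl\xfcck"

# Encode to a byte string; the u-umlaut becomes the two bytes C3 BC.
encoded = some_unicode_string.encode("utf-8")

# Decoding with the same codec returns the original unicode string.
decoded = encoded.decode("utf-8")

# If the target can only take ASCII, an error handler avoids the exception
# at the cost of losing the character.
ascii_safe = some_unicode_string.encode("ascii", "replace")

print(repr(encoded), repr(ascii_safe))
```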

Apart from that: this is a dupe of

BeautifulSoup findall with class attribute- unicode encode error

and at least ten other related questions on SO. Research first.

Your unicode string is fine:

>>> import unicodedata
>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'

The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec, and that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me, likely because my environment variable settings differ from yours):

>>> print u'Gl\xfcck'
Glück
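To make the role of sys.stdout.encoding concrete, here is a rough sketch of a helper that encodes text for a given stream and never raises. The function name encode_for_terminal and the "utf-8" fallback are assumptions made for illustration, not part of any library.

```python
import sys


def encode_for_terminal(text, stream=None, fallback="utf-8"):
    """Encode unicode text for a stream, using the stream's own encoding.

    Hypothetical helper: falls back to `fallback` when the stream reports
    no encoding (for example when output is piped under Python 2).
    """
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or fallback
    # "replace" substitutes "?" for characters the codec cannot represent.
    return text.encode(encoding, "replace")


print(repr(encode_for_terminal(u"Gl\xfcck")))
```

Setting the PYTHONIOENCODING environment variable (for example to utf-8) before starting the interpreter is another way to influence what sys.stdout.encoding reports.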

At the beginning of your code, just after the imports, add these 3 lines:

import sys  # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')

It will override the system default encoding (ascii) for the duration of your program.

Edit: You shouldn't do this unless you are sure of the consequences, see comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')
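One possible alternative that leaves the global default untouched is to encode explicitly at the output boundary. The sketch below wraps a byte stream with codecs.getwriter; the helper name utf8_writer and the in-memory buffer are stand-ins chosen for illustration, and a real program might wrap a file or the standard output instead.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Return a writer that encodes unicode text as UTF-8 onto byte_stream."""
    return codecs.getwriter("utf-8")(byte_stream)


# An in-memory byte buffer stands in for a real file or terminal here.
buffer = io.BytesIO()
writer = utf8_writer(buffer)
writer.write(u"Gl\xfcck\n")

result = buffer.getvalue()
print(repr(result))
```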

Do not str() cast what you've got from model fields to a string, as long as it is already a unicode string. (Oops, I totally missed that this is not Django-related.)
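The reason str() is risky in Python 2 is that it encodes a unicode object with the default codec, which is normally ascii. The sketch below emulates that implicit step with an explicit encode call so that it behaves the same under Python 3; the variable names are illustrative.

```python
value = u"Gl\xfcck"

try:
    # Roughly what Python 2's str(value) does behind the scenes.
    value.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    failed_at = exc.start

print(failed_at)
```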

I stumbled upon this bug myself while processing a file containing German words that, unknown to me, had been encoded in UTF-8. The problem manifested itself when I started processing the words and some of them triggered the encoding error:

# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40) 
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

I solved the error by calling the encode method on the string:

>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück
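A variation on the same idea, offered as a sketch rather than a correction: build the whole message as unicode first and encode it once at the end, so that byte strings and unicode strings are never mixed inside format(). The names message and payload are illustrative.

```python
utf8_word = u"Gl\xfcck"

# Keep everything unicode while formatting...
message = u"Word read was: {}".format(utf8_word)

# ...and encode a single time when the text leaves the program.
payload = message.encode("utf-8")

print(repr(payload))
```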
