Python Encoding Error, not unicode string

How do I get rid of the "u" without causing other encoding problems?

u"Example Characters : \xc3\xa9 \xc3\xa0"

Here is what it prints:

Example Characters : Ã© Ã 

Instead of:

Example Characters : é à

I encounter this problem when using getText() on a BeautifulSoup element. (The webpage is in UTF-8.)

You have a Mojibake (the input was decoded with the wrong encoding).

You most likely passed a Unicode string to BeautifulSoup(). Don't do this; leave decoding to BeautifulSoup.

For example, if you used requests, use response.content, not response.text, to pass the HTML to BeautifulSoup(). Otherwise you run the risk of the result being decoded as Latin-1, the default encoding for text responses over HTTP when the headers do not name an explicit character set. If you used urllib2, don't decode first.
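
To make the failure mode concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded as Latin-1. It is written in Python 3 syntax and uses only the standard library; the variable `body` is a hypothetical stand-in for the raw bytes of an HTTP response, not an actual requests object.

```python
# Hypothetical raw response body: the page text encoded as UTF-8 bytes.
body = "Example Characters : é à".encode("utf-8")

# Decoding with the wrong codec (Latin-1) maps every byte to one character,
# so each two-byte UTF-8 sequence becomes two unrelated characters.
wrong = body.decode("latin-1")

# Decoding with the codec the page was actually written in.
right = body.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

The `wrong` value is the same string as the one in the question, which is why passing undecoded bytes to BeautifulSoup avoids the problem.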

BeautifulSoup detects the encoding and decodes for you; it'll use HTML <meta> tags if present. UTF-8 should be autodetected correctly. If you know the encoding up front and BeautifulSoup got it wrong anyway, use from_encoding to specify the correct encoding:

soup = BeautifulSoup(htmlsource, from_encoding='utf8')

See the Encodings section of the BeautifulSoup documentation.

If after all that you are still getting Mojibake results, then the web page itself has produced data with incorrectly encoded values. In that case you can undo the error with:

mojibake_string.encode('latin1').decode('utf8')

This re-interprets the characters in the correct encoding:

>>> u"Example Characters : \xc3\xa9 \xc3\xa0".encode('latin1').decode('utf8')
u'Example Characters : \xe9 \xe0'
>>> print _
Example Characters : é à
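
The same round trip can be wrapped in a small helper. This is only a sketch in Python 3 syntax: the function name `fix_mojibake` is hypothetical, and falling back to the unchanged input when the round trip fails is an assumption about what a caller would want, not something the answer above prescribes.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1.

    Returns the input unchanged if it does not look like this
    particular kind of mojibake.
    """
    try:
        # Turn the characters back into the original bytes,
        # then decode those bytes with the intended codec.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the
        # recovered bytes are not valid UTF-8: leave it alone.
        return text


broken = "Example Characters : \xc3\xa9 \xc3\xa0"
fixed = fix_mojibake(broken)
print(fixed)
```

Applying the helper to text that is already correct, such as `fix_mojibake(fixed)`, returns it unchanged because the recovered bytes are not valid UTF-8.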

There is no need to be concerned about the u prefix; that is just a type indicator, to show you have a Unicode value.

The string you created unambiguously contains the Unicode characters U+00C3, U+00A9, and U+00A0. Their printed representation is the string you say you don't want.
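
One way to check which characters a string really holds is to list its code points. The sketch below uses Python 3 syntax and the standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

# The accented part of the string from the question, as a text string.
s = "\xc3\xa9 \xc3\xa0"

# Pair each character's code point with its official Unicode name.
names = [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in s]

for code_point, name in names:
    print(code_point, name)
```

The listing shows a tilde-capped A, a copyright sign, and a no-break space rather than the accented letters that were intended.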

Apparently you are trying to embed a UTF-8 string. That's a byte string (b'...' in Python 3.x), not a Unicode string (u'...'). To get the string you actually wanted, try

"Example Characters : \xc3\xa9 \xc3\xa0".decode('utf-8')

which produces a Unicode string containing the actual characters you want.
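
For readers on Python 3, where a plain '...' literal is already a Unicode string, the equivalent is to write the data as a bytes literal and decode it. A minimal sketch, with illustrative variable names:

```python
# The UTF-8 encoded data, written as a bytes literal.
utf8_bytes = b"Example Characters : \xc3\xa9 \xc3\xa0"

# Decoding turns each two-byte sequence into a single character.
text = utf8_bytes.decode("utf-8")

print(text)
print(len(utf8_bytes), len(text))
```

The decoded string is two characters shorter than the byte string, since each accented letter occupies two bytes in UTF-8.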

See also http://nedbatchelder.com/text/unipain.html
