Convert unicode with utf-8 string as content to str

Question

I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

But what I get in content is a unicode string whose code points are actually the UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

Answer 1

If you have a unicode value that holds UTF-8 bytes, encode it to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')

This works because the Unicode code points U+0000 to U+00FF map one-to-one onto the Latin-1 byte values, so this encoding treats your data as literal bytes.

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
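The session above is Python 2. As a rough Python 3 sketch of the same round trip (the helper name repair_mojibake is made up for illustration and is not part of PyQuery or the standard library), the repair could look like this:

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 payload that was wrongly decoded as Latin-1.

    Hypothetical helper: it assumes every character in `text` is in the
    range U+0000..U+00FF and that the underlying bytes are valid UTF-8.
    """
    # Latin-1 maps each code point U+0000..U+00FF to the byte with the
    # same value, so encoding recovers the original bytes unchanged.
    raw = text.encode('latin-1')
    # Those bytes are the real UTF-8 data, so decode them as such.
    return raw.decode('utf-8')


# Every byte value decodes through Latin-1 to the code point of the same value.
all_bytes = bytes(range(256))
assert [ord(ch) for ch in all_bytes.decode('latin-1')] == list(range(256))

garbled = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = repair_mojibake(garbled)
print(fixed == '\u5c42\u53e0\u6837\u5f0f\u8868')  # True
```

If the text contains characters above U+00FF, or the recovered bytes are not valid UTF-8, the encode or decode step raises an error, which is a useful signal that the input was not this kind of mis-decoded data.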

PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, it uses the .text attribute of the response. This auto-decodes the response data based solely on the encoding set in the Content-Type header or, if that information is not available, falls back to Latin-1 (the fallback applies to text responses, and HTML is a text response). You can override this by passing in an encoding argument:

# Keyword arguments must come after positional ones, otherwise this is a SyntaxError.
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')

at which point you would not have to re-encode at all.
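To see why the fallback produces the string from the question, here is a small offline Python 3 sketch. The function decode_body is a hypothetical stand-in for a client that decodes with Latin-1 when no charset is known; it is not the requests or PyQuery API, and no network access is involved.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Hypothetical stand-in for a client that falls back to Latin-1."""
    return raw.decode(charset or 'latin-1')


# The bytes a server would send for the five-character UTF-8 text in the question.
raw = '\u5c42\u53e0\u6837\u5f0f\u8868'.encode('utf-8')

fallback = decode_body(raw)           # no charset: each byte becomes one code point
explicit = decode_body(raw, 'utf-8')  # encoding stated up front

print(len(fallback), len(explicit))       # 15 5
print(fallback.encode('latin-1') == raw)  # True
```

The fallback result has one character per byte, which is exactly the u'\xe5\xb1\x82...' value shown in the question, while stating the encoding up front yields the five intended characters directly.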

Answer 1: 26, accepted, 2013-01-26 18:18:30
