
Parse unicode characters from HTML element in Python

I have the following code to parse HTML sites. It returns an HTML element object. I would like to run this code on several machines, so it is important for me to set up a proxy handler when someone runs it from behind a proxy.

import urllib2
from lxml.html import parse

def parsepage(url):
    # 'proxy' is expected to be defined globally, e.g. "http://host:port"
    if proxy:
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        site = parse(conn).getroot()
    else:
        site = parse(url).getroot()
    return site

After it returns the HTML element, I extract data from the object with XPath expressions like this:

element = site.xpath(expression)

The problem is that the result contains non-unicode data with escape characters in it. For example:

\xe1ci\xf3s kombi

I tried the following implementation as well, but this one gives me an error:

import urllib2
from lxml import etree
from lxml.html import parse

def parsepage(url):
    if proxy:
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        rawdata = conn.read()
        parser = etree.HTMLParser(encoding="utf-8")
        site = etree.HTML(rawdata, parser=parser)
    else:
        site = parse(url).getroot()
    return site

And the error message is:

'utf8' codec can't decode byte 0xf3 in position 4: invalid continuation byte

The site uses the iso-8859-2 charset.
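Since the page is iso-8859-2 rather than UTF-8, the decode error above is expected. A minimal sketch of the second approach with the encoding corrected (rawdata is the raw response body, as in the code above):

from lxml import etree

# Tell the parser the page's real encoding instead of utf-8
parser = etree.HTMLParser(encoding="iso-8859-2")
site = etree.HTML(rawdata, parser=parser)

# XPath results are now proper unicode strings
element = site.xpath(expression)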

Is there a way to convert the non-unicode characters to unicode with one of the parsing methods listed above? Or maybe I'm getting the data in the correct format and the problem is only with its representation.

Should I use lxml.fromstring instead and pass the encoding parameter?
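For reference, lxml.html.fromstring has no encoding argument of its own, but it does accept a parser object that carries one; a sketch of that variant (html_bytes is just a placeholder name for the raw page bytes):

from lxml.html import fromstring, HTMLParser

# The parser carries the encoding; fromstring itself has no encoding keyword
parser = HTMLParser(encoding="iso-8859-2")
root = fromstring(html_bytes, parser=parser)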

Thanks, g0m3z

Solution:

Actually there was no problem with my code, only with the representation of the data. The first implementation works fine.

I load the result into a dictionary, and when I print the whole dictionary in one go it shows the unicode characters incorrectly. However, if I print a single item of the result dictionary by key, the unicode characters are represented correctly. So it works! Interesting. Thanks to everyone on this thread for the valuable comments!
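For what it's worth, this matches normal Python 2 behaviour: printing a whole dict shows the repr of each unicode value, escapes and all, while printing a single value shows the actual characters. A small illustration, independent of lxml:

# Python 2
d = {'car': u'\xe1ci\xf3s kombi'}

print d          # {'car': u'\xe1ci\xf3s kombi'}  -> repr with \x escapes
print d['car']   # ációs kombi                    -> the real characters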

You should read the actual character encoding from the HTTP headers (or HTML meta tags) instead of guessing it. That way you can avoid decoding errors.
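A minimal sketch of that idea with urllib2 and lxml, assuming Python 2 as in the question (the iso-8859-2 fallback is just a guess for this particular site):

import urllib2
from lxml import etree

conn = urllib2.urlopen(url)
# Charset advertised in the Content-Type response header, if any
charset = conn.headers.getparam('charset') or 'iso-8859-2'
parser = etree.HTMLParser(encoding=charset)
site = etree.HTML(conn.read(), parser=parser)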

You can try using a library for parsing the response. I recommend BeautifulSoup. It handles all of the encoding issues and is very easy to use.
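A minimal sketch with the bs4 package, again assuming Python 2; BeautifulSoup detects the document encoding itself and returns unicode:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")   # encoding is sniffed automatically
print soup.title.string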
