
Parse unicode characters from HTML element in Python

I have the following code to parse HTML sites. It returns an HTML element object. I would like to run this code on several machines, so it is important for me to set up a proxy handler when someone runs it from behind a proxy.

import urllib2
from lxml.html import parse

def parsepage(url):
    # 'proxy' is expected to be defined globally, e.g. "http://host:port"
    if proxy:
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        site = parse(conn).getroot()
    else:
        site = parse(url).getroot()
    return site

After it returns the HTML element, I extract data from the object with XPath expressions like this:

element = site.xpath(expression)

The problem is that the result contains non-unicode data with escape characters in it. For example:

\xe1ci\xf3s kombi

I tried the following implementation as well, but this one gives me an error:

import urllib2
from lxml import etree
from lxml.html import parse

def parsepage(url):
    if proxy:
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        rawdata = conn.read()
        parser = etree.HTMLParser(encoding="utf-8")
        site = etree.HTML(rawdata, parser=parser)
    else:
        site = parse(url).getroot()
    return site

And the error message is:

'utf8' codec can't decode byte 0xf3 in position 4: invalid continuation byte

The site uses the iso-8859-2 charset.
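Since the page is iso-8859-2 rather than UTF-8, the decode error above is expected. A minimal sketch of the second approach with the encoding corrected (rawdata is the raw response body, as in the code above):

from lxml import etree

# Tell the parser the page's real encoding instead of utf-8
parser = etree.HTMLParser(encoding="iso-8859-2")
site = etree.HTML(rawdata, parser=parser)

# XPath results are now proper unicode strings
element = site.xpath(expression)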

Is there a way to convert the non-unicode characters to unicode with one of the parsing methods listed above? Or maybe I'm getting the data in the correct format and the problem is only with its representation.

Should I use lxml.fromstring instead and pass the encoding parameter?
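For reference, lxml.html.fromstring has no encoding argument of its own, but it does accept a parser object that carries one; a sketch of that variant (html_bytes is just a placeholder name for the raw page bytes):

from lxml.html import fromstring, HTMLParser

# The parser carries the encoding; fromstring itself has no encoding keyword
parser = HTMLParser(encoding="iso-8859-2")
root = fromstring(html_bytes, parser=parser)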

Thanks, g0m3z

Solution:

Actually there was no problem with my code, only with the representation of the data. The first implementation works fine.

I load the result into a dictionary, and when I print the whole dictionary in one go it shows the unicode characters incorrectly. However, if I print a single item of the result dictionary by key, the unicode characters are represented correctly. So it works! Interesting. Thanks to everyone on this thread for the valuable comments!
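For what it's worth, this matches normal Python 2 behaviour: printing a whole dict shows the repr of each unicode value, escapes and all, while printing a single value shows the actual characters. A small illustration, independent of lxml:

# Python 2
d = {'car': u'\xe1ci\xf3s kombi'}

print d          # {'car': u'\xe1ci\xf3s kombi'}  -> repr with \x escapes
print d['car']   # ációs kombi                    -> the real characters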

You should read the actual character encoding from the HTTP headers (or HTML meta tags) instead of guessing it. That way you can avoid decoding errors.
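A minimal sketch of that idea with urllib2 and lxml, assuming Python 2 as in the question (the iso-8859-2 fallback is just a guess for this particular site):

import urllib2
from lxml import etree

conn = urllib2.urlopen(url)
# Charset advertised in the Content-Type response header, if any
charset = conn.headers.getparam('charset') or 'iso-8859-2'
parser = etree.HTMLParser(encoding=charset)
site = etree.HTML(conn.read(), parser=parser)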

You can try using a library for parsing the response. I recommend BeautifulSoup. It handles all of the encoding issues and is very easy to use.
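A minimal sketch with the bs4 package, again assuming Python 2; BeautifulSoup detects the document encoding itself and returns unicode:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")   # encoding is sniffed automatically
print soup.title.string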
