
Parse unicode characters from HTML element in Python

I have the following code to parse HTML pages. It returns an HTML element object. I would like to run this code on several machines, so it needs to support a proxy when someone runs it from behind one.

import urllib2
from lxml.html import parse

def parsepage(url):
    # `proxy` is defined elsewhere; a falsy value means "no proxy"
    if proxy:
        # Route HTTP requests through the configured proxy
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        site = parse(conn).getroot()
    else:
        site = parse(url).getroot()
    return site

After it returns the HTML element, I get data from the object by using XPath expressions like this:

element = site.xpath(expression)

The problem is that the result contains non-unicode data with escape sequences, e.g.:

\xe1ci\xf3s kombi

I tried this implementation as well, but it gives me an error:

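import urllib2
from lxml import etree
from lxml.html import parse

def parsepage(url):
    # `proxy` is defined elsewhere; a falsy value means "no proxy"
    if proxy:
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        rawdata = conn.read()
        # Forces the parser to treat the raw bytes as UTF-8
        parser = etree.HTMLParser(encoding="utf-8")
        site = etree.HTML(rawdata, parser=parser)
    else:
        site = parse(url).getroot()
    return site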

And the error message is:

'utf8' codec can't decode byte 0xf3 in position 4: invalid continuation byte

The site uses the iso-8859-2 charset.

Is there a way to convert the non-unicode characters to unicode with one of the parsing methods listed above? Or maybe I have this wrong: I am getting the data in the correct format and the problem is only with how it is displayed.

Should I use lxml.fromstring instead and use the encoding parameter?

Thanks, g0m3z

Solution:

Actually, the problem was not in my code but in how the data was displayed. The first implementation works fine.

I load the results into a dictionary, and when I print the whole dictionary at once, the non-ASCII characters are shown as escape sequences. However, if I print a single item of the dictionary by its key, the characters are displayed correctly. So it works! Interesting. Thanks to everyone on this thread for the valuable comments!
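The difference described above can be reproduced with a small sketch. In Python 2, printing a whole dictionary shows the repr() of each value, which escapes non-ASCII characters, while printing a single string shows the characters themselves. The sketch below is written for Python 3, where ascii() gives the same escaped view; the dictionary and its key name "model" are made up for illustration and are not from the original code.

```python
# Hypothetical result dictionary; the key name "model" is made up for illustration.
result = {"model": "\xe1ci\xf3s kombi"}

# ascii() escapes non-ASCII characters the way Python 2's repr() does,
# which is the view Python 2 shows when a whole dict is printed.
escaped_view = ascii(result)
print(escaped_view)

# Printing one value goes through str(), so the real characters appear.
single_item = result["model"]
print(single_item)
```

The data is the same in both cases; only the displayed representation differs.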

You should read the actual character encoding from the HTTP headers (or the HTML meta tags) rather than guessing it. This way you can avoid decoding errors.
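A minimal sketch of that idea, assuming Python 3 and only the standard library: the charset is taken from a Content-Type header value and used to decode the raw bytes. The helper name charset_from_content_type, the "utf-8" fallback, and the sample header and bytes are assumptions for illustration; in real code the header value would come from the HTTP response.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent
    return msg.get_content_charset() or default


# Illustrative bytes, encoded the way an ISO-8859-2 page would send them.
raw = b"\xe1ci\xf3s kombi"

charset = charset_from_content_type("text/html; charset=ISO-8859-2")
text = raw.decode(charset)

# Guessing UTF-8 for the same bytes raises a decoding error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

When the header carries no charset, the helper falls back to the default, so the meta tags in the HTML would still need to be checked in that case.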

You can try using a library to parse the response. I recommend BeautifulSoup. It handles the encoding problems for you and is very easy to use.

