Parse unicode characters from HTML element in Python

Question

I have the following code to parse HTML pages. It returns an HTML element object. I would like to run this code on several machines, so it needs to support a proxy when someone runs it from behind one.

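import urllib2

from lxml.html import parse

# "proxy" is assumed to be a module-level setting defined elsewhere.
def parsepage(url):
    if proxy:
        # Route the request through the configured HTTP proxy.
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        site = parse(conn).getroot()
    else:
        # No proxy: let lxml fetch the URL itself.
        site = parse(url).getroot()
    return site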

After it returns the HTML element, I get data from the object by using XPath expressions like this:

element = site.xpath(expression)

The problem is that the result seems to contain non-Unicode data with escape sequences, e.g.:

\xe1ci\xf3s kombi

I tried this implementation as well, but it gives me an error:

import urllib2

from lxml import etree

def parsepage(url):
    if proxy:
        proxy_support = urllib2.ProxyHandler({"http": proxy})
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        conn = urllib2.urlopen(url)
        # Read the raw bytes and parse them with an explicit encoding.
        rawdata = conn.read()
        parser = etree.HTMLParser(encoding="utf-8")
        site = etree.HTML(rawdata, parser=parser)
    else:
        site = parse(url).getroot()
    return site

And the error message is:

'utf8' codec can't decode byte 0xf3 in position 4: invalid continuation byte

The site uses the iso-8859-2 charset.

Is there a way to convert non-Unicode characters to Unicode with one of the parsing methods listed above? Or maybe I'm misunderstanding something: the data may already be in the correct format and the problem is only in how it is displayed.

Should I use lxml.fromstring instead and use the encoding parameter?

Thanks, g0m3z

Solution:

Actually, there was no problem with my code, only with how the data was displayed. The first implementation works fine.

I load the result into a dictionary, and when I print the whole dictionary at once, the Unicode characters show up as escape sequences. However, if I print a single item of the dictionary by its key, the characters are displayed correctly. So it works! Interesting. Thanks to everyone on this thread for the valuable comments!
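
The effect described above can be reproduced without lxml. In Python 2, printing a whole dictionary shows the repr() of each value, which escapes non-ASCII characters, while printing a single value shows the characters themselves. The following is only a minimal sketch, written for Python 3, where ascii() produces the same escaped form that Python 2's repr() would. The dictionary and its key name are made up for illustration.

```python
# Hypothetical result dictionary; the key name "title" is an assumption.
result = {"title": "\xe1ci\xf3s kombi"}

# ascii() escapes non-ASCII characters, much like Python 2's repr() did
# whenever a whole dict was printed.
escaped_view = ascii(result)
print(escaped_view)

# Printing one item shows the real characters
# (given a terminal that can display them).
single_item = result["title"]
print(single_item)
```

The data is the same in both cases; only the display differs, so no conversion step is needed.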

Answer 1

You should read the actual character encoding from the HTTP headers (or the HTML meta tags) rather than guessing it. This way you can avoid decoding errors.
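
As a sketch of this advice, the charset can be taken from the Content-Type response header and used to decode the body before parsing. The example below is written for Python 3 and uses only the standard library. The helper names, the sample header, and the sample bytes are assumptions for illustration, not part of the original code. With a live response from urllib.request.urlopen, conn.headers.get_content_charset() should offer the same lookup.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw response bytes using the charset the server declared."""
    return raw.decode(charset_from_content_type(content_type, default))


# Sample values standing in for a real response (assumed for illustration).
header = "text/html; charset=ISO-8859-2"
raw = b"\xe1ci\xf3s kombi"

text = decode_body(raw, header)
print(charset_from_content_type(header))
```

Decoding the same bytes as UTF-8 raises a UnicodeDecodeError, which is the kind of failure shown in the question.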

Answer 2

You can try using a library to parse the response. I recommend BeautifulSoup. It handles the encoding problems for you and is very easy to use.
