
urllib2/lxml encoding problems

I'm new to Python and am trying to use urllib2/lxml to fetch and parse a page. Everything seems to work fine, except that the parsed page, when opened in my browser, has strange characters embedded in it. I'm guessing this is a Unicode/lxml parsing problem. When I get the text content of an element using .text_content() and print it, I get output like "sometext \342\200\223 moretext". In the original page, this shows as "sometext – moretext".

Could anyone tell me:
1. what's going on?
2. how do I fix it?
3. where can I read up on encoding issues like these?

Thanks!

What is going on is that the website is using an en dash, which is slightly longer than a hyphen (and is really the one you should use in ranges, like 40–56; dashes are a whole science unto themselves).

In Unicode, the en dash has code point U+2013. The numbers you get, \342\200\223, are the octal representation of the UTF-8 encoding of that code point. I don't know why you get octal; I get hex, so on my computer it looks like '\xe2\x80\x93'. That makes no difference, though: it's just the representation, and the byte values are the same.
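To see that the octal and hex forms really are the same three bytes, here is a minimal sketch. The variable names are just for illustration, and it only relies on the built-in string and bytes methods:

```python
# EN DASH, code point U+2013
endash = u"\u2013"

# Its UTF-8 encoding is three bytes long
utf8_bytes = endash.encode("utf-8")

# The same three byte values, formatted once as hex and once as octal
hex_form = " ".join("%02x" % b for b in bytearray(utf8_bytes))
octal_form = " ".join("%03o" % b for b in bytearray(utf8_bytes))

print(hex_form)    # e2 80 93
print(octal_form)  # 342 200 223
```

Decoding those bytes as UTF-8 gives back the single en dash character.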

What you should probably do is decode the HTML string to Unicode as early as possible. The headers you get back when you fetch the page should tell you which encoding it uses (although it's apparently UTF-8 here). It's fairly easy to extract that from the headers; you'll see it when you print them out.

You then decode the html data:

htmldata = htmldata.decode(<the encoding you found in the headers>)
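As a rough sketch of that step, the charset can be pulled out of a Content-Type header value with the standard library's email parser. The helper names `charset_from_content_type` and `decode_body` are hypothetical, and falling back to UTF-8 when the header names no charset is an assumption, not something the server guarantees. The header value is passed in as a plain string here; in real code it would come from the response headers.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode the raw response body using the charset from the header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Example bytes as they would arrive over the wire
raw = b"sometext \xe2\x80\x93 moretext"
text = decode_body(raw, "text/html; charset=UTF-8")
```

After this, `text` is a Unicode string containing a real en dash rather than three separate bytes, and it can be handed to the parser.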

You'll mainly need to be mindful of unicode issues at two points in the process:

  1. Get the response into a unicode string, nicely explained here on SO
  2. Specify a suitable encoding when outputting strings


# from an lxml etree
etree.tostring(root, encoding='utf-8', xml_declaration=False)

# from a unicode string
x.encode('utf-8')
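For the output side, one way to make the encoding explicit is to open the file with an encoding argument instead of encoding by hand. This is only a sketch: the file name and temporary directory are made up for the example, and it uses `io.open`, which takes an `encoding` parameter.

```python
import io
import os
import tempfile

text = u"sometext \u2013 moretext"

# Hypothetical output location, just for this example
path = os.path.join(tempfile.mkdtemp(), "out.html")

# Writing through a file opened with an explicit encoding
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back in binary mode shows the bytes that were written
with io.open(path, "rb") as f:
    written = f.read()
```

The bytes on disk are the same ones `text.encode('utf-8')` would produce, so a browser that reads the file as UTF-8 will show the en dash correctly.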
