Help replacing non-ASCII characters in Python

I have a bunch of HTML files that I downloaded using the httplib2 package in Python. Every '&nbsp;' is showing up as 'Â '.

<font color="#ff0000">02/12/2004Â </font> is what shows up, while <font color="#ff0000">02/12/2004&nbsp;</font> is the desired output.

How do I replace the 'Â ' with '&nbsp;' in Python? Thanks a lot!

You've got an encoding problem. Instead of trying to remove these characters, find out which encoding the page uses. Then, when you read the file, use the codecs module instead of open() and pass it that encoding.
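Here is a minimal sketch of that advice. It assumes the pages are UTF-8, which is only a guess based on the symptom: a no-break space (U+00A0) is the two bytes 0xC2 0xA0 in UTF-8, and reading those bytes as Latin-1 produces exactly 'Â' followed by a no-break space. The temporary file stands in for one of the downloaded HTML files.

```python
import codecs
import os
import tempfile

# U+00A0 (no-break space) is two bytes in UTF-8: 0xC2 0xA0.
raw = u'02/12/2004\u00a0'.encode('utf-8')

# Decoding those bytes with the wrong encoding produces the stray character.
garbled = raw.decode('latin-1')

# Stand-in for a downloaded file: write the raw bytes to a temporary file.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

# Read it back with the encoding stated explicitly (assumed to be UTF-8 here).
with codecs.open(path, 'r', encoding='utf-8') as f:
    content = f.read()
os.remove(path)

# With correctly decoded text, the no-break space can be turned into an entity.
fixed = content.replace(u'\u00a0', u'&nbsp;')
```

If the page declares a different charset in its headers or in a meta tag, that name would go in place of 'utf-8'.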

import string
filtered_content = ''.join(ch for ch in content if ch in string.printable)

This solved my problem. Thank you!

s = s.replace('Â ', '&nbsp;')

However, while I haven't used httplib2, I'm pretty sure something is wrong if the content of the HTML files changes when you download them. More likely there's a decoding problem. What version of Python are you using? If it's Python 3, the downloaded content is a byte sequence, not a string, so you'll have to specify the right encoding to decode the bytes with.
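As a rough illustration of the Python 3 case, the sketch below decodes a byte string using the charset named in a Content-Type header. The helper decode_body and the sample bytes are hypothetical. With httplib2 the bytes would be the content returned by a request and the header would presumably come from the response object, but no network call is made here.

```python
from email.message import Message


def decode_body(body, content_type, default='utf-8'):
    """Decode response bytes using the charset from a Content-Type value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset)


# Sample bytes standing in for a downloaded page fragment.
body = b'<font color="#ff0000">02/12/2004\xc2\xa0</font>'
text = decode_body(body, 'text/html; charset=UTF-8')

# Once decoded correctly, the no-break space is a single character.
html = text.replace('\u00a0', '&nbsp;')
```

Falling back to UTF-8 when the header names no charset is an assumption. A page may also declare its encoding in a meta tag, which this sketch does not inspect.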

http://code.google.com/p/httplib2/wiki/ExamplesPython3

EDIT: If you aren't limited to using just httplib2, perhaps you could try looking into using the urllib , urllib2 , or httplib modules that are part of the Python 2.6 standard library?
