I'm trying to use urllib and urllib2 to read a text file that contains French characters such as "é" and "à".
def load(url):
    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(url)
    f = urlopen(req)
    f.readline()
    for line in f:
        line = line.split('\t')
        word = line[0].encode('utf-8')
I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the Unicode value, but this gives me the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)
Can someone tell me what's going on? Any help would be appreciated. Thanks!
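Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see; if it weren't, you wouldn't need to do anything with it. When you call encode on a byte string, Python 2 first implicitly decodes it with the ASCII codec, which is why you get a UnicodeDecodeError at the first non-ASCII byte.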
word = unicode(line[0], "utf8")
You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, you know, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
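To make the "try utf8, then suspect latin1" advice concrete, here is a minimal sketch. The helper name `decode_bytes` and the default order of encodings are assumptions made for illustration, not part of urllib2. It is written to run on Python 3, where lines read from the response are `bytes`; the same `.decode()` call also works on a Python 2 byte string.

```python
def decode_bytes(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string with the first encoding that succeeds.

    `decode_bytes` is a hypothetical helper. The order matters: latin-1
    accepts any byte sequence, so it only makes sense as the last resort.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("could not decode with any of: %r" % (encodings,))


# A tab-separated line as it might arrive from the server, encoded in
# latin-1, where the byte 0xe8 stands for "è".
raw_line = b"probl\xe8me\tproblem\n"

# Decode first, then split: from here on the code works with text, not bytes.
word = decode_bytes(raw_line).split("\t")[0]
```

If the server declares a charset in its Content-Type header, passing that name first in `encodings` would be more reliable than guessing.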
Put the line below at the top of your source file.
# coding: utf-8
Supporting Unicode properly is actually not easy in Python. I also recommend reading the following:
http://www.python.org/dev/peps/pep-0263
http://groups.google.com/group/python-excel/browse_thread/thread/100ec019d3a2a1a9