Reading a text file in unicode from a URL?

Question

I'm trying to use urllib and urllib2 to read from a text file that has French characters in it, like "é", "à", and so on.

def load(url):
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request(url)

    f = urlopen(req)
    f.readline()

    for line in f:
        line = line.split('\t')
        word = line[0].encode('utf-8')

I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the unicode value, but this gives me the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)

Can someone tell me what's going on? Any help would be appreciated. Thanks!

Answer 1

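Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see. If it weren't, you wouldn't need to do anything with it. The error mentions the 'ascii' codec because, in Python 2, calling encode() on a byte string first implicitly decodes it with the default ASCII codec, and that step fails on the first non-ASCII byte.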

word = unicode(line[0], "utf8")
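Applied to the loop from the question, that could look like the sketch below. It assumes the file is UTF-8 and tab-separated. The helper name `first_column` is hypothetical, and it accepts any iterable of byte strings (such as the object returned by `urlopen`) so it can be tried without a network connection. `raw.decode(encoding)` is used in place of `unicode(raw, encoding)` so the snippet behaves the same on Python 2 and Python 3.

```python
def first_column(byte_lines, encoding="utf-8"):
    """Return the decoded first tab-separated field of each line, skipping the header."""
    lines = iter(byte_lines)
    next(lines, None)  # skip the header line, like f.readline() in the question
    words = []
    for raw in lines:
        text = raw.decode(encoding)  # bytes -> Unicode text
        words.append(text.rstrip(u"\r\n").split(u"\t")[0])
    return words


# Stand-in for the lines of a downloaded file (assumed sample data).
sample = [b"word\tcount\n", b"caf\xc3\xa9\t3\n", b"\xc3\xa0\t7\n"]
words = first_column(sample)
```

Decoding the whole line before splitting keeps every later string operation working on Unicode text rather than raw bytes.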

You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, you know, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
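As a rough sketch of fishing the encoding out of the headers: the helper names `charset_from_content_type` and `decode_body` below are hypothetical, and the UTF-8 fallback is an assumption, not something the server guarantees. Worth noting as a hint only: the byte 0xe8 in the error message is "è" in latin1, whereas UTF-8 would encode "è" as the two bytes 0xc3 0xa8, so latin1 is a plausible candidate for this particular file.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode raw bytes using the charset declared in the Content-Type value."""
    return raw.decode(charset_from_content_type(content_type))


declared = charset_from_content_type("text/plain; charset=ISO-8859-1")
fallback = charset_from_content_type("text/plain")
text = decode_body(b"tr\xe8s", "text/plain; charset=ISO-8859-1")
```

With a response object `f` from `urlopen`, the header value would typically come from something like `f.info().get("Content-Type", "")`. Servers sometimes omit or misreport the charset, so treat the result as a first guess.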

Answer 2

Put the line below at the top of your source file. It declares the encoding of the source file itself (see PEP 263).

# coding: utf-8

Supporting Unicode properly is not easy in Python. I also recommend reading these:

http://www.python.org/dev/peps/pep-0263

http://groups.google.com/group/python-excel/browse_thread/thread/100ec019d3a2a1a9
