Reading a text file in unicode from a URL?

I'm trying to use urllib and urllib2 to read from a text file that has French characters in it, like "é", "à", and so on.

def load(url):
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request(url)

    f = urlopen(req)
    f.readline()

    for line in f:
        line = line.split('\t')
        word = line[0].encode('utf-8')

I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the Unicode value, but this gives me the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)

Can someone tell me what's going on? Any help would be appreciated. Thanks!

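Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see; if it weren't, you wouldn't need to do anything with it. In Python 2, calling encode() on a byte string first decodes it implicitly with the ascii codec, and that implicit step is what fails on the byte 0xe8.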

word = unicode(line[0], "utf8")
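As a minimal sketch of that fix applied to the whole loop, assuming the file really is UTF-8 and tab-separated as in the question: the helper name first_words and the sample data are hypothetical. It uses the decode() method of the byte string, which does the same job as unicode(..., "utf8") in Python 2 and also works in Python 3.

```python
def first_words(lines, encoding="utf-8"):
    """Return the first tab-separated field of each line as Unicode text.

    `lines` is any iterable of byte strings, for example the object
    returned by urlopen(). The encoding is an assumption about the file.
    """
    words = []
    for raw in lines:
        text = raw.decode(encoding)  # bytes -> Unicode
        words.append(text.rstrip("\r\n").split("\t")[0])
    return words


# Hypothetical sample standing in for the downloaded file.
sample = [b"caf\xc3\xa9\tcoffee\n", b"\xc3\xa0\tto\n"]
print(first_words(sample))
```

Decoding the whole line before splitting means every field comes out as Unicode, not only the first one.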

You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, you know, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
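For what it's worth, here is a rough sketch of the "fish it out of the headers" idea. It is only an illustration: decode_body is a hypothetical helper, the regular expression is a simplification of how the charset parameter of a Content-Type header can look, and the fallback order (utf-8, then latin-1) just mirrors the two suspects named above. How you obtain the header value from the response depends on your Python version, so the function takes it as a plain string.

```python
import re


def decode_body(raw, content_type=None, fallbacks=("utf-8", "latin-1")):
    """Decode `raw` bytes, trying the header's charset first, then fallbacks."""
    candidates = []
    if content_type:
        # Simplified match for e.g. "text/plain; charset=ISO-8859-1"
        match = re.search(r'charset="?([\w.:-]+)', content_type, re.I)
        if match:
            candidates.append(match.group(1))
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    raise ValueError("could not decode body with any of: %r" % (candidates,))


# The byte 0xe8 from the error message is "è" in latin-1.
print(repr(decode_body(b"\xe8", "text/plain; charset=ISO-8859-1")))
```

Note that latin-1 can decode any byte sequence, so with the default fallbacks the function always returns something; whether the result is the right text still depends on the guess being correct.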

Put the line below at the top of your source file to declare its encoding (see the PEP 263 link below).

# coding: utf-8

Supporting Unicode properly is not easy in Python. I also recommend reading the following:

http://www.python.org/dev/peps/pep-0263

http://groups.google.com/group/python-excel/browse_thread/thread/100ec019d3a2a1a9
