
Reading a text file in Unicode from a URL?

I'm trying to use urllib and urllib2 to read from a text file that has French characters in it, like "é", "à", and so on.

def load(url):
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request(url)

    f = urlopen(req)
    f.readline()  # discard the first line

    for line in f:
        line = line.split('\t')
        word = line[0].encode('utf-8')  # this call raises the error below

I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the Unicode value, but this gives me the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)

Can someone tell me what's going on? Any help would be appreciated. Thanks!

Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see. If it weren't, you wouldn't need to do anything with it.

word = unicode(line[0], "utf8")
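As for why the traceback names the 'ascii' codec even though the code asked for 'utf-8': in Python 2, calling encode() on a byte string first decodes it implicitly with the default ASCII codec, and that hidden step is what fails. The sketch below performs that step by hand and then decodes properly. The sample bytes and the helper name first_field are made up for illustration, the data is assumed to be UTF-8, and the code is written so that it also runs on Python 3.

```python
# Hypothetical sample line as raw bytes: u"élève<TAB>1" encoded as UTF-8.
raw_line = u"\xe9l\xe8ve\t1\n".encode("utf-8")

# Python 2's str.encode() starts with an implicit ASCII decode.
# Doing that step explicitly shows where the UnicodeDecodeError comes from.
try:
    raw_line.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)


def first_field(raw, encoding="utf-8"):
    """Decode one raw tab-separated line and return its first column as text."""
    text = raw.decode(encoding)
    return text.rstrip("\r\n").split("\t")[0]


word = first_field(raw_line)
```

Decoding the whole line before splitting it keeps every later step working on text rather than bytes.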

You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
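A minimal sketch of that guessing order, assuming you want to try a declared charset first, then utf8, then latin1. The function name decode_body and its parameters are hypothetical. Note that 0xe8, the byte in the traceback, is how "è" is stored in Latin-1, which hints (but does not prove) that the file in the question is latin1 rather than utf8.

```python
def decode_body(raw, declared=None, candidates=("utf-8", "latin-1")):
    """Decode raw bytes, returning (text, encoding_used).

    `declared` is the charset taken from the HTTP headers, if any.
    With urllib2 (Python 2) that would be f.headers.getparam('charset');
    on Python 3, response.headers.get_content_charset().
    """
    order = ([declared] if declared else []) + list(candidates)
    for encoding in order:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know: try the next.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# latin-1 maps every byte to a character, so as the last candidate it never
# fails. It can still produce the wrong characters if the guess is wrong.
text, used = decode_body(b"premi\xe8re\t1\n")
```

Here the lone 0xe8 byte is not valid UTF-8, so the sketch falls through to latin-1.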

Put the line below at the top of your source file.

# coding: utf-8

Supporting Unicode properly is not easy in Python. I also recommend these articles:

http://www.python.org/dev/peps/pep-0263

http://groups.google.com/group/python-excel/browse_thread/thread/100ec019d3a2a1a9
