File open error when using the utf-8 codec in Python

I ran the following code on Windows XP with Python 2.6.4, but it raises an IOError.

How can I open a file whose name is encoded as UTF-8?

>>> open( unicode('한글.txt', 'euc-kr').encode('utf-8') )

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
IOError: [Errno 22] invalid mode ('r') or filename: '\xed\x95\x9c\xea\xb8\x80.txt'

The following code, however, works normally:

>>> open( unicode('한글.txt', 'euc-kr') )
<open file u'\ud55c\uae00.txt', mode 'r' at 0x01DD63E0>

The C runtime interface that Windows exposes to Python uses the system code page to encode filenames. Unlike on OS X and modern Linux versions, on Windows the system code page is never UTF-8, so the UTF-8 byte string won't be any good.
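To make the mismatch concrete, here is a small sketch comparing the bytes each encoding produces for the name in the question. It assumes the Korean system code page is cp949; `name` is just the question's filename written with escapes.

```python
# -*- coding: utf-8 -*-
# The filename from the question, written with escapes so the encoding
# of this source file does not matter.
name = u'\ud55c\uae00.txt'

utf8_bytes = name.encode('utf-8')    # what the failing call passed to open()
cp949_bytes = name.encode('cp949')   # assumed system code page on Korean Windows

# The two byte strings differ, so a C runtime that expects code-page bytes
# cannot match the UTF-8 form to the file on disk.
print(repr(utf8_bytes))
print(repr(cp949_bytes))
print(utf8_bytes == cp949_bytes)
```

The UTF-8 form is the same byte string that appears in the IOError message in the question.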

You could encode the filename to the current code page using .encode('mbcs'), which in your case is probably equivalent to .encode('cp949'). To make it compatible with other platforms where filenames are UTF-8, you could look up sys.getfilesystemencoding(), which will give you utf-8 there or mbcs on Windows.
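A rough sketch of that lookup, for illustration only. The helper name `encode_filename` and the UTF-8 fallback are my own assumptions, not anything from the standard library.

```python
import sys


def encode_filename(name):
    """Encode a Unicode filename to bytes using the filesystem encoding."""
    # On some Python 2 builds getfilesystemencoding() can return None,
    # so this sketch falls back to UTF-8 (an assumption).
    encoding = sys.getfilesystemencoding() or 'utf-8'
    return name.encode(encoding)


# Show what this platform reports, and the resulting byte string.
print(sys.getfilesystemencoding())
print(repr(encode_filename(u'notes.txt')))
```

Note that Python 3.6 and later report utf-8 rather than mbcs on Windows, so the mbcs result described here applies to Python 2 and older Python 3 releases.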

However, whilst cp949 would work for Korean characters, it would break on anything outside the repertoire of that code page (an extended version of EUC-KR).
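You can check in advance whether a given name fits the code page. This is only a sketch: `can_encode` is a hypothetical helper, and the Thai name is an arbitrary example of text that cp949 does not cover.

```python
def can_encode(name, encoding):
    """Return True if every character of name exists in the given encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


korean_name = u'\ud55c\uae00.txt'      # the filename from the question
thai_name = u'\u0e44\u0e17\u0e22.txt'  # Thai letters, not part of cp949

print(can_encode(korean_name, 'cp949'))
print(can_encode(thai_name, 'cp949'))
```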

So another approach is to keep your filenames as Unicode. On Windows this will use the Unicode-native interfaces to pass filenames to Windows in the UTF-16LE encoding it uses internally. (See PEP 277 for more on this feature.)

This generally still works on other platforms too: Linux and OS X should silently encode the Unicode filenames to UTF-8 for you. This may fail more often in older Python versions, but it's the default way to handle filenames in Python 3 (where the default string type has changed to Unicode).
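A minimal sketch of the Unicode-filename approach. It assumes a filesystem encoding that can represent the name (UTF-8 on most Linux and OS X setups); `roundtrip_unicode_name` is a hypothetical helper, and a temporary directory is used so nothing is left behind.

```python
import io
import os
import shutil
import tempfile


def roundtrip_unicode_name(directory, name=u'\ud55c\uae00.txt'):
    """Write and re-read a file whose name is passed as a Unicode string."""
    path = os.path.join(directory, name)
    # io.open accepts Unicode paths on both Python 2 and Python 3.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(u'hello')
    with io.open(path, 'r', encoding='utf-8') as handle:
        return handle.read()


scratch = tempfile.mkdtemp()
try:
    print(roundtrip_unicode_name(scratch))
finally:
    shutil.rmtree(scratch)
```

The filename is never encoded by hand here; Python picks the right interface or encoding for the platform.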

The traps to watch out for when using Unicode filenames on Python 2 are:

  • if os.path.supports_unicode_filenames is False, as it is on Linux, the functions that return filenames, such as os.listdir, may give you byte strings (for example when called with a byte-string path). You'd have to detect that and decode them using sys.getfilesystemencoding().

  • if you have a file on Linux/OS X with a name that's not a valid UTF-8 string, you won't be able to get a Unicode filename for it (UnicodeDecodeError if you try). It's a bit of a corner case, but it can lead to annoying inaccessible files.
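Both traps can be handled with a small wrapper along these lines. This is a sketch: `to_unicode_name` is a hypothetical helper, and returning None for undecodable names is just one possible policy.

```python
import sys


def to_unicode_name(name, encoding=None):
    """Return name as Unicode, or None if its bytes cannot be decoded."""
    if not isinstance(name, bytes):
        return name  # already a Unicode string
    # Default to the filesystem encoding, with UTF-8 as an assumed fallback.
    encoding = encoding or sys.getfilesystemencoding() or 'utf-8'
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        # The name is not valid in this encoding; the caller has to keep
        # using the raw byte string to reach the file.
        return None


print(repr(to_unicode_name(b'\xed\x95\x9c\xea\xb8\x80.txt', 'utf-8')))
print(repr(to_unicode_name(b'\xff\xfe.txt', 'utf-8')))
```

You would apply it to each entry returned by os.listdir, and keep the original byte string around for any name that comes back as None.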

Incidentally, about this call:

open(unicode('한글.txt', 'euc-kr'))

You would probably want to say 'cp949' there (as the Windows Korean code page has minor differences from EUC-KR). Or, more generally, 'mbcs', which gives you the system code page, presumably the same one your console uses for input. Anyway, I don't know about PyShell, but normally if the above works then you should be able to type it directly:

open(u'한글')
