
Difference between open and codecs.open in Python

There are two ways to open a text file in Python:

f = open(filename)

And

import codecs
f = codecs.open(filename, encoding="utf-8")

When is codecs.open preferable to open?

Since Python 2.6, a good practice is to use io.open(), which also takes an encoding argument, like the now obsolete codecs.open(). In Python 3, io.open is an alias for the open() built-in. So io.open() works in Python 2.6 and all later versions, including Python 3.4. See docs: http://docs.python.org/3.4/library/io.html
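As a minimal sketch of the point above, assuming Python 3 and a throwaway temporary file (the name `demo.txt` is made up for illustration), the following checks that `io.open` and the built-in `open` are the same object and round-trips some non-ASCII text through an explicit encoding:

```python
import io
import os
import tempfile

# In Python 3 the built-in open() and io.open() are the same object.
same_function = io.open is open

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# io.open() accepts an encoding argument on Python 2.6+ and on Python 3.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"na\u00efve fa\u00e7ade")

with io.open(path, encoding="utf-8") as f:
    text = f.read()

print(same_function)
print(text)
```

On Python 2.6 or 2.7 the same `io.open` calls should work, but `io.open is open` would not hold there, since the built-in `open` is a different function in Python 2.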

Now, for the original question: when reading text (including "plain text", HTML, XML and JSON) in Python 2 you should always use io.open() with an explicit encoding, or open() with an explicit encoding in Python 3. Doing so means you get correctly decoded Unicode, or get an error right off the bat, making it much easier to debug.
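To illustrate the "error right off the bat" behaviour, here is a small sketch, assuming a scratch file written in Latin-1 (the file name and sample text are invented for the example). Reading it with the wrong explicit encoding fails at read time instead of silently producing garbage:

```python
import io
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "latin1.txt")

# Write "café" as Latin-1 bytes: the é becomes the single byte 0xE9.
with io.open(path, "wb") as f:
    f.write(u"caf\u00e9".encode("latin-1"))

# Reading with the wrong explicit encoding raises immediately...
try:
    with io.open(path, encoding="utf-8") as f:
        f.read()
    failed_fast = False
except UnicodeDecodeError:
    failed_fast = True

# ...while the correct encoding yields properly decoded text.
with io.open(path, encoding="latin-1") as f:
    text = f.read()

print(failed_fast)
print(text)
```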

Pure ASCII "plain text" is a myth from the distant past. Proper English text uses curly quotes, em-dashes, bullets, € (euro signs) and even diaeresis (¨). Don't be naïve! (And let's not forget the Façade design pattern!)

Because pure ASCII is not a real option, open() without an explicit encoding is only useful to read binary files.

Personally, I always use codecs.open unless there's a clearly identified need to use open **. The reason is that there have been so many times when I've been bitten by having utf-8 input sneak into my programs. "Oh, I just know it'll always be ascii" tends to be an assumption that gets broken often.

Assuming 'utf-8' as the default encoding tends to be a safer default choice in my experience, since ASCII can be treated as UTF-8, but the converse is not true. And in those cases when I truly do know that the input is ASCII, then I still do codecs.open as I'm a firm believer in "explicit is better than implicit".
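The asymmetry between ASCII and UTF-8 can be checked directly. This is only an illustrative sketch with made-up sample strings:

```python
ascii_bytes = b"plain text"
utf8_bytes = u"na\u00efve".encode("utf-8")

# Every ASCII byte sequence is also valid UTF-8 and decodes to the same text.
ascii_as_utf8 = ascii_bytes.decode("utf-8")
ascii_as_ascii = ascii_bytes.decode("ascii")

# The converse does not hold: non-ASCII UTF-8 bytes are rejected by the ASCII codec.
try:
    utf8_bytes.decode("ascii")
    ascii_accepts_utf8 = True
except UnicodeDecodeError:
    ascii_accepts_utf8 = False

print(ascii_as_utf8 == ascii_as_ascii)
print(ascii_accepts_utf8)
```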

** - in Python 2.x. As the comment on the question states, in Python 3 open replaces codecs.open.

In Python 2 there are unicode strings and bytestrings. If you just use bytestrings, you can read/write to a file opened with open() just fine. After all, the strings are just bytes.

The problem comes when, say, you have a unicode string and you do the following:

>>> example = u'Μου αρέσει Ελληνικά'
>>> open('sample.txt', 'w').write(example)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

So here obviously you either explicitly encode your unicode string in utf-8 or you use codecs.open to do it for you transparently.
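Both options can be sketched side by side. This example assumes Python 3 (where the source file is UTF-8 by default) and uses io.open in place of codecs.open for the transparent variant; the file names are invented for illustration:

```python
import io
import os
import tempfile

example = u'Μου αρέσει Ελληνικά'

# Hypothetical scratch files for the two approaches.
tmpdir = tempfile.mkdtemp()
manual_path = os.path.join(tmpdir, "manual.txt")
auto_path = os.path.join(tmpdir, "auto.txt")

# Option 1: encode explicitly and write the resulting bytes in binary mode.
with io.open(manual_path, "wb") as f:
    f.write(example.encode("utf-8"))

# Option 2: let the file object encode transparently on write.
with io.open(auto_path, "w", encoding="utf-8") as f:
    f.write(example)

# Both files should contain exactly the same bytes.
with io.open(manual_path, "rb") as f:
    manual_bytes = f.read()
with io.open(auto_path, "rb") as f:
    auto_bytes = f.read()

print(manual_bytes == auto_bytes)
```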

If you're only ever using bytestrings then no problems:

>>> example = 'Μου αρέσει Ελληνικά'
>>> open('sample.txt', 'w').write(example)
>>>

It gets more involved than this because when you concatenate a unicode string and a bytestring with the + operator you get a unicode string. Easy to get bitten by that one.

Also codecs.open doesn't like bytestrings with non-ASCII chars being passed in:

codecs.open('test', 'w', encoding='utf-8').write('Μου αρέσει')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

The advice about strings for input/output is normally "convert to unicode as early as possible and back to bytestrings as late as possible". Using codecs.open allows you to do the latter very easily.

Just be careful that you are giving it unicode strings and not bytestrings that may have non-ASCII characters.
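The "decode early, encode late" advice might look like the following sketch. The helper name `shout` and its behaviour (upper-casing the text) are purely hypothetical stand-ins for whatever processing a real program does:

```python
def shout(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode early, work on text, encode late."""
    text = raw_bytes.decode(encoding)    # bytes -> unicode at the input boundary
    processed = text.upper()             # all processing happens on unicode text
    return processed.encode(encoding)    # unicode -> bytes at the output boundary

# Simulated input bytes, as they might arrive from a file or socket.
incoming = u"\u03bc\u03bf\u03c5".encode("utf-8")
outgoing = shout(incoming)

print(outgoing.decode("utf-8"))
```

Keeping bytes at the edges and unicode in the middle means the encoding is named in exactly two places, which makes a wrong assumption easier to spot.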

You would use the codecs module when you need to open a file with a specific encoding.

codecs.open, I suppose, is just a remnant from the Python 2 days when the built-in open had a much simpler interface and fewer capabilities. In Python 2, the built-in open doesn't take an encoding argument, so if you wanted to use something other than binary mode or the default encoding, codecs.open was supposed to be used.

In Python 2.6, the io module came along to make things a bit simpler. According to the official documentation:

New in version 2.6.

The io module provides the Python interfaces to stream handling.
Under Python 2.x, this is proposed as an alternative to the
built-in file object, but in Python 3.x it is the default
interface to access files and streams.

Having said that, the only use I can think of for codecs.open in the current scenario is backward compatibility. In all other scenarios (unless you are using Python < 2.6) it is preferable to use io.open. Also, in Python 3.x io.open is the same as the built-in open.

Note:

There is a syntactical difference between codecs.open and io.open as well.

codecs.open:

open(filename, mode='rb', encoding=None, errors='strict', buffering=1)

io.open:

open(file, mode='r', buffering=-1, encoding=None,
     errors=None, newline=None, closefd=True, opener=None)
  • When you want to load a binary file, use f = io.open(filename, 'rb'). The mode needs a read, write or append letter alongside 'b'; 'b' on its own is rejected.

  • For opening a text file, always use f = io.open(filename, encoding='utf-8') with explicit encoding.
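A short sketch of the two cases, assuming a scratch file with invented contents. The binary read uses the mode string 'rb' (io.open requires a read, write or append letter in addition to 'b'):

```python
import io
import os
import tempfile

# Hypothetical scratch file containing one non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9")

# Binary mode: no decoding happens, read() returns raw bytes.
with io.open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: read() returns decoded text.
with io.open(path, encoding="utf-8") as f:
    text = f.read()

print(repr(raw))
print(repr(text))
```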

In Python 3, however, open does the same thing as io.open and can be used instead.

Note: codecs.open is planned to be deprecated and replaced by io.open, which was introduced in Python 2.6. I would only use it if code needs to be compatible with earlier Python versions. For more information on codecs and unicode in Python see the Unicode HOWTO.

When you are handling text files and want transparent encoding to and decoding from Unicode objects.
