Difference between open and codecs.open in Python
There are two ways to open a text file in Python:
f = open(filename)
and
import codecs
f = codecs.open(filename, encoding="utf-8")
When is codecs.open preferable to open?
Since Python 2.6, a good practice is to use io.open(), which also takes an encoding argument, like the now obsolete codecs.open(). In Python 3, io.open is an alias for the open() built-in. So io.open() works in Python 2.6 and all later versions, including Python 3.4. See docs: http://docs.python.org/3.4/library/io.html
Now, for the original question: when reading text (including "plain text", HTML, XML and JSON) in Python 2 you should always use io.open() with an explicit encoding, or open() with an explicit encoding in Python 3. Doing so means you either get correctly decoded Unicode or get an error right off the bat, which makes debugging much easier.
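As a minimal sketch of that "correct text or an immediate error" behaviour: the file name and sample text below are made up for illustration, and io.open is used so that the same code also runs on Python 2.6+.

```python
import io
import os
import tempfile

# Hypothetical sample file containing non-ASCII text, stored as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "wb") as f:
    f.write(u"na\u00efve caf\u00e9".encode("utf-8"))

# Explicit, correct encoding: read() returns decoded Unicode text.
with io.open(path, encoding="utf-8") as f:
    text = f.read()

# Explicit but wrong encoding: the failure surfaces at read time,
# instead of silently producing garbage further down the line.
try:
    with io.open(path, encoding="ascii") as f:
        f.read()
    failed_early = False
except UnicodeDecodeError:
    failed_early = True
```

Either outcome is easy to reason about: the text is right, or the read fails at the file boundary where the wrong assumption was made.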
Pure ASCII "plain text" is a myth from the distant past. Proper English text uses curly quotes, em-dashes, bullets, € (euro signs) and even diaeresis (¨). Don't be naïve! (And let's not forget the Façade design pattern!)
Because pure ASCII is not a real option, open() without an explicit encoding is only useful to read binary files.
Personally, I always use codecs.open unless there's a clearly identified need to use open**. The reason is that there have been so many times when I've been bitten by having utf-8 input sneak into my programs. "Oh, I just know it'll always be ascii" tends to be an assumption that gets broken often.
Assuming 'utf-8' as the default encoding tends to be a safer default choice in my experience, since ASCII can be treated as UTF-8, but the converse is not true. And in those cases when I truly do know that the input is ASCII, I still use codecs.open, as I'm a firm believer in "explicit is better than implicit".
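The asymmetry between ASCII and UTF-8 can be checked directly on byte strings. This is only an illustrative sketch with made-up sample values:

```python
# Every ASCII byte sequence is also valid UTF-8 and decodes to the same text.
ascii_bytes = b"plain ASCII text"
as_utf8 = ascii_bytes.decode("utf-8")
as_ascii = ascii_bytes.decode("ascii")

# The converse does not hold: UTF-8 bytes for non-ASCII text are not valid ASCII.
utf8_bytes = u"na\u00efve".encode("utf-8")
try:
    utf8_bytes.decode("ascii")
    ascii_can_read_utf8 = True
except UnicodeDecodeError:
    ascii_can_read_utf8 = False
```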
** - in Python 2.x; as the comment on the question states, in Python 3 open replaces codecs.open
In Python 2 there are unicode strings and bytestrings. If you just use bytestrings, you can read/write to a file opened with open() just fine. After all, the strings are just bytes.
The problem comes when, say, you have a unicode string and you do the following:
>>> example = u'Μου αρέσει Ελληνικά'
>>> open('sample.txt', 'w').write(example)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
So here obviously you either explicitly encode your unicode string in utf-8 or you use codecs.open to do it for you transparently.
If you're only ever using bytestrings then no problems:
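The two options can be put side by side. The sketch below uses io.open in place of codecs.open (a substitution, since the other answers recommend io.open), and the file names are hypothetical:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

example = u"Μου αρέσει Ελληνικά"
tmpdir = tempfile.mkdtemp()
manual_path = os.path.join(tmpdir, "manual.txt")  # hypothetical file names
auto_path = os.path.join(tmpdir, "auto.txt")

# Option 1: encode explicitly and write the resulting bytes in binary mode.
with io.open(manual_path, "wb") as f:
    f.write(example.encode("utf-8"))

# Option 2: let the file object encode transparently on write.
with io.open(auto_path, "w", encoding="utf-8") as f:
    f.write(example)

# Both files end up holding the same bytes on disk.
with io.open(manual_path, "rb") as f:
    manual_bytes = f.read()
with io.open(auto_path, "rb") as f:
    auto_bytes = f.read()
```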
>>> example = 'Μου αρέσει Ελληνικά'
>>> open('sample.txt', 'w').write(example)
>>>
It gets more involved than this because when you concatenate a unicode string and a bytestring with the + operator you get a unicode string. Easy to get bitten by that one.
Also codecs.open doesn't like bytestrings with non-ASCII chars being passed in:
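For comparison, Python 3 refuses to mix the two types at all, which removes this particular trap. A small sketch, assuming Python 3 (under Python 2 the first concatenation would succeed through an implicit ASCII decode):

```python
text = u"caf\u00e9"
data = b" au lait"

# Python 3: str + bytes raises TypeError instead of decoding implicitly.
try:
    mixed = text + data
except TypeError:
    mixed = None

# The explicit route: decode the bytes first, then concatenate.
joined = text + data.decode("ascii")
```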
codecs.open('test', 'w', encoding='utf-8').write('Μου αρέσει')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
The advice about strings for input/output is normally "convert to unicode as early as possible and back to bytestrings as late as possible". Using codecs.open allows you to do the latter very easily.
Just be careful that you are giving it unicode strings and not bytestrings that may have non-ASCII characters.
You would use the codecs module when you need to open a file with a specific encoding.
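A minimal sketch of that "decode early, encode late" pattern, using io.open rather than codecs.open (a substitution) and hypothetical file names:

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")    # hypothetical input file
dst = os.path.join(tmpdir, "out.txt")   # hypothetical output file
with io.open(src, "wb") as f:
    f.write(u"\u03bc\u03bf\u03c5 \u03b1\u03c1\u03ad\u03c3\u03b5\u03b9".encode("utf-8"))

# Decode as early as possible: the file object hands back unicode text.
with io.open(src, encoding="utf-8") as f:
    text = f.read()

# All processing happens on unicode, never on raw bytes.
shouted = text.upper()

# Encode as late as possible: the file object turns text into bytes on write.
with io.open(dst, "w", encoding="utf-8") as f:
    f.write(shouted)

with io.open(dst, "rb") as f:
    out_bytes = f.read()
```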
codecs.open, I suppose, is just a remnant from the Python 2 days when the built-in open had a much simpler interface and fewer capabilities. In Python 2, the built-in open doesn't take an encoding argument, so if you wanted something other than binary mode or the default encoding, you were supposed to use codecs.open.
In Python 2.6, the io module arrived to make things a bit simpler. According to the official documentation:
New in version 2.6.
The io module provides the Python interfaces to stream handling.
Under Python 2.x, this is proposed as an alternative to the
built-in file object, but in Python 3.x it is the default
interface to access files and streams.
Having said that, the only use I can think of for codecs.open in the current scenario is backward compatibility. In all other scenarios (unless you are using Python < 2.6) it is preferable to use io.open. Also, in Python 3.x io.open is the same as the built-in open.
Note:
There is a syntactical difference between codecs.open and io.open as well.
codecs.open:
open(filename, mode='rb', encoding=None, errors='strict', buffering=1)
io.open:
open(file, mode='r', buffering=-1, encoding=None,
errors=None, newline=None, closefd=True, opener=None)
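One practical consequence of the two signatures is that the third positional argument means different things: encoding for codecs.open, buffering for io.open. The sketch below uses a made-up temporary file to show why passing encoding by keyword is the safer habit. The warnings filter is a precaution, since codecs.open may emit a DeprecationWarning on recent Python versions.

```python
import codecs
import io
import os
import tempfile
import warnings

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9")

# codecs.open: the third positional argument is the encoding.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    with codecs.open(path, "r", "utf-8") as f:
        via_codecs = f.read()

# io.open: the third positional argument is buffering, so a string is rejected.
try:
    io.open(path, "r", "utf-8")
    positional_ok = True
except TypeError:
    positional_ok = False

# Passing encoding by keyword works the same way with both functions.
with io.open(path, "r", encoding="utf-8") as f:
    via_io = f.read()
```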
When you want to load a binary file, use f = io.open(filename, 'rb'). (The mode must name an operation such as 'r' or 'w'; 'b' on its own raises a ValueError.)
For opening a text file, always use f = io.open(filename, encoding='utf-8') with explicit encoding.
In Python 3, however, open does the same thing as io.open and can be used instead.
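To make the binary/text distinction concrete, here is a small sketch with a hypothetical file. Binary mode returns the raw bytes, while text mode with an explicit encoding returns decoded characters:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "price.txt")  # hypothetical file
with io.open(path, "wb") as f:
    f.write(u"\u20ac5".encode("utf-8"))  # a euro sign followed by "5"

# Binary mode: no decoding, the result is a bytes object.
with io.open(path, "rb") as f:
    raw = f.read()

# Text mode with explicit encoding: the result is unicode text.
with io.open(path, encoding="utf-8") as f:
    text = f.read()
```

The euro sign takes three bytes in UTF-8 but is a single character once decoded, so raw and text differ in length.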
Note: codecs.open is slated for deprecation in favor of io.open, which was introduced in Python 2.6. I would only use it if code needs to be compatible with earlier Python versions. For more information on codecs and unicode in Python see the Unicode HOWTO.
When you are working with text files and want transparent encoding and decoding into Unicode objects.