
In Python 2.7, why are strings written faster in text mode than in binary mode?

The following example script writes some strings to a file using either text mode ("w") or binary mode ("wb"):

import itertools as it
from string import ascii_lowercase
import time

characters = it.cycle(ascii_lowercase)
mode = 'w'
# mode = 'wb'  # using this mode takes longer to execute
t1 = time.clock()
with open('test.txt', mode) as fh:
    for __ in xrange(10**7):
        fh.write(''.join(it.islice(characters, 0, 50)))
t2 = time.clock()
print 'Mode: {}, time elapsed: {:.2f}'.format(mode, t2 - t1)

With Python 2, the script executes in 24.89 +/- 0.02 s using "w" mode, while with "wb" it takes 25.67 +/- 0.02 s. These are the timings for three consecutive runs of each mode:

mode_w  = [24.91, 24.86, 24.91]
mode_wb = [25.68, 25.64, 25.69]
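
To put the gap in perspective, the arithmetic can be spelled out. This is only a small sketch over the six timings listed above; the helper name `mean` and the variable names `gap` and `per_call_ns` are introduced here for illustration.

```python
# Timings copied from the three runs listed above (seconds).
mode_w = [24.91, 24.86, 24.91]
mode_wb = [25.68, 25.64, 25.69]
n_writes = 10**7  # number of fh.write() calls in the benchmark loop


def mean(values):
    """Arithmetic mean; float() keeps the division exact on Python 2 as well."""
    return sum(values) / float(len(values))


gap = mean(mode_wb) - mean(mode_w)   # total extra time spent in binary mode
per_call_ns = gap / n_writes * 1e9   # extra time per write() call, in ns

print('mean w:  {:.2f} s'.format(mean(mode_w)))
print('mean wb: {:.2f} s'.format(mean(mode_wb)))
print('gap: {:.2f} s total, about {:.0f} ns per write'.format(gap, per_call_ns))
```

The difference is roughly 0.78 s over 10**7 calls, or a little under 80 ns per write, which is about 3% of the total run time. Most of each iteration is spent building the string, not writing it.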

I'm surprised by these results, since Python 2 stores its strings as byte strings anyway, so neither "w" nor "wb" needs to perform any encoding work. Text mode, on the other hand, needs to perform additional work such as checking for line endings:

The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading.

So, if anything, I'd expect text mode "w" to take longer than binary mode "wb". However, the opposite seems to be the case. Why is this?
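
The newline translation mentioned in the documentation quote can be made visible with a short sketch. Note the assumptions: it uses `io.open` with an explicit `newline` argument so that the result is the same on every platform, whereas the built-in `open` of Python 2 leaves any translation to the platform's C library. The helper name `written_bytes` is made up for this example.

```python
import io
import os
import tempfile


def written_bytes(text, newline):
    """Write `text` in text mode and return the raw bytes that land on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline='\r\n' forces Windows-style translation on any platform;
        # newline='' switches translation off.
        with io.open(path, 'w', encoding='ascii', newline=newline) as fh:
            fh.write(text)
        with io.open(path, 'rb') as fh:
            return fh.read()
    finally:
        os.remove(path)


translated = written_bytes(u'one\ntwo\n', newline='\r\n')
untouched = written_bytes(u'one\ntwo\n', newline='')
print(repr(translated))
print(repr(untouched))
```

With translation enabled each '\n' reaches the file as '\r\n'; with it disabled the bytes are written unchanged, which is what binary mode always does.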


Tested with CPython 2.7.12.

Looking at the source code for file.write reveals the following difference between binary mode and text mode:

if (f->f_binary) {
    if (!PyArg_ParseTuple(args, "s*", &pbuf))
        return NULL;
    s = pbuf.buf;
    n = pbuf.len;
}
else {
    PyObject *text;
    if (!PyArg_ParseTuple(args, "O", &text))
        return NULL;

    if (PyString_Check(text)) {
        s = PyString_AS_STRING(text);
        n = PyString_GET_SIZE(text);
    }

Here f->f_binary is set when the mode passed to open includes "b". In this case Python fills an auxiliary buffer (pbuf) from the string object and then takes the data s and length n from that buffer. I suppose this is for compatibility (generality) with other objects that support the buffer interface.

Here PyArg_ParseTuple(args, "s*", &pbuf) creates the corresponding buffer. This operation costs additional compute time, whereas in text mode Python simply parses the argument as an object ("O") at almost no cost. Retrieving the data and length via

s = PyString_AS_STRING(text);
n = PyString_GET_SIZE(text);

is also performed when the buffer is created.

This means that binary mode carries the additional overhead of creating an auxiliary buffer from the string object. For that reason, execution takes longer in binary mode.
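
One way to check this explanation is to time only the write calls, with the payload built once outside the loop so that the string construction from the question's script does not mask the difference. The following is a rough sketch, not a rigorous benchmark; `time_writes` and its parameters are names invented here. It runs on both Python 2 and Python 3, but only the CPython 2.7 numbers say anything about the code path discussed above, because Python 3 implements files in the separate io module and encodes text on write.

```python
import os
import tempfile
import timeit


def time_writes(mode, n_writes=10**5, size=50):
    """Return (seconds spent in fh.write() calls, resulting file size)."""
    # Build the payload once so string construction is not part of the timing.
    # Python 3 needs bytes in binary mode; on Python 2 both literals are str.
    chunk = b'a' * size if 'b' in mode else 'a' * size
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, mode) as fh:
            write = fh.write  # look up the bound method once
            start = timeit.default_timer()
            for _ in range(n_writes):
                write(chunk)
            elapsed = timeit.default_timer() - start
        return elapsed, os.path.getsize(path)
    finally:
        os.remove(path)


for mode in ('w', 'wb'):
    best = min(time_writes(mode)[0] for _ in range(3))
    print('Mode: {}, best of 3: {:.4f} s'.format(mode, best))
```

Taking the best of several runs reduces noise from the operating system. The payload contains no newline characters, so both modes write the same number of bytes.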
