Python: performance of readline vs. readlines

Question

I know the general difference between readlines and readline on file objects. But I was curious how their performance compares, so I ran a test.

import timeit

with open('test.txt', 'w') as f:
    # write() takes one string; writelines() would iterate over it character by character
    f.write('\n'.join("Just a test case\tJust a test case2\tJust a test case3" for i in range(1000000)))

def a1():
    with open('test.txt', 'r') as f:
        for text in f.readlines():
            pass
def a2():
    with open('test.txt', 'r') as f:
        text = f.readline()
        while text:
            text = f.readline()


print(timeit.timeit(a1, number=100))
print(timeit.timeit(a2, number=100))

$ python readline_vs_readlines.py
38.410646996984724
35.876863296027295

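But why is that? I assumed I/O is the expensive part, so issuing many reads should take longer than reading everything into memory at once. Given what I see here, why do we still use readlines? If the file is large and there is no speed gain, doesn't it just cost us a lot of memory?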

Answer 1

In fact, readlines falls even further behind when readline is driven from a for loop:

import timeit

with open('test.txt', 'w') as fp:
    print(*("Just a test case" for i in range(1000000)), sep='\n', file=fp)


def a1():
    with open('test.txt', 'r') as f:
        for _ in f.readlines():
            pass


def a2():
    with open('test.txt', 'r') as f:
        while _ := f.readline():
            pass


def a3():
    with open('test.txt', 'r') as f:
        for _ in iter(f.readline, ''):
            pass


print(timeit.timeit(a1, number=50))
print(timeit.timeit(a2, number=50))
print(timeit.timeit(a3, number=50))

Output:

10.9471131
10.282239
9.3618919

Compared on the same kind of for loop, the a3 approach is clearly faster than a1, although it goes against the Zen of Python.

The reason lies in the source code, _pyio.py and iobase.c:

The pure-Python _pyio.py is used when the C implementation, iobase.c, is not available.

def readlines(self, hint=None):
    """Return a list of lines from the stream.
    hint can be specified to control the number of lines read: no more
    lines will be read if the total size (in bytes/characters) of all
    lines so far exceeds hint.
    """
    if hint is None or hint <= 0:
        return list(self)
    n = 0
    lines = []
    for line in self:
        lines.append(line)
        n += len(line)
        if n >= hint:
            break
    return lines

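readlines builds its list by iterating over the file and appending each line as it is read, so both methods go through the same readline machinery. It does not pull in the whole file with a single read.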
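To make this concrete, here is a minimal sketch that checks readlines against plain iteration. The helper name read_both_ways and the sample data are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def read_both_ways(path):
    """Return (readlines result, list-of-iteration result) for one file."""
    with open(path, "r") as f:
        via_readlines = f.readlines()
    with open(path, "r") as f:
        # Same thing _pyio.readlines does when no hint is given.
        via_iteration = list(f)
    return via_readlines, via_iteration


# Build a small throwaway file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    sample_path = tmp.name

via_readlines, via_iteration = read_both_ways(sample_path)
os.remove(sample_path)

print(via_readlines == via_iteration)
```

Both calls return the same list, which is what you would expect if readlines is just iteration plus list building.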

The same is true of the C implementation, iobase.c:

while (1) {
    Py_ssize_t line_length;
    PyObject *line = PyIter_Next(it);
    if (line == NULL) {
        if (PyErr_Occurred()) {
            goto error;
        }
        else
            break; /* StopIteration raised */
    }

    if (PyList_Append(result, line) < 0) {
        Py_DECREF(line);
        goto error;
    }
    line_length = PyObject_Size(line);
    Py_DECREF(line);
    if (line_length < 0) {
        goto error;
    }
    if (line_length > hint - length)
        break;
    length += line_length;
}

As you can see, it calls PyList_Append to append each line to the result list.
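The effect of the hint check in that loop can be observed from Python. This is a small sketch using io.StringIO in place of a real file, which is an assumption made only to keep the example self-contained; the sample text is hypothetical.

```python
import io

text = "a\nb\nc\n"

# No hint: every line is collected.
all_lines = io.StringIO(text).readlines()

# hint=1: the first line is appended, then the loop stops because
# that line alone already exceeds the hint.
some_lines = io.StringIO(text).readlines(1)

print(all_lines)
print(some_lines)
```

Note that the line crossing the limit is still appended before the loop breaks, so hint is a soft limit rather than an exact cut-off.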

P.S. A reminder: join is a poor choice for this many strings. Use print with sep and pass the file parameter instead. Don't join strings where you don't need to.
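Another way to avoid building one huge string is to hand writelines a generator. This is only a sketch: write_lines is a hypothetical helper, io.StringIO stands in for a real file, and no claim is made here about its speed relative to print.

```python
import io


def write_lines(fp, lines):
    """Write each item followed by a newline, without joining first."""
    fp.writelines(f"{line}\n" for line in lines)


buf = io.StringIO()
write_lines(buf, ("Just a test case" for _ in range(3)))

print(repr(buf.getvalue()))
```

Unlike '\n'.join(...), this version also terminates the last line with a newline.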

Answer 2

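readlines reads all of the text into memory before the loop starts, whereas readline reads one buffer at a time as the loop runs. The difference between the two is better explained in terms of memory use.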
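The memory difference can be sketched with two line counters. Both function names are hypothetical and the temporary file holds made-up sample data. The streaming version iterates over the file object directly, which is the usual idiom and keeps only one line alive at a time.

```python
import os
import tempfile


def count_lines_streaming(path):
    """Count lines while holding only the current line in memory."""
    count = 0
    with open(path, "r") as f:
        for _ in f:  # the file object yields one line per iteration
            count += 1
    return count


def count_lines_in_memory(path):
    """Count lines by first materializing every line in a list."""
    with open(path, "r") as f:
        return len(f.readlines())


# Throwaway file with 1000 short lines (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(1000))
    demo_path = tmp.name

streaming_count = count_lines_streaming(demo_path)
in_memory_count = count_lines_in_memory(demo_path)
os.remove(demo_path)

print(streaming_count, in_memory_count)
```

Both functions return the same count. What differs is peak memory, which grows with file size only in the readlines version.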
