
Python: Performance between readline and readlines

I know the general difference between the readlines and readline methods of a file object, but I'm more curious about how their performance differs, so I made a test:

import timeit

# Generate a test file of one million tab-separated lines
with open('test.txt', 'w') as f:
    f.writelines('\n'.join("Just a test case\tJust a test case2\tJust a test case3" for i in range(1000000)))

def a1():
    # readlines: read all lines into a list, then iterate over it
    with open('test.txt', 'r') as f:
        for text in f.readlines():
            pass

def a2():
    # readline: read one line per call until an empty string signals EOF
    with open('test.txt', 'r') as f:
        text = f.readline()
        while text:
            text = f.readline()


print(timeit.timeit(a1, number=100))
print(timeit.timeit(a2, number=100))
$python readline_vs_readlines.py
38.410646996984724
35.876863296027295

But why is that? I thought I/O was the expensive part, so making many small read calls instead of reading everything into memory at once should take more time. Given these numbers, why use readlines at all? It costs an enormous amount of memory on large files, with no gain in speed.

readlines is actually even slower once readline is driven by a for loop as well:

import timeit

with open('test.txt', 'w') as fp:
    print(*("Just a test case" for i in range(1000000)), sep='\n', file=fp)


def a1():
    # readlines: materialize every line into a list first, then loop over it
    with open('test.txt', 'r') as f:
        for _ in f.readlines():
            pass


def a2():
    # readline in a while loop, using the walrus operator as the loop condition
    with open('test.txt', 'r') as f:
        while _ := f.readline():
            pass


def a3():
    # readline driven by a for loop via iter() with '' as the sentinel
    with open('test.txt', 'r') as f:
        for _ in iter(f.readline, ''):
            pass


print(timeit.timeit(a1, number=50))
print(timeit.timeit(a2, number=50))
print(timeit.timeit(a3, number=50))

output:

10.9471131
10.282239
9.3618919

When compared using the same kind of for loop, a3 is clearly faster than a1, even though the iter(f.readline, '') idiom goes against the Zen of Python.
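For completeness, the most idiomatic option is to iterate over the file object itself, which streams lines lazily through the same buffering machinery. A minimal variant you could add to the benchmark above (a4 is just an illustrative name, not part of the original test):

def a4():
    # Idiomatic: the file object is itself an iterator over its lines,
    # so no explicit readline()/readlines() call is needed.
    with open('test.txt', 'r') as f:
        for _ in f:
            pass


print(timeit.timeit(a4, number=50))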


The reason for this lies in the source code, _pyio.py and iobase.c:

When the C implementation in iobase.c is unavailable, the pure-Python _pyio.py is used instead.
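On CPython the private _pyio module can usually be imported directly, so you can benchmark the pure-Python path yourself. A rough sketch under that assumption (not part of the original answer; expect it to be much slower than the built-in C path):

import timeit
import _pyio  # pure-Python implementation of the io module (CPython)


def pure_python_readlines():
    # Same loop as a1, but through the pure-Python file object
    with _pyio.open('test.txt', 'r') as f:
        for _ in f.readlines():
            pass


print(timeit.timeit(pure_python_readlines, number=5))

Here is readlines from _pyio.py: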

def readlines(self, hint=None):
    """Return a list of lines from the stream.
    hint can be specified to control the number of lines read: no more
    lines will be read if the total size (in bytes/characters) of all
    lines so far exceeds hint.
    """
    if hint is None or hint <= 0:
        return list(self)
    n = 0
    lines = []
    for line in self:
        lines.append(line)
        n += len(line)
        if n >= hint:
            break
    return lines

It appends each line as it reads it, using the same line-reading machinery that readline relies on; it is not loading the entire file in a single read at all.
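In other words, with no hint readlines is just list(self), so the following should all build the same list. A small illustrative sketch (not from the original post), run against the test.txt generated earlier:

with open('test.txt', 'r') as f:
    a = f.readlines()           # list built by appending line after line

with open('test.txt', 'r') as f:
    b = list(f)                 # equivalent: consumes the same line iterator

with open('test.txt', 'r') as f:
    c = [line for line in f]    # the same thing again, spelled out

assert a == b == c

# hint caps the total amount read: stop once the running total of
# line lengths has reached roughly 64 characters
with open('test.txt', 'r') as f:
    first_chunk = f.readlines(64)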

The same holds for the C implementation in iobase.c:

while (1) {
    Py_ssize_t line_length;
    PyObject *line = PyIter_Next(it);
    if (line == NULL) {
        if (PyErr_Occurred()) {
            goto error;
        }
        else
            break; /* StopIteration raised */
    }

    if (PyList_Append(result, line) < 0) {
        Py_DECREF(line);
        goto error;
    }
    line_length = PyObject_Size(line);
    Py_DECREF(line);
    if (line_length < 0) {
        goto error;
    }
    if (line_length > hint - length)
        break;
    length += line_length;
}

As you can see, it calls PyList_Append to append each line to the result list.


PS: just a reminder that for this many strings, joining is still half as bad as concatenating, since the entire result has to be materialized in memory before it is written; using print with the sep and file parameters is recommended. Do not join strings where it isn't needed.
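A small side-by-side of the two write patterns (the file names here are illustrative); the join version has to build one huge string before anything hits the disk, while the print version hands the lines over one by one:

# builds a single string of roughly 17 MB in memory, then writes it
with open('test_join.txt', 'w') as f:
    f.write('\n'.join("Just a test case" for i in range(1000000)))

# no single giant joined string is ever created
with open('test_print.txt', 'w') as f:
    print(*("Just a test case" for i in range(1000000)), sep='\n', file=f)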

readlines reads all the text into memory before the loop starts, while readline automatically reads a buffer at a time as you loop. Here's a better explanation of the memory comparison between the two.
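To see that difference in numbers, one option is to measure peak allocations with tracemalloc. A minimal sketch (not from the answer above), run against the test.txt generated earlier:

import tracemalloc


def peak_kib(fn):
    # Peak memory allocated (in KiB) while fn runs
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024


def with_readlines():
    with open('test.txt', 'r') as f:
        for _ in f.readlines():          # whole file held as a list of lines
            pass


def with_readline():
    with open('test.txt', 'r') as f:
        for _ in iter(f.readline, ''):   # one buffered line at a time
            pass


print('readlines peak KiB:', peak_kib(with_readlines))
print('readline  peak KiB:', peak_kib(with_readline))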
