
Is there a difference between “file.readlines()”, “list(file)” and “file.read().splitlines(True)”?

What is the difference between:

with open("file.txt", "r") as f:
    data = list(f)

Or:

with open("file.txt", "r") as f:
    data = f.read().splitlines(True)

Or:

with open("file.txt", "r") as f:
    data = f.readlines()

They seem to produce the exact same output. Is one better (or more Pythonic) than the others?

Explicit is better than implicit, so I prefer:

with open("file.txt", "r") as f:
    data = f.readlines()

But, when possible, the most Pythonic approach is to use the file iterator directly, without loading all the content into memory, e.g.:

with open("file.txt", "r") as f:
    for line in f:
        my_function(line)
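As a minimal sketch of that pattern, the helper below counts non-blank lines while holding only one line in memory at a time. The name count_non_blank_lines and the temporary sample file are assumptions made up for the illustration, not part of the question.

```python
import os
import tempfile


def count_non_blank_lines(path):
    """Count lines with visible content, reading one line at a time."""
    count = 0
    with open(path, "r") as f:
        for line in f:  # the file object yields one line per iteration
            if line.strip():
                count += 1
    return count


# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\n\nsecond\n   \nthird\n")
    sample_path = tmp.name

non_blank = count_non_blank_lines(sample_path)
os.remove(sample_path)
print(non_blank)
```

No list of lines is ever built here, which is the point of iterating over the file directly.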

TL;DR

Considering you need a list to manipulate the lines afterwards, your three proposed solutions are all syntactically valid. There is no better (or more Pythonic) solution, especially since they are all recommended by the official Python documentation. So, choose the one you find the most readable and be consistent with it throughout your code. If performance is a deciding factor, see my timeit analysis below.


Here is the timeit (10000 loops, ~20 lines in test.txt):

import timeit

def foo():
    with open("test.txt", "r") as f:
        data = list(f)

def foo1():
    with open("test.txt", "r") as f:
        data = f.read().splitlines(True)

def foo2():
    with open("test.txt", "r") as f:
        data = f.readlines()

print(timeit.timeit(stmt=foo, number=10000))
print(timeit.timeit(stmt=foo1, number=10000))
print(timeit.timeit(stmt=foo2, number=10000))

>>>> 1.6370758459997887
>>>> 1.410844805999659
>>>> 1.8176437409965729

I tried it with various numbers of loops and lines, and f.read().splitlines(True) always seems to perform a bit better than the other two.
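For anyone who wants to repeat the measurement, here is one possible variant of the benchmark. It is only a sketch: it generates its own 20-line temporary file instead of relying on test.txt, and it takes the minimum of several timeit.repeat runs to reduce noise. The function names and the loop counts are assumptions, and the timings will differ from machine to machine.

```python
import os
import tempfile
import timeit


def read_with_list(path):
    with open(path, "r") as f:
        return list(f)


def read_with_splitlines(path):
    with open(path, "r") as f:
        return f.read().splitlines(True)


def read_with_readlines(path):
    with open(path, "r") as f:
        return f.readlines()


def benchmark(path, number=1000, repeat=5):
    """Return the best total time in seconds over `repeat` runs, per approach."""
    candidates = (read_with_list, read_with_splitlines, read_with_readlines)
    return {
        func.__name__: min(
            timeit.repeat(lambda: func(path), number=number, repeat=repeat)
        )
        for func in candidates
    }


# Throwaway 20-line file, roughly the size used in the answer.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line number {i}\n" for i in range(20))
    sample_path = tmp.name

timings = benchmark(sample_path, number=200, repeat=3)
os.remove(sample_path)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```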

Now, syntactically speaking, all of your examples seem to be valid. Refer to this documentation for more information.

According to it, if your goal is to read lines from a file, you can loop over the file object:

for line in f:
    ...

The documentation states that this is memory efficient, fast, and leads to simple code. It would be another good alternative in your case if you don't need to manipulate the lines in a list.

EDIT

Note that the True argument to splitlines matters: it is the keepends flag, which defaults to False. Without it, the line endings are stripped, and the result no longer matches list(f) or f.readlines().
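The effect of the keepends argument can be checked directly on a string. This is a small sketch; io.StringIO is used here only as a stand-in for an open text file.

```python
import io

text = "first\nsecond\nthird\n"

with_endings = text.splitlines(True)  # keepends=True
without_endings = text.splitlines()   # keepends defaults to False
from_readlines = io.StringIO(text).readlines()

print(with_endings)     # ['first\n', 'second\n', 'third\n']
print(without_endings)  # ['first', 'second', 'third']
print(with_endings == from_readlines)  # True
```

Only the keepends=True form preserves the trailing newline on each line, as readlines() does.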

My personal recommendation

I don't want to make this answer too opinion-based, but I think it would be useful for you to know that I don't think performance should be your deciding factor until it is actually a problem for you, especially since all three forms are allowed and recommended in the official Python docs I linked.

So, my advice is:

First, pick the most logical one for your particular case, then choose the one you find the most readable and be consistent with it throughout your code.

They all achieve the same goal of returning a list of strings, but using different approaches. f.readlines() is the most Pythonic.

with open("file.txt", "r") as f:
    data = list(f)

f here is a file-like object. list iterates over it, and each iteration returns one line of the file.


with open("file.txt", "r") as f:
    data = f.read().splitlines(True)

f.read() returns a string, which you split on newlines, returning a list of strings.


with open("file.txt", "r") as f:
    data = f.readlines()

f.readlines() does the same as above: it reads the entire file and splits on newlines.
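One caveat worth checking: str.splitlines() recognises more line boundaries than a text-mode file does (form feed, vertical tab, and a few Unicode separators, among others), so the results can differ for files that contain such characters. The sketch below uses a temporary file with a form feed in the first line; the file content is made up for the illustration.

```python
import os
import tempfile

# A form feed (\x0c) sits inside the first line. str.splitlines() treats it
# as a line boundary; file iteration and readlines() split only on newlines.
content = "page one\x0cpage two\nlast line\n"

with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(content)
    sample_path = tmp.name

with open(sample_path, "r", encoding="utf-8") as f:
    from_list = list(f)

with open(sample_path, "r", encoding="utf-8") as f:
    from_readlines = f.readlines()

with open(sample_path, "r", encoding="utf-8") as f:
    from_splitlines = f.read().splitlines(True)

os.remove(sample_path)

print(from_list == from_readlines)  # True
print(len(from_list))               # 2
print(len(from_splitlines))         # 3
```

For ordinary text that uses only newline characters as separators, the three lists are identical.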

All three of your options produce the same end result, but nonetheless, one of them is definitely worse than the other two: f.read().splitlines(True).

The reason this is the worst option is that it requires the most memory. f.read() reads the file content into memory as a single (maybe huge) string object, then calling .splitlines(True) on that additionally creates the list of the individual lines, and only after that does the string object containing the file's entire content get garbage collected and its memory freed. So, at the moment of peak memory use - just before the memory for the big string is freed - this approach requires enough memory to store the entire content of the file in memory twice: once as a single string, and once as a list of strings.

By contrast, list(f) or f.readlines() will read a line from disk, add it to the result list, then read the next line, and so on. So the whole file content is never duplicated in memory, and the peak memory use will be about half that of the .splitlines(True) approach. These approaches are thus superior to using .read() and .splitlines(True).
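The difference in peak memory can be observed with the standard tracemalloc module. The following is a rough sketch, not a rigorous benchmark: the helper names and the size of the generated file are assumptions, and the exact figures depend on the Python version and platform.

```python
import os
import tempfile
import tracemalloc


def via_splitlines(path):
    with open(path, "r") as f:
        return f.read().splitlines(True)


def via_readlines(path):
    with open(path, "r") as f:
        return f.readlines()


def peak_bytes(read_lines, path):
    """Run read_lines(path) under tracemalloc and return the peak traced size."""
    tracemalloc.start()
    try:
        read_lines(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Generate a throwaway file of about 4 MB (50,000 lines of 80 characters).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("x" * 79 + "\n" for _ in range(50_000))
    sample_path = tmp.name

peak_splitlines = peak_bytes(via_splitlines, sample_path)
peak_readlines = peak_bytes(via_readlines, sample_path)
os.remove(sample_path)

print(f"read().splitlines(True): {peak_splitlines / 1e6:.1f} MB peak")
print(f"readlines():             {peak_readlines / 1e6:.1f} MB peak")
```

The measured ratio need not be exactly two, since each line in the list carries its own per-object overhead.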

As for list(f) vs f.readlines(), there's no concrete advantage to either of them over the other; the choice between them is a matter of style and taste.

In all 3 cases, you're using a context manager to read a file. This file is a file object.

File Object

An object exposing a file-oriented API (with methods such as read() or write()). Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called file-like objects or streams. The canonical way to create a file object is by using the open() function. https://docs.python.org/3/glossary.html#term-file-object

list

with open("file.txt", "r") as f:
    data = list(f)

This works because your file object is a stream-like object that can be iterated over. Converting it to a list works roughly like this:

[element for element in f]  # keeps pulling lines until StopIteration is raised
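Spelled out by hand, the iteration that list() performs looks roughly like the sketch below. io.StringIO stands in for an open text file here, which is an assumption made only to keep the example self-contained.

```python
import io

f = io.StringIO("first\nsecond\nthird\n")  # stands in for an open text file

lines = []
iterator = iter(f)
while True:
    try:
        lines.append(next(iterator))  # one line per call
    except StopIteration:             # raised once the stream is exhausted
        break

print(lines)  # ['first\n', 'second\n', 'third\n']
```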

readlines method

with open("file.txt", "r") as f:
    data = f.readlines()

The method readlines() reads until EOF using readline() and returns a list containing the lines.

Difference with list:

  1. You can limit how much is read: fileObject.readlines(sizehint). The argument is an approximate size, not a number of lines.

  2. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
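A short sketch of the size hint in action, again with io.StringIO standing in for a real file. The hint value of 15 is arbitrary, and exactly how many lines come back for a given hint is an implementation detail, so the example only shows that the first call stops early and a second call picks up the rest.

```python
import io

all_lines = [f"row {i}\n" for i in range(10)]
stream = io.StringIO("".join(all_lines))  # stands in for an open text file

partial = stream.readlines(15)  # stops once roughly 15 characters are read
rest = stream.readlines()       # no hint: reads everything that remains

print(len(partial), len(rest))
print(partial + rest == all_lines)
```

list(f) has no equivalent parameter; it always consumes the stream to the end.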

read

When should I ever use file.read() or file.readlines()?
