
In Python, is read() or readlines() faster?

I want to read a huge file in my code. Is read() or readline() faster for this? How about the loop:

for line in fileHandle:

For a text file, just iterating over it with a for loop is almost always the way to go. Never mind about speed; it is the cleanest.

In some versions of Python, readline() really does read just a single line, while the for loop reads large chunks and splits them into lines, so the loop may be faster. I think more recent versions of Python also buffer readline(), so the performance difference will be minuscule (for is probably still microscopically faster because it avoids a method call). However, choosing one over the other for performance reasons is probably premature optimisation.
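As a minimal sketch of the for-loop approach described above (the helper name count_nonblank_lines and the throwaway sample file are made up for illustration), streaming a text file line by line might look like this:

```python
import os
import tempfile


def count_nonblank_lines(path):
    """Count lines that contain something other than whitespace."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        # The file object is its own iterator, so the loop pulls one
        # line at a time instead of loading the whole file.
        for line in handle:
            if line.strip():
                count += 1
    return count


# Build a small throwaway file so the example runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("alpha\n\nbeta\ngamma\n")
    sample_path = tmp.name

nonblank = count_nonblank_lines(sample_path)
os.remove(sample_path)
print(nonblank)
```

Memory use stays roughly constant however large the file is, because only the current line (plus the internal buffer) is held at any moment.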

Edit to add: I just checked back through some Python release notes. Python 2.5 said:

It's now illegal to mix iterating over a file with for line in file and calling the file object's read()/readline()/readlines() methods.

Python 2.6 introduced TextIOBase, which supports both iterating and readline() simultaneously.

Python 2.7 fixed interleaving read() and readline().

If the file is huge, read() is definitely a bad idea, as it loads the whole file into memory (when called without a size parameter).

readline() reads only one line at a time, so I would say it is the better choice for huge files.

And just iterating over the file object should be as efficient as using readline().

See http://docs.python.org/tutorial/inputoutput.html#methods-of-file-objects for more info.
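If you do want read() on a big file, the size parameter mentioned above keeps memory bounded. A rough sketch, assuming a 64 KiB chunk size and a hypothetical helper named iter_chunks (io.StringIO stands in for a real file here):

```python
import io


def iter_chunks(handle, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size characters (or bytes)."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # read() returns an empty string/bytes at end of file
            break
        yield chunk


# io.StringIO stands in for a file object returned by open(path).
fake_file = io.StringIO("x" * 150_000)
chunk_sizes = [len(chunk) for chunk in iter_chunks(fake_file)]
print(chunk_sizes)
```

Note that chunks cut across line boundaries, so this suits tasks such as copying or hashing better than line-oriented processing.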

The docs for readlines() indicate there is an optional sizehint. Because the description is so vague, it's easy to overlook, but I have often found this to be the fastest way to read files. Use readlines(1), which hints at one line but in fact reads about 4k or 8k worth of lines, IIRC. This takes advantage of the OS buffering and reduces the number of calls somewhat without using an excessive amount of memory.

You can experiment with different values for the sizehint, but I found 1 to be optimal on my platform when I was testing this.
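A sketch of batching with the hint follows. One caveat: the rounding up to an internal buffer size is what the Python 2 docs describe. In Python 3 the hint is a total size in bytes or characters, so a hint of 1 typically returns one line per call and a larger value is needed to get batches. The 8192 default and the helper name iter_line_batches are assumptions for illustration.

```python
import io


def iter_line_batches(handle, size_hint=8192):
    """Yield lists of complete lines, roughly size_hint characters per list."""
    while True:
        batch = handle.readlines(size_hint)
        if not batch:  # an empty list means end of file
            break
        yield batch


# io.StringIO stands in for a real file object.
fake_file = io.StringIO("".join(f"line {i}\n" for i in range(5000)))
batches = list(iter_line_batches(fake_file))
total_lines = sum(len(batch) for batch in batches)
print(len(batches), total_lines)
```

Each batch contains only whole lines, so the caller never has to stitch a partial line back together.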

read() tries to read the whole file and save it into a single string for later use, while readlines() also tries to read the whole file but splits it at each "\n" and stores the lines as strings in a list. Hence, neither method is preferred if the file is excessively big.

readline() and the for loop (i.e. for line in file:) read one line at a time and store it in a string. I guess they will take the same time to finish the job if memory allows. However, these two are preferred if the file is huge.
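To illustrate that the two approaches produce the same lines, here is a small sketch using io.StringIO in place of a real file (the variable names are made up for the example):

```python
import io

text = "first\nsecond\nthird\n"

# Explicit readline() loop: an empty string signals end of file.
handle = io.StringIO(text)
lines_via_readline = []
while True:
    line = handle.readline()
    if not line:
        break
    lines_via_readline.append(line)

# Equivalent for loop over the file object.
lines_via_loop = [line for line in io.StringIO(text)]

print(lines_via_readline == lines_via_loop)
```

The for loop is shorter and avoids the manual end-of-file check, which is the main practical reason to prefer it.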

If you have enough memory and performance is a concern, use read(). With a gzip file, I have seen read().split('\n') take 5 seconds to loop through, whereas using the iterator took 38 seconds. The GZ file was around 45 MB.

The real difference between read() and readlines():

The read function simply loads the file as is into memory. The readlines method reads the file as a list of lines, each still ending with its line terminator. The readlines method should only be used on text files, and neither should be used on large files. If you are copying the information from a text file, read works well, because the result can be output with the write function without any need to add line termination.
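A short sketch of the difference, using io.StringIO as a stand-in for real files. In current Python 3, readlines() returns each line with its terminator still attached, so both results can be written back unchanged:

```python
import io

source = io.StringIO("one\ntwo\nthree\n")

whole = source.read()       # one string holding the entire content
source.seek(0)              # rewind before reading again
lines = source.readlines()  # list of lines, terminators included

# Copy the content in two ways; both reproduce the original text.
copy_a = io.StringIO()
copy_a.write(whole)
copy_b = io.StringIO()
copy_b.writelines(lines)

print(repr(whole))
print(lines)
```

If you need the lines without terminators, whole.splitlines() or line.rstrip("\n") strips them after reading.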

If your file is a text file, then use readlines(), which is the obvious way to read a file containing lines. Apart from that: run benchmarks if you are really concerned about possible performance problems. I doubt you will encounter any issues; the speed of the filesystem should be the limiting factor.
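One way to run such a benchmark is with the standard timeit module. This is only a sketch: the three helper functions, the 20,000-line sample file and the repeat count are arbitrary choices for illustration, and results will vary with platform, Python version and file size.

```python
import os
import tempfile
import timeit


def with_read(path):
    with open(path, encoding="utf-8") as handle:
        return len(handle.read().split("\n"))


def with_readlines(path):
    with open(path, encoding="utf-8") as handle:
        return len(handle.readlines())


def with_iteration(path):
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Create a throwaway sample file to measure against.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.writelines(f"row {i}\n" for i in range(20_000))
    bench_path = tmp.name

# Time five runs of each strategy on the same file.
timings = {
    func.__name__: timeit.timeit(lambda f=func: f(bench_path), number=5)
    for func in (with_read, with_readlines, with_iteration)
}
os.remove(bench_path)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Measure with a file that resembles your real data; a small sample that fits in the OS cache says little about a multi-gigabyte input.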

Neither. Both of them will read the content into memory. For big files, iterating over the file object loads only one line of your file at a time and is perhaps a good way to deal with the contents of a huge file.

