
Reading large file (52mb) of lines in Python, is it better to iterate the lines or use readlines?

I have a txt file containing 4 million words that I want to read into a list. I have two options:

l = [line for line in open(wordlist)]

or:

l = open(wordlist).readlines()

readlines() appears to be much faster; I'm guessing this is because the data is read into memory in one go. The first option would be better for conserving memory because it reads one line at a time, is this true? Does readlines() use any kind of buffer when copying? In general, which is best to use?

Both options read the whole thing into memory as one big list. The first option is slower because you delegate the looping to Python bytecode. If you want one big list with all the lines from your file, there is no reason to use a list comprehension here.
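If you really do want every line in a list, calling list() on the file object is the most direct spelling. A minimal sketch, equivalent to both options above, assuming wordlist holds the file path as in the question:

with open(wordlist) as fileobj:
    # Materializes every line, trailing newlines included, exactly like
    # readlines() or the list comprehension.
    lines = list(fileobj)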

I'd use neither. Loop over the file and process the lines as you loop:

with open(wordlist) as fileobj:
    for line in fileobj:
        # do something with this line only.

There is usually no need to keep the whole unprocessed file data in memory.
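As a concrete sketch of that pattern, here is one way to find the longest word while holding only one line in memory at a time (wordlist is assumed to be the file path, as in the question):

longest = ""
with open(wordlist) as fileobj:
    for line in fileobj:
        word = line.strip()  # drop the trailing newline
        if len(word) > len(longest):
            longest = word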

I think the real answer is, it depends.

If you have the memory and it doesn't matter how much of it you use, then by all means put all 4 million strings into a list with the readlines() method. But then I would ask: is it really necessary to keep them all in memory at once?

Probably the more performant method is to iterate over each line/word one at a time, do something with that word (count it, hash-vectorize it, etc.), and then let the garbage collector take it to the dump. This method processes one line at a time, generator style, instead of reading everything into memory unnecessarily.
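A minimal sketch of that streaming style, using a generator to feed collections.Counter one word at a time (the filename "words.txt" is only a placeholder):

from collections import Counter

def words(path):
    # Yield one stripped word per line, lazily.
    with open(path) as f:
        for line in f:
            yield line.strip()

# Only one line is alive at any moment; the Counter holds the aggregates.
counts = Counter(words("words.txt"))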

A lot of the builtins in Python 3.x have moved to this lazy, generator-like style; one example is range, which in Python 3 behaves like Python 2's xrange and produces values on demand.
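For instance, Python 3's range computes values on demand instead of building a list, so it stays cheap no matter how large the span (a quick illustration):

r = range(10**9)     # constant memory: no billion-element list is built
print(r[123456789])  # indexing works without materializing the sequence
print(500 in r)      # membership is computed arithmetically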

Considering you are doing a binary search on the list, though, you need to sort it first, so you have to read the data into a list and sort it. On a file with 10 million random digits, calling readlines() followed by an in-place .sort() is slightly faster:

In [15]: %%timeit
with open("test.txt") as f:
     r = f.readlines()
     r.sort()
   ....: 
1 loops, best of 3: 719 ms per loop

In [16]: %%timeit
with open("test.txt") as f:
    sorted(f)
   ....: 
1 loops, best of 3: 776 ms per loop

In [17]: %%timeit
with open("test.txt") as f:
     r = [line for line in f] 
     r.sort()
   ....: 

1 loops, best of 3: 735 ms per loop

You end up with the same data in the list whichever approach you use, so there is no memory advantage; the only difference is that readlines is a bit more efficient than a list comprehension or calling sorted on the file object.
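Since the stated goal is a binary search over the sorted list, the standard-library bisect module can do the lookup. A sketch, assuming r is the sorted list from the timings above and that every line still ends with its newline:

import bisect

def contains(sorted_lines, word):
    # Lines keep their trailing "\n", so search for the same form.
    target = word + "\n"
    i = bisect.bisect_left(sorted_lines, target)
    return i < len(sorted_lines) and sorted_lines[i] == target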
