Fast iterative file reading in python

I asked a question here about how to read a very large file into Python, and I got a response based on zip_longest.

The problem is that this solution is extremely slow: keras' model.predict took >2 hours to process 200,000 lines from the file, whereas it normally takes <3 minutes when the file is loaded directly into memory, and I want to be able to process files 5x this size.

I've since found the chunking functions in pandas, but I don't understand how to load a chunk of a file, reshape the data and then feed it to the model with these methods, and I also don't know whether this is the fastest way of reading and using the data in a very large file.
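For reference, here is a minimal sketch of how the pandas chunking approach could look, assuming a headerless CSV where every row is one flattened sample; the file name, chunk size, target shape and model are placeholders, not details taken from the question:

```python
import numpy as np
import pandas as pd

CHUNK_ROWS = 10_000  # tune to whatever comfortably fits in memory


def predict_in_chunks(model, path="data.csv", timesteps=10, features=8):
    predictions = []
    # read_csv with chunksize returns an iterator of DataFrames instead of
    # loading the whole file at once
    for chunk in pd.read_csv(path, header=None, chunksize=CHUNK_ROWS):
        x = chunk.to_numpy(dtype="float32")
        # reshape the flat rows into whatever shape the network expects,
        # e.g. (samples, timesteps, features) for a recurrent model
        x = x.reshape(-1, timesteps, features)
        predictions.append(model.predict(x))
    return np.concatenate(predictions)
```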

Any fast solutions to this problem are welcome.

If you are looking for fast-performing iterative Python functions, you should check out the itertools package and its documentation. I'm pretty sure it doesn't get much faster than that.
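As a rough sketch of what that could look like, itertools.islice can pull fixed-size batches of lines from an open file handle without reading the rest of the file; the batch size and the commented parse step are assumptions, not part of the original answer:

```python
from itertools import islice


def iter_batches(path, batch_size=10_000):
    with open(path) as handle:
        while True:
            # islice consumes at most batch_size lines from the handle
            batch = list(islice(handle, batch_size))
            if not batch:
                break
            yield batch


# usage: parse each batch into an array and feed it to the model, e.g.
# for lines in iter_batches("data.csv"):
#     ...  # convert lines to a numpy array, then model.predict(...)
```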

But be aware that, if you neglect any kind of preprocessing or reshaping, you will hit a performance ceiling when dealing with large datasets. Just imagine each of your 2e5 lines contains only one character (1 byte) of information: that is already 200 KB to read, and that is the lowest bound imaginable; realistic lines carry far more data. So you will have to face long processing times if you read 3 or 4 GB of information in one go.
