
Handling big text files in Python

The basics: I need to process 4 GB text files on a per-line basis.

Using .readline() or for line in f is great for memory, but takes ages for I/O. I would like to use something like yield, but that (I think) will chop lines.

POSSIBLE ANSWER:

file.readlines([sizehint])
    Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.

Didn't realize you could do this!
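
For instance, a minimal sketch that batches a big file into lists of whole lines totalling roughly 1 MB each; the hint value and the process_line helper are illustrative assumptions, not part of the quoted docs:

def process_line(line):
    pass  # hypothetical stand-in for the real per-line work

with open("bigfile.txt") as f:
    while True:
        lines = f.readlines(1024 * 1024)  # whole lines totalling ~1 MB
        if not lines:
            break
        for line in lines:
            process_line(line)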

You can just iterate over the file object:

with open("filename") as f:
    for line in f:
        pass  # whatever you need to do with the line

This will do some internal buffering to improve performance. (Note that file.readline() will perform considerably worse because it does not buffer; that's also why you can't mix iteration over a file object with file.readline().)

If you want to do something on a per-line basis, you can just loop over the file object:

f = open("w00t.txt")
for line in f:
    pass  # do stuff with each line
f.close()

However, doing stuff on a per-line basis can be an actual performance bottleneck, so perhaps you should use a better chunk size? What you can do is, for example, read 4096 bytes, find the last line ending \n, process that part, and prepend what is left over to the next chunk, as in the sketch below.
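
A minimal sketch of that chunk-and-carry approach, assuming the 4096-byte reads mentioned above; process_line is a hypothetical stand-in for the real work:

def process_line(line):
    pass  # per-line work goes here

leftover = ""
with open("bigfile.txt") as f:
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        chunk = leftover + chunk
        last_newline = chunk.rfind("\n")
        if last_newline == -1:
            leftover = chunk  # no complete line yet, carry it all forward
            continue
        complete = chunk[:last_newline + 1]
        leftover = chunk[last_newline + 1:]
        for line in complete.splitlines():
            process_line(line)

if leftover:  # the file may not end with a newline
    process_line(leftover)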

You could always chunk the lines up? I mean, why open one file and iterate all the way through when you can open the same file 6 times and iterate through? e.g.

a  # is the first 1024 bytes
b  # is the next 1024
# etc., etc.
f  # is the last 1024 bytes

Each file handle runs in a separate process, and then we're cooking with gas. Just remember to deal with line endings properly.
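
A hedged sketch of that multi-handle idea, assuming a multiprocessing.Pool over byte ranges; FILENAME, WORKERS, and work_range are illustrative names, and the line-ending handling follows the convention that each line belongs to the worker whose range contains its first byte:

import os
from multiprocessing import Pool

FILENAME = "w00t.txt"  # assumed input file
WORKERS = 6            # one handle per worker, as suggested above

def work_range(bounds):
    start, end = bounds
    count = 0
    with open(FILENAME, "rb") as f:
        f.seek(max(start - 1, 0))
        if start != 0 and f.read(1) != b"\n":
            f.readline()  # mid-line: this partial line belongs to the previous worker
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            count += 1  # stand-in for the real per-line work
    return count

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // WORKERS
    bounds = [(i * step, size if i == WORKERS - 1 else (i + 1) * step)
              for i in range(WORKERS)]
    with Pool(WORKERS) as pool:
        print(sum(pool.map(work_range, bounds)))

Each worker skips any partial line at the start of its range and reads its last line to completion even if it crosses the range boundary, so every line is processed exactly once.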
