
Reading a specific number of lines from a file without storing in memory?

I have data that I need to read and extract specific blocks from using Python, but the files are potentially tens of millions of lines long and too large to hold in memory, so I only want to pull the data that I actually need to analyse.

The files are formatted as follows:

4 # Number of lines per block
0 # Start of block 0
A line of data
A line of data
A line of data
A line of data
1 # Start of block 1
A line of data
A line of data
...

The issue I'm having is that once I find the specific block I need and read it into a list, my code continues reading and adding data until the end of the file instead of the end of that block.

Here's what I have so far:

required_block = 5
filepath = 'file.txt'
data = []

with open(filepath, 'r') as f:
    block_length = int(f.readline())
    for line in f:
        block = int(line)
        if block != required_block:
            for _ in range(block_length):
                next(f)
        else:
            break
    for line in f:
        data.append(line)

If I try to add a range to the last 'for' loop, it just reads the current line over and over.

Where am I going wrong?

EDIT: To clarify, I only want the last 'for' loop to run block_length times.

If you look at your code, your last for loop is the culprit. It appends every remaining line no matter what. Your first for loop never appends anything; it only skips through the file until it reaches the header of the required block. The second loop then appends everything up to the end of the file, because the append sits outside the block logic and nothing tells it to stop after block_length lines.
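The behaviour described above follows from the file object being its own iterator: both for loops pull from the same position, so the second loop resumes where the first stopped and runs to the end of the file. A minimal sketch of that effect, using io.StringIO as a stand-in for the real file (the sample contents and the name leftover are made up for illustration):

```python
import io

# Hypothetical sample: 2 lines per block, three blocks (0, 1, 2).
sample = "2\n0\na0\nb0\n1\na1\nb1\n2\na2\nb2\n"

f = io.StringIO(sample)
block_length = int(f.readline())

# First loop: stop as soon as the header of block 1 is seen,
# skipping the data lines of every earlier block.
for line in f:
    if int(line) == 1:
        break
    for _ in range(block_length):
        next(f)

# Second loop: resumes right after the header of block 1, but has no
# stopping condition, so it also swallows the header and data of block 2.
leftover = [line.strip() for line in f]
print(leftover)  # ['a1', 'b1', '2', 'a2', 'b2']
```

Bounding the second loop to block_length reads is therefore the only change needed.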

I think what you want is something like this:

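for line in f:
    block = int(line)
    if block != required_block:
        # Skip the data lines of a block we don't need.
        for _ in range(block_length):
            next(f)
    else:
        # Read exactly block_length data lines, then stop.
        for _ in range(block_length):
            data.append(next(f))
        break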

Try changing your last loop to this:

for _ in range(block_length):
    data.append(f.readline())
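Putting the skip-and-read logic into one function might look like the sketch below. It uses itertools.islice to take a bounded number of lines from the file iterator, which is an alternative to calling readline() in a range loop rather than anything the question requires. The function name read_block is hypothetical, and the sketch assumes each block header line contains only the block number (the "# ..." annotations in the format listing are treated as explanatory, not as part of the file).

```python
from itertools import islice


def read_block(path, required_block):
    """Return the data lines of one block, reading the file lazily.

    Assumes the first line holds the number of lines per block and each
    block starts with a header line containing only the block number.
    """
    with open(path, "r") as f:
        block_length = int(f.readline())
        for header in f:
            if int(header) == required_block:
                # Take exactly block_length lines, then stop reading.
                return [line.rstrip("\n") for line in islice(f, block_length)]
            # Not the block we want: consume and discard its data lines.
            for _ in islice(f, block_length):
                pass
    return []  # the requested block was not found
```

Only one line is held in memory at a time while skipping, and reading stops as soon as the requested block has been collected, so the rest of a very large file is never touched. A call such as read_block('file.txt', 5) would return the lines of block 5, or an empty list if the file has no such block.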

Reading a file line by line:

filepath = 'Iliad.txt'
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    while line:
        print("Line {}: {}".format(cnt, line.strip()))
        line = fp.readline()
        cnt += 1
