
How to effectively skip the first n lines in a file with Python?

I am currently using a C++ script with a Python wrapper to process a large (15 GB) text file line by line. It reads a line from input.txt, processes it, then writes the result to output.txt. I am using a straightforward loop (inp is input.txt opened for reading, out is output.txt opened for writing):

for line in inp:
    result = operate(line)
    out.write(result)

However, the C++ script has some failure rate, which causes the loop to stop after about ten million iterations. This leaves me with an output file built from only about 10% of the input.

Since I have no means of fixing the original script, I thought about just restarting it where it stopped. I counted the lines of output.txt, created another file called output2.txt, and ran the following code:

k = 0
for line in inp:
    if k < 12123253:
        # Skip lines that were already processed in the first run.
        k += 1
    else:
        result = operate(line)
        out2.write(result)
        k += 1

However, whereas counting the lines took under a minute, this method takes many hours to reach the designated line.

Why is this method inefficient? Is there a faster one? I am on a Windows PC with plenty of computing power (72 GB RAM, good processors), using Python 2.7.

I suggest you use itertools.islice:

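import itertools

with open(inp) as f:  # inp is the path to the input file
    # Lazily skip the first start_line lines, then yield the rest
    result = itertools.islice(f, start_line, None)
    for i in result:
        pass  # do something with this line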
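To tie this back to the question, here is a minimal sketch of resuming a run by counting the lines already written and skipping that many input lines with islice. It assumes each input line produces exactly one complete output line; `operate` here is a hypothetical stand-in for the real C++-backed call, and `count_lines` and `resume` are names made up for illustration.

```python
import itertools
import os


def operate(line):
    # Hypothetical stand-in for the C++-backed processing step.
    return line.upper()


def count_lines(path):
    """Count lines in a file without loading it into memory."""
    with open(path) as f:
        return sum(1 for _ in f)


def resume(input_path, output_path):
    """Skip input lines already reflected in the output, then append the rest.

    Assumes one output line per input line (an assumption, not from the question).
    """
    done = count_lines(output_path) if os.path.exists(output_path) else 0
    with open(input_path) as inp, open(output_path, "a") as out:
        for line in itertools.islice(inp, done, None):
            out.write(operate(line))
    return done
```

Appending to the same output file avoids having to merge output.txt and output2.txt afterwards. Note that islice still has to read every skipped line, so the skip costs about as much as counting the lines did.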

You may use file.seek and file.tell. Below is sample (pseudo) code:

def seralizebreakpoint(pos):
    pass

def desearializebreakpoint():
    '''return -1 if there is actually no break point'''
    pass

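def process(inp):

    pos = inp.tell()
    # Use readline() rather than "for line in inp": the file iterator reads
    # ahead, so tell() is unreliable (Python 2) or raises (Python 3) inside it.
    for line in iter(inp.readline, ''):
        try:
            result = operate(line)
            pos = inp.tell()
        except:
            seralizebreakpoint(pos)
            raise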

def processEntry(pathtoinput):

    bp = desearializebreakpoint() 
    with open(pathtoinput, 'r') as inp:
        if bp > -1:
            inp.seek(bp)
        process(inp)
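The pseudocode above leaves the two breakpoint helpers as stubs. One possible way to fill them in is sketched below, storing the byte offset in a small text file. The names `load_offset`, `save_offset`, `process_from_checkpoint` and the checkpoint file are assumptions for illustration, and `operate` is again a hypothetical stand-in for the real processing call.

```python
import os


def operate(line):
    # Hypothetical stand-in for the C++-backed processing step.
    return line.upper()


def load_offset(checkpoint_path):
    """Return the saved byte offset, or 0 if no checkpoint exists yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return int(f.read().strip() or 0)


def save_offset(checkpoint_path, offset):
    """Persist the byte offset of the next unprocessed line."""
    with open(checkpoint_path, "w") as f:
        f.write(str(offset))


def process_from_checkpoint(input_path, output_path, checkpoint_path):
    offset = load_offset(checkpoint_path)
    # Binary mode keeps tell()/seek() offsets as exact byte counts.
    with open(input_path, "rb") as inp, open(output_path, "ab") as out:
        inp.seek(offset)
        try:
            # readline() instead of "for line in inp" so tell() stays accurate.
            for line in iter(inp.readline, b""):
                out.write(operate(line))
                offset = inp.tell()
        finally:
            # Record the end of the last line that was fully processed.
            save_offset(checkpoint_path, offset)
```

Unlike skipping by line count, seek jumps straight to the saved offset without re-reading the earlier part of the file. The `finally` clause only helps when the failure surfaces as a Python exception; if the process is killed outright, the offset would need to be saved periodically inside the loop instead.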
