How to effectively skip the first n lines in a file with Python?
I am currently using a C++ script with a Python wrapper to process a large (15 GB) text file line by line. Effectively, it reads a line from input.txt, processes it, then writes the result to output.txt. I am using a straightforward loop here (inp is input.txt opened for reading, out is output.txt opened for writing):
for line in inp:
    result = operate(line)
    out.write(result)
However, because of issues in the C++ script, it has some failure rate, which causes the loop to stop after about ten million iterations. This leaves me with an output file built from only about 10% of the input.
Since I have no means of fixing the original script, I thought about just restarting it where it stopped. I counted the lines of output.txt, created another file called output2.txt, and ran the following code:
k = 0
for line in inp:
    if k < 12123253:
        k += 1
    else:
        result = operate(line)
        out2.write(result)
        k += 1
However, compared to counting the lines, which finished in under a minute, this method takes hours to reach the designated line.
Why is this method inefficient? Is there a faster one?
I am on a Windows PC with ample computing power (72 GB RAM, good processors) and am using Python 2.7.
You may use file.seek and file.tell. Below is sample (pseudo) code:
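def serializebreakpoint(pos):
    '''store pos (a byte offset) somewhere persistent, e.g. in a small file'''
    pass

def deserializebreakpoint():
    '''return -1 if there is actually no break point'''
    return -1  # placeholder: load the stored offset here

def process(inp):
    pos = inp.tell()
    # Use readline() rather than "for line in inp": in Python 2, file
    # iteration uses a read-ahead buffer, so tell() would run ahead of
    # the line currently being processed.
    while True:
        line = inp.readline()
        if not line:
            break
        try:
            result = operate(line)  # write result to the output file here
            pos = inp.tell()
        except:
            # pos still points at the start of the failing line
            serializebreakpoint(pos)
            raise

def processEntry(pathtoinput):
    bp = deserializebreakpoint()
    # binary mode avoids text-mode tell() quirks on Windows
    with open(pathtoinput, 'rb') as inp:
        if bp > -1:
            inp.seek(bp)
        process(inp)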
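The seek/tell approach needs a saved byte offset, which a run that has already failed does not have. One possible way to recover it from a known line count is to scan the input in large binary chunks, count newline bytes, and then seek to the resulting offset. The following is only a sketch under stated assumptions: the helper names offset_after_n_lines and resume_from_line, the 1 MiB chunk size, and passing operate in as a parameter are all hypothetical, and it assumes lines end with "\n" and that each input line produced exactly one output line.

```python
import io


def offset_after_n_lines(path, n, chunk_size=1 << 20):
    """Return the byte offset just past the n-th newline byte in the file.

    Hypothetical helper: it counts newline bytes in raw chunks instead of
    creating one Python string per line. If the file has fewer than n
    newlines, the file size is returned.
    """
    if n <= 0:
        return 0
    offset = 0        # bytes consumed in fully skipped chunks
    remaining = n     # newlines still to be skipped
    with io.open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return offset  # reached end of file first
            count = chunk.count(b'\n')
            if count < remaining:
                remaining -= count
                offset += len(chunk)
                continue
            # The n-th newline lies inside this chunk: walk to it.
            idx = -1
            for _ in range(remaining):
                idx = chunk.index(b'\n', idx + 1)
            return offset + idx + 1


def resume_from_line(in_path, out_path, n, operate):
    """Skip the first n lines of in_path, then append results to out_path.

    operate is assumed to take one line (bytes) and return bytes.
    """
    start = offset_after_n_lines(in_path, n)
    with io.open(in_path, 'rb') as inp, io.open(out_path, 'ab') as out:
        inp.seek(start)  # jump straight to line n + 1
        for line in inp:
            out.write(operate(line))
```

Binary mode is used so that offsets are plain byte counts on Windows as well. As a consequence, operate receives bytes on Python 3 (a plain str on Python 2.7). With the numbers from the question, the call would look like resume_from_line('input.txt', 'output2.txt', 12123253, operate).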