MemoryError in Python when processing a large file line by line

Question

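I'm trying to concatenate model output files. The model run was split into 5 parts, and each output file corresponds to one of those partial runs. Because of the way the software writes its output, the timestep labels restart from 0 in every file. I wrote some code to:

1) concatenate all the output files together
2) edit the merged file to re-label all timesteps, starting at 0 and increasing by a fixed increment at each one.

The aim is to load a single file into my visualization software in one chunk, rather than opening 5 different windows.

So far my code throws a memory error because the files I'm dealing with are large.

I have a few ideas about how to get rid of it, but I'm not sure which would work and which might slow things down to a crawl.

Code so far: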

import os
import time

start_time = time.time()

#create new txt file in same folder as python script

open("domain.txt","w").close()


"""create concatenated document of all tecplot output files"""
#look into file number 1

for folder in range(1,6,1): 
    folder = str(folder)
    for name in os.listdir(folder):
        if "domain" in name:
            with open(folder+'/'+name) as file_content_list:
                start = ""
                for line in file_content_list:
                    start = start + line# + '\n' 
                with open('domain.txt','a') as f:
                    f.write(start)
              #  print start

#identify file with "domain" in name
#extract contents
#append to the end of the new document with "domain" in folder level above
#once completed, add 1 to the file number previously searched and do again
#keep going until no more files with a higher number exist

""" replace the old timesteps with new timesteps """
#open file named domain.txt
#Look for lines:
##ZONE T="0.000000000000e+00s", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL
##STRANDID=1, SOLUTIONTIME=0.000000000000e+00
# if they are found edits them, otherwise copy the line without alteration

with open("domain.txt", "r") as combined_output:
    start = ""
    start_timestep = 0
    time_increment = 3.154e10
    for line in combined_output:
        if "ZONE" in line:
            start = start + 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            start = start + 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
            start_timestep = start_timestep + time_increment
        else:
            start = start + line

    with open('domain_final.txt','w') as f:
        f.write(start)

end_time = time.time()
print 'runtime : ', end_time-start_time

os.remove("domain.txt")

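So far I get the memory error at the concatenation stage.

To improve this, I could:

1) Make the corrections on the go as I read each file. But since the code already fails to get through a whole file, I don't think that would make much difference other than to the computing time.

2) Load all the files into an array, write a function that performs the checks, and run that function on the array.

Something like: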

def do_correction(line):
        if "ZONE" in line:
            return 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            return 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
        else:
            return line

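3) Keep it as is, and ask Python to indicate when it is about to run out of memory, then write to the file at that stage. Does anyone know if this is possible?

Thank you for your help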

Answer 1

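There is no need to read the entire contents of each file into memory before writing to the output file. Large files will simply consume the available memory, possibly all of it.

Just read and write one line at a time. Also, open the output file only once, and choose a name that will not itself be picked up and treated as an input file. Otherwise you risk concatenating the output file onto itself, assuming that loading it has not already consumed all the memory. (This is not a problem yet, but it could become one if you also process files in the current directory.)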

import os

with open('output.txt', 'w') as outfile:
    for folder in range(1, 6):
        folder = str(folder)  # os.listdir() and os.path.join() need a string path
        for name in os.listdir(folder):
            if "domain" in name:
                with open(os.path.join(folder, name)) as file_content_list:
                    for line in file_content_list:
                        # perform corrections/modifications to line here
                        outfile.write(line)

Now you can process the data in a line-oriented manner: just modify each line before writing it to the output file.
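As a rough sketch of what "modify each line before writing it" could look like for the timestep re-labelling in the question, the version below does the concatenation and the correction in one streaming pass, so only one line is held in memory at a time. The function names (`relabel_lines`, `iter_domain_lines`, `concatenate`) are hypothetical. The header text and the `3.154e10` increment are copied from the question's code. The `%.12e` formatting is an assumption made to match the sample `ZONE` / `STRANDID` lines, and `sorted()` is an assumption that a deterministic file order is wanted.

```python
import os


def relabel_lines(lines, time_increment=3.154e10):
    """Yield lines unchanged, except ZONE/STRANDID headers, which are
    rewritten with a running timestep that starts at 0."""
    timestep = 0.0
    for line in lines:
        if "ZONE" in line:
            # Header layout copied from the question; only the time changes.
            yield ('ZONE T="%.12es", N=87715, E=173528, '
                   'F=FEPOINT, ET=QUADRILATERAL\n' % timestep)
        elif "STRANDID" in line:
            yield "STRANDID=1, SOLUTIONTIME=%.12e\n" % timestep
            # Advance only after both header lines of a zone were written.
            timestep += time_increment
        else:
            yield line


def iter_domain_lines(root=".", folders=range(1, 6)):
    """Yield every line of every 'domain' file in folders 1..5, in order."""
    for folder in folders:
        folder_path = os.path.join(root, str(folder))
        for name in sorted(os.listdir(folder_path)):
            if "domain" in name:
                with open(os.path.join(folder_path, name)) as infile:
                    for line in infile:
                        yield line


def concatenate(out_path="output.txt", root="."):
    """Stream all input lines through the relabelling step into one file."""
    with open(out_path, "w") as outfile:
        for line in relabel_lines(iter_domain_lines(root)):
            outfile.write(line)
```

Calling `concatenate()` from the directory that contains the folders `1` to `5` would then write the merged, re-labelled file directly, without the intermediate `domain.txt`. Because both helpers are generators, no string ever grows with the size of the input.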

Answer 1
2 Accepted 2017-02-24 10:27:11
