
How to bypass a memory error when replacing a string in a large txt file?

I have several files to iterate through, some of them several million lines long; a single file can be more than 500 MB. I need to prep them by searching for the string '| |' and replacing it with '|'.

However, the following code runs into a MemoryError. How can I rework it to search and replace the files line by line to save RAM? Any ideas? This is not simply about reading a large file line by line; it is about replacing the string line by line while avoiding the cost of converting the file contents into a list and back into a string.

import os
didi = self.lineEdit.text()
for filename in os.listdir(didi):            
    if filename.endswith(".txt"):
        filepath = os.path.join(didi, filename)
        with open(filepath, errors='ignore') as file:
            s = file.read()
            s = s.replace('| |', '|')
        with open(filepath, "w") as file:
            file.write(s)

Try the following code:

chunk_size = 5000
buffer = ""
i = 0

with open(fileoutpath, 'a') as fout:
    with open(fileinpath, 'r') as fin:
        for line in fin:
            buffer += line.replace('| |', '|')
            i += 1
            if i == chunk_size:
                fout.write(buffer)
                i = 0
                buffer = ""
    if buffer:
        fout.write(buffer)

This code reads one line at a time into memory.

It stores the results in a buffer, which holds at most chunk_size lines at a time; once full, the buffer is written to the file and cleared. This continues until the end of the file. After the reading loop, if the buffer still contains lines, it is written to disk.

This way you control not only the number of lines held in memory but also the number of disk writes. Writing to the file every time you read a line may not be a good idea, and neither is making chunk_size too large. It is up to you to find a chunk_size value that fits your problem.
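A small variant of the chunked approach above: accumulating lines in a list and flushing with writelines avoids repeated string concatenation. This is a sketch; the file names are placeholders, and the demo input is fabricated for illustration.

```python
chunk_size = 5000  # lines held in memory before each flush; tune to your workload

# Placeholder demo input; in practice fin would be the large source file.
with open("in.txt", "w") as f:
    f.write("p| |q\n" * 3)

buffer = []
with open("out.txt", "w") as fout, open("in.txt", "r") as fin:
    for line in fin:
        buffer.append(line.replace("| |", "|"))
        if len(buffer) == chunk_size:
            fout.writelines(buffer)  # flush a full chunk to disk
            buffer.clear()
    if buffer:  # write whatever is left after the loop
        fout.writelines(buffer)
```

list.append plus writelines is linear in the total data size, whereas building one ever-growing string with += can degrade badly outside CPython's in-place concatenation optimization.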

Note: you can use the buffering parameter of open() to get the same result; see the documentation for details. The logic is very similar.

Try reading the file in line by line, instead of as one giant chunk, i.e.:

with open(writefilepath, "w", errors='ignore') as filew:
    with open(readfilepath, "r", errors='ignore') as filer:
        for cnt, line in enumerate(filer):
            print("Line {}: {}".format(cnt, line.strip()))
            line = line.replace('| |', '|')
            filew.write(line)
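Since the original code overwrites each file in place, one way to combine line-by-line streaming with in-place replacement is to write to a temporary file and swap it in afterwards. This is a hedged sketch (the helper name and the sample file are illustrative, not from the original answers):

```python
import os
import tempfile

def replace_in_place(path, old="| |", new="|"):
    """Stream-rewrite `path` line by line through a temp file, then swap the
    temp file in, so the whole file is never loaded into memory at once."""
    dir_name = os.path.dirname(os.path.abspath(path))
    with open(path, "r", errors="ignore") as src, \
         tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as tmp:
        for line in src:
            tmp.write(line.replace(old, new))
        tmp_name = tmp.name
    os.replace(tmp_name, path)  # atomic rename over the original

# Illustrative usage on a small placeholder file:
with open("sample.txt", "w") as f:
    f.write("x| |y\n")
replace_in_place("sample.txt")
```

Keeping the temp file in the same directory as the target makes os.replace a same-filesystem rename, and a crash mid-run leaves the original file untouched.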
