
How to bypass memory error when replacing a string in a large txt file?

I have several files to iterate through, some of them several million lines long. A single file can be larger than 500 MB. I need to prepare them by replacing every occurrence of the string '| |' with '|'.

However, the following code runs into a MemoryError. How can I rework it to search and replace line by line to save RAM? Any ideas? This question is not so much about reading a large file line by line as about replacing the string line by line while avoiding conversions between lists and strings.

import os
didi = self.lineEdit.text()
for filename in os.listdir(didi):            
    if filename.endswith(".txt"):
        filepath = os.path.join(didi, filename)
        with open(filepath, errors='ignore') as file:
            s = file.read()
            s = s.replace('| |', '|')
        with open(filepath, "w") as file:
            file.write(s)

Try the following code:

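chunk_size = 5000  # number of lines to collect before each write
buffer = ""
i = 0

# fileinpath and fileoutpath must be defined beforehand and must differ
# 'w' starts from an empty output file, so a rerun does not append twice
with open(fileoutpath, 'w') as fout:
    with open(fileinpath, 'r') as fin:
        for line in fin:
            buffer += line.replace('| |', '|')
            i += 1
            if i == chunk_size:
                # flush the full buffer to disk and start a new one
                fout.write(buffer)
                i = 0
                buffer = ""
    # write whatever is left after the last full chunk
    if buffer:
        fout.write(buffer)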

This code reads the file into memory one line at a time.

It stores the results in a buffer, which holds at most chunk_size lines at a time. Once the buffer is full, it is written to the output file and emptied. This repeats until the end of the file. After the reading loop, any lines left in the buffer are written to disk.

This way you control not only the number of lines held in memory but also the number of disk writes. Writing to the file after every single line may not be a good idea, and neither is a chunk_size that is too large. It is up to you to find a chunk_size value that fits your problem.

Note: You can get a similar result with the buffering parameter of open(); see the documentation for details. The logic is much the same.
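As a rough sketch of the buffering approach mentioned in the note: the version below leaves the batching of writes to open() instead of counting lines by hand. The function name replace_in_file, its parameters, and the 1 MiB default are assumptions chosen for illustration; they are not part of the original answer.

```python
def replace_in_file(src_path, dst_path, old="| |", new="|",
                    buffer_bytes=1024 * 1024):
    """Copy src_path to dst_path, replacing `old` with `new` line by line.

    Hypothetical helper: the name, parameters and default buffer size are
    illustrative. `buffering` asks open() for an I/O buffer of roughly
    `buffer_bytes` bytes, so data reaches the disk in large batches.
    """
    with open(src_path, "r", errors="ignore", buffering=buffer_bytes) as fin, \
         open(dst_path, "w", buffering=buffer_bytes) as fout:
        for line in fin:  # only one line is processed at a time
            fout.write(line.replace(old, new))
```

Calling replace_in_file(fileinpath, fileoutpath) should then play the same role as the loop above, with buffer_bytes standing in for chunk_size.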

Try reading the file line by line instead of in one giant chunk, i.e.:

with open(writefilepath, "w", errors='ignore') as filew:
    with open(readfilepath, "r", errors='ignore') as filer:
        # enumerate supplies the line counter used in the progress output
        for cnt, line in enumerate(filer, start=1):
            print("Line {}: {}".format(cnt, line.strip()))
            line = line.replace('| |', '|')
            filew.write(line)
