
Fastest way to find and replace specific line in a large text file with Python

I have a numbers.txt file that consists of several hundred thousand lines, each one made up of two unique numbers separated by a : sign:

407597693:1604722326.2426915
510905857:1604722326.2696202
76792361:1604722331.120079
112854912:1604722333.4496727
470822611:1604722335.283259

My goal is to locate the line with the number 407597693 on the left side and then change the number on the right side by adding 3600 to it. After that, I have to rewrite the numbers.txt file with the change applied. I have to perform the same operation (with a different number each time) on the same txt file, as fast as possible.

I have managed to make it work using with open file operations and a for loop over the lines: searching for the needed number, modifying the line, and then rewriting the whole file. However, I've noticed that each such operation takes about 0.2-0.5 sec, which adds up over time and slows my program down considerably.

Here is the code I am using:

import os
import time

number = 407597693

# read the whole file into memory
with open("numbers.txt", "r+") as library:
    file = library.read()

if (str(number) + ":") in file:
    lines = file.splitlines()
    # copy every line to a temp file, replacing the matching one
    with open("numbers_temp.txt", "a+") as library_temp:
        for line in lines:
            if (str(number) + ":") in line:
                library_temp.write(
                    "\n" + str(number) + ":" + str(time.time() + 3600)
                )
            else:
                library_temp.write("\n" + line)

        # read the temp file back and overwrite the original with it
        library_temp.seek(0)
        new_file = library_temp.read()

        with open("numbers.txt", "w+") as library_2:
            library_2.write(new_file)

    os.remove("numbers_temp.txt")

I would really appreciate any input on how to speed up this process, many thanks in advance!

You can open a memory-mapped file, use a regular expression to find the line you want, and with any luck you'll only have to change one page in the file. I'm using the decimal module so that you don't have decimal-to-binary float conversion problems. Usually the new number and the old number will be the same width, so the file contents will not need to be moved. I'm showing a Linux example. mmap.mmap is a bit different on Windows but should be easy to use.

import mmap
import re
from decimal import Decimal

def increment_record(filename, findval, increment):
    with open(filename, "rb+") as fp:
        with mmap.mmap(fp.fileno(), 0) as fmap:
            # "^" anchors the id at a line start so it cannot match inside a longer id
            search = re.search(rf"^{findval}:([\d.]+)".encode("ascii"), fmap,
                    re.MULTILINE)
            if search:
                # found float to change. Use Decimal for base 10 precision
                newval = Decimal(search.group(1).decode("ascii")) + increment
                newval = f"{newval}".encode("ascii")
                delta = len(newval) - len(search.group(1))
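                if delta > 0:
                    # value got longer: grow the file, then shift the tail right
                    fsize = fmap.size()
                    fmap.resize(fsize + delta)
                    fmap.move(search.end(1) + delta, search.end(1),
                        fsize - search.end(1))
                elif delta < 0:
                    # value got shorter: shift the tail left, then truncate
                    fsize = fmap.size()
                    fmap.move(search.end(1) + delta, search.end(1),
                        fsize - search.end(1))
                    fmap.resize(fsize + delta)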
                # change just the number
                fmap[search.start(1):search.start(1) + len(newval)] = newval

# test parameters
filename = "test.txt"
findme = "76792361"
increment = 3600

testdata = """407597693:1604722326.2426915
510905857:1604722326.2696202
76792361:1604722331.120079
112854912:1604722333.4496727
470822611:1604722335.283259"""

with open(filename, "w") as fp:
    fp.write(testdata)

increment_record(filename, findme, increment)

print("changes:")
for old,new in zip(testdata.split("\n"), open(filename)):
    new = new.strip()
    if old != new:
        print((old,new))
print("done")
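
The claim that the old and new values usually have the same width can be checked without touching any file. The following is only a sketch: width_delta is a hypothetical helper (not part of the answer's code) that reports how many characters the text grows by when the increment is applied with Decimal.

```python
from decimal import Decimal


def width_delta(old_text, increment):
    """Return len(new) - len(old) for the decimal text after adding increment."""
    new_text = str(Decimal(old_text) + increment)
    return len(new_text) - len(old_text)


# A value from the sample data: same width, so an in-place overwrite suffices.
print(width_delta("1604722331.120079", 3600))  # 0

# A made-up value that gains a digit through the carry: the tail must be moved.
print(width_delta("9999999999.5", 3600))  # 1
```

A result of 0 corresponds to the path in increment_record where only the number itself is overwritten; any other result means the rest of the file has to be shifted.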

This assumes the whole file fits in memory. Using a regex substitution instead of a Python loop over the lines should be faster:

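import re
import time

number = 407597693
with open("numbers.txt", "r") as f:
    data = f.read()
# Alternative: add 3600 to the stored value instead of using the current time
# data = re.sub(rf'^({number}):(.*)$', lambda x: f"{x.group(1)}:{float(x.group(2)) + 3600}", data, flags=re.MULTILINE)
# "^" and ":" pin the match to the exact id; "$" also covers a last line without a newline
data = re.sub(rf"^{number}:.*$", f"{number}:{int(time.time()) + 3600}", data, flags=re.MULTILINE)
with open("numbers.txt", "w") as f:
    f.write(data)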
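
One thing worth checking with any pattern of this kind is that the id is followed by the colon, because otherwise a longer id that merely starts with the same digits would be rewritten too. The sketch below uses made-up two-line data (the first id is the target id with an extra 0 appended) to compare a loose and a strict pattern; the names loose and strict are only for illustration.

```python
import re

# Hypothetical data: the first id merely starts with the digits of the second.
data = "4075976930:1.0\n407597693:2.0\n"
number = 407597693

loose = re.compile(rf"^{number}.*$", re.MULTILINE)    # no colon after the id
strict = re.compile(rf"^{number}:.*$", re.MULTILINE)  # colon pins the full id

print(loose.findall(data))   # ['4075976930:1.0', '407597693:2.0']
print(strict.findall(data))  # ['407597693:2.0']
```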

Rather than running multiple loops, we can do this in a single loop, as follows:

number = 407597693
prefix = str(number) + ':'
updated = []
found = False
with open('numbers.txt', 'r') as inputfile:
    for line in inputfile.read().splitlines():
        if line.startswith(prefix):
            # add 3600 to the value on the right-hand side
            updated.append(prefix + str(float(line.split(':')[1]) + 3600))
            found = True
        else:
            updated.append(line)

# rewrite the file only if the number was actually found
if found:
    with open('numbers.txt', 'w') as updatedFile:
        updatedFile.write('\n'.join(updated) + '\n')

Hope this helps.
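
Since the question says the same update is repeated many times, another option (a different technique from the loop above) is to parse the file once into a dict, apply all updates in memory, and write the file back once. This is only a sketch under the assumption that nothing else modifies numbers.txt in the meantime; load_records, bump and save_records are hypothetical helper names, and Decimal is used to keep the stored digits exact.

```python
from decimal import Decimal


def load_records(path):
    """Read 'id:value' lines into a dict (insertion order follows the file)."""
    records = {}
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                key, _, value = line.partition(":")
                records[key] = value
    return records


def bump(records, number, increment=3600):
    """Add increment to the value stored for number; return False if absent."""
    key = str(number)
    if key not in records:
        return False
    records[key] = str(Decimal(records[key]) + increment)
    return True


def save_records(path, records):
    """Write all records back as 'id:value' lines."""
    with open(path, "w") as f:
        f.writelines(f"{key}:{value}\n" for key, value in records.items())
```

Typical use would be records = load_records("numbers.txt"), then any number of bump(records, some_number) calls, then a single save_records("numbers.txt", records). Each lookup is a dict access rather than a scan of the file.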
