Fastest way to find and replace a specific line in a large text file with Python

Question

I have a numbers.txt file that consists of several hundred thousand lines, each made up of two unique numbers separated by a : sign:

407597693:1604722326.2426915
510905857:1604722326.2696202
76792361:1604722331.120079
112854912:1604722333.4496727
470822611:1604722335.283259

My goal is to locate the line with the number 407597693 on the left side and then change the number on the right side by adding 3600 to it. After that, I have to rewrite the numbers.txt file with all the changes. I must perform the same operation (just with a different number) on the same txt file as fast as possible.

I have managed to make it work using "with open" file operations and a for loop over the lines: searching for the needed number, modifying the line, and then rewriting the whole file. However, I've noticed that each such operation takes my program about 0.2-0.5 sec, which adds up over time and slows everything down considerably.

Here is the code I am using:

import os
import time

number = 407597693

with open("numbers.txt", "r+") as library:
    file = library.read()
    if (str(number) + ":") in file:
        lines = file.splitlines()
        with open("numbers_temp.txt", "a+") as library_temp:
            for line in lines:
                if (str(number) + ":") in line:
                    library_temp.write(
                        "\n" + str(number) + ":" + str(time.time() + 3600)
                    )
                else:
                    library_temp.write("\n" + line)

            library_temp.seek(0)
            new_file = library_temp.read()

            with open("numbers.txt", "w+") as library_2:
                library_2.write(new_file)

        os.remove("numbers_temp.txt")

I would really appreciate any input on how to speed up this process, many thanks in advance!

Answer 1

You can open a memory-mapped file, use a regular expression to find the line you want, and with any luck you'll only have to change one page in the file. I'm using the decimal module so that you don't run into decimal-to-binary float conversion problems. Usually the new number and the old number will be the same width, so the file contents will not need to be moved. I'm showing a Linux example. On Windows, mmap.mmap is a bit different but should be easy to use.

import mmap
import re
from decimal import Decimal

def increment_record(filename, findval, increment):
    with open(filename, "rb+") as fp:
        with mmap.mmap(fp.fileno(), 0) as fmap:
            # anchor at line start so the key cannot match the tail of a longer one
            pattern = rf"^{re.escape(str(findval))}:([\d.]+)".encode("ascii")
            search = re.search(pattern, fmap, re.MULTILINE)
            if search:
                start, end = search.span(1)
                # found float to change. Use Decimal for base 10 precision
                newval = Decimal(search.group(1).decode("ascii")) + increment
                newval = f"{newval}".encode("ascii")
                delta = len(newval) - (end - start)
                fsize = fmap.size()
                if delta > 0:
                    # value got wider: grow the file, then shift the tail right
                    fmap.resize(fsize + delta)
                    fmap.move(end + delta, end, fsize - end)
                elif delta < 0:
                    # value got narrower: shift the tail left, then shrink the file
                    fmap.move(end + delta, end, fsize - end)
                    fmap.resize(fsize + delta)
                # change just the number
                fmap[start:start + len(newval)] = newval

# test parameters
filename = "test.txt"
findme = "76792361"
increment = 3600

testdata = """407597693:1604722326.2426915
510905857:1604722326.2696202
76792361:1604722331.120079
112854912:1604722333.4496727
470822611:1604722335.283259"""

with open(filename, "w") as fp:
    fp.write(testdata)

increment_record(filename, findme, increment)

print("changes:")
with open(filename) as fp:
    for old, new in zip(testdata.split("\n"), fp):
        new = new.strip()
        if old != new:
            print((old, new))
print("done")
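
If mmap is awkward on your platform, the "same width, nothing has to move" idea can also be sketched with plain seek and write. This is only an illustration under assumptions: bump_in_place is a hypothetical helper name, the file is assumed to be ASCII with one key:value pair per line, and the function deliberately gives up (returns False) when the new value has a different width, in which case you would fall back to a full rewrite.

```python
from decimal import Decimal

def bump_in_place(path, key, increment):
    """Overwrite the value stored for `key` without rewriting the file.

    Hypothetical helper: returns True if the line was patched in place,
    False if the key is missing or the new value has a different width.
    """
    prefix = f"{key}:".encode("ascii")
    with open(path, "rb+") as fp:
        offset = 0  # byte offset of the start of the current line
        for line in fp:
            if line.startswith(prefix):
                old = line[len(prefix):].rstrip(b"\r\n")
                new = str(Decimal(old.decode("ascii")) + increment).encode("ascii")
                if len(new) != len(old):
                    return False  # width changed, caller must rewrite the file
                fp.seek(offset + len(prefix))
                fp.write(new)  # touches only the bytes of the value
                return True
            offset += len(line)
    return False
```

Something like bump_in_place("numbers.txt", "76792361", 3600) would then change only the handful of bytes holding the value. It still scans the file to find the line, so it mainly saves the cost of writing everything back.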

Answer 2

I assume the whole file fits in memory. Using a regex should be faster:

import re
import time

number = 407597693
with open("numbers.txt", "r") as f:
    data = f.read()

# Alternative: add 3600 to the stored value instead of using the current time
# data = re.sub(rf"^({number}):(.*)$", lambda x: f"{x.group(1)}:{float(x.group(2)) + 3600}", data, flags=re.MULTILINE)
# The colon keeps the key from matching a longer number that starts with it
data = re.sub(rf"^{number}:.*$", f"{number}:{int(time.time()) + 3600}", data, flags=re.MULTILINE)

with open("numbers.txt", "w") as f:
    f.write(data)
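
Since the question mentions repeating the same operation for different numbers, one option is to collect several pending updates and apply them in a single regex pass, so the file is read and written once per batch rather than once per number. The following is a rough sketch, not part of the answer above: bump_many and the increments mapping are hypothetical names, the text is assumed to use "\n" line endings, and Decimal is swapped in for float to keep the digits exact.

```python
import re
from decimal import Decimal

# One key:value pair per line, e.g. "407597693:1604722326.2426915"
LINE_PATTERN = re.compile(r"^(\d+):([\d.]+)$", re.MULTILINE)

def bump_many(text, increments):
    """Return `text` with the value of every key in `increments` increased.

    `increments` maps a key (as a string) to the amount to add.
    Lines whose key is not in the mapping are left untouched.
    """
    def repl(match):
        key, value = match.group(1), match.group(2)
        if key not in increments:
            return match.group(0)
        return f"{key}:{Decimal(value) + increments[key]}"

    return LINE_PATTERN.sub(repl, text)
```

You would read numbers.txt once, call for example bump_many(data, {"407597693": 3600, "76792361": 3600}), and write the result back once.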

Answer 3

Rather than having to run multiple loops, we can do this in a single loop as follows:

number = 407597693
prefix = str(number) + ':'
updated = []
found = False

with open('numbers.txt', 'r') as inputfile:
    for line in inputfile.read().splitlines():
        if line.startswith(prefix):
            # add 3600 to the value on the right of the colon
            line = prefix + str(float(line.split(':')[1]) + 3600)
            found = True
        updated.append(line + '\n')

# rewrite the file only if the number was found
if found:
    with open('numbers.txt', 'w') as updatedFile:
        updatedFile.writelines(updated)

Hopefully this helps.
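
If the file may not fit in memory, or you want to keep the temp-file approach from the question, the single loop can also stream line by line and swap the result in with os.replace. This is a sketch under assumptions: bump_streaming is a hypothetical name, Decimal is used here instead of float to avoid rounding surprises, and the temp file is created in the same directory so the final rename stays on one filesystem.

```python
import os
import tempfile
from decimal import Decimal

def bump_streaming(path, key, increment):
    """Copy `path` line by line, changing only the line that starts with `key:`.

    Hypothetical helper: returns True if the key was found and the file replaced.
    """
    prefix = f"{key}:"
    found = False
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        # newline="" keeps the original line endings untouched
        with os.fdopen(fd, "w", newline="") as dst, open(path, "r", newline="") as src:
            for line in src:
                if line.startswith(prefix):
                    value = line[len(prefix):].rstrip("\r\n")
                    ending = line[len(prefix) + len(value):]
                    line = f"{prefix}{Decimal(value) + increment}{ending}"
                    found = True
                dst.write(line)
        if found:
            os.replace(tmp_path, path)  # swap the new file in as a single rename
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # key not found or an error occurred
    return found
```

This still writes the whole file, so it is not expected to be faster than the in-memory versions. The gain is that a crash midway leaves the original numbers.txt intact. Note that mkstemp creates the temp file with restrictive permissions, which the replaced file then inherits.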

3 answers

Answer 1
2 2020-11-08 02:51:20

Answer 2
1 accepted 2020-11-08 01:58:12

Answer 3
0 2020-11-08 02:05:10
