Fastest way to find and replace a specific line in a large text file with Python
I have a numbers.txt file that consists of several 100K lines, each one made up of two unique numbers separated by a : sign:
407597693:1604722326.2426915
510905857:1604722326.2696202
76792361:1604722331.120079
112854912:1604722333.4496727
470822611:1604722335.283259
My goal is to locate the line with the number 407597693 on the left side and then change the number on the right side by adding 3600 to it. After that, I have to rewrite the numbers.txt file with all the changes. I must perform the same operation (just with a different number) on the same txt file repeatedly, as fast as possible.
I have managed to make it work using with open: file operations and a for loop over each line: searching for the needed number, modifying the line, and then rewriting the whole file. However, I've noticed that each such operation takes about 0.2-0.5 sec, which adds up over time and slows my program down considerably.
Here is the code I am using:
import os
import time

number = 407597693
with open("numbers.txt", "r+") as library:
    file = library.read()
    if (str(number) + ":") in file:
        lines = file.splitlines()
        with open("numbers_temp.txt", "a+") as library_temp:
            for line in lines:
                if (str(number) + ":") in line:
                    library_temp.write(
                        "\n" + str(number) + ":" + str(time.time() + 3600)
                    )
                else:
                    library_temp.write("\n" + line)
            library_temp.seek(0)
            new_file = library_temp.read()
            with open("numbers.txt", "w+") as library_2:
                library_2.write(new_file)
        os.remove("numbers_temp.txt")
I would really appreciate any input on how to speed up this process. Many thanks in advance!
You can open a memory-mapped file, use a regular expression to find the line you want, and with any luck you'll only have to change one page in the file. I'm using the decimal module so that you don't have decimal-to-binary float conversion problems. Usually the new number and the old number will be the same width, so the file contents will not need to be moved. I'm showing a Linux example. On Windows, mmap.mmap is a bit different but should be easy to use.
import mmap
import re
from decimal import Decimal

def increment_record(filename, findval, increment):
    with open(filename, "rb+") as fp:
        with mmap.mmap(fp.fileno(), 0) as fmap:
            # anchor at line start so findval cannot match the tail of a longer key
            search = re.search(rf"^{findval}:([\d\.]+)".encode("ascii"), fmap,
                               re.MULTILINE)
            if search:
                # found float to change. Use Decimal for base 10 precision
                newval = Decimal(search.group(1).decode("ascii")) + increment
                newval = f"{newval}".encode("ascii")
                start, end = search.span(1)
                delta = len(newval) - (end - start)
                fsize = fmap.size()
                if delta > 0:
                    # new value is wider: grow the file, then shift the tail right
                    fmap.resize(fsize + delta)
                    fmap.move(end + delta, end, fsize - end)
                elif delta < 0:
                    # new value is narrower: shift the tail left, then shrink the file
                    fmap.move(end + delta, end, fsize - end)
                    fmap.resize(fsize + delta)
                # change just the number
                fmap[start:start + len(newval)] = newval

# test parameters
filename = "test.txt"
findme = "76792361"
increment = 3600
testdata = """407597693:1604722326.2426915
510905857:1604722326.2696202
76792361:1604722331.120079
112854912:1604722333.4496727
470822611:1604722335.283259"""
with open(filename, "w") as fp:
    fp.write(testdata)
increment_record(filename, findme, increment)
print("changes:")
with open(filename) as fp:
    for old, new in zip(testdata.split("\n"), fp):
        new = new.strip()
        if old != new:
            print((old, new))
print("done")
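The same-width observation above can also be used without mmap. The following is only a sketch under that assumption: the function name increment_in_place is made up for illustration, and it reads the whole file once to find the match, then overwrites just the value bytes with seek and write. If the new value has a different width it gives up and returns False, in which case a full rewrite would still be needed.

```python
import re
from decimal import Decimal

def increment_in_place(filename, key, increment):
    """Hypothetical helper: overwrite the value for `key` without rewriting the file.

    Returns True if the value was updated in place, False if the key was not
    found or the new value has a different width than the old one.
    """
    with open(filename, "rb+") as fp:
        data = fp.read()
        pattern = rf"^{re.escape(str(key))}:([\d.]+)".encode("ascii")
        match = re.search(pattern, data, re.MULTILINE)
        if match is None:
            return False
        old = match.group(1)
        # Decimal keeps the base 10 digits exactly as they appear in the file
        new = str(Decimal(old.decode("ascii")) + increment).encode("ascii")
        if len(new) != len(old):
            return False  # width changed, so the tail of the file would have to move
        fp.seek(match.start(1))
        fp.write(new)
        return True
```

Only the bytes of the matched value are written back, so the write cost does not grow with the file size, although the initial read still does.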
I assume the whole file fits in memory. Using a regex should be faster:
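import re
import time

number = 407597693
with open("numbers.txt", "r") as f:
    data = f.read()
# alternative: add 3600 to the stored value instead of using the current time
# data = re.sub(rf'^({number}):(.*)$', lambda x: f"{x.group(1)}:{float(x.group(2)) + 3600}", data, flags=re.MULTILINE)
# match the whole line for this number; the ":" keeps longer keys with the same prefix from matching
data = re.sub(rf"^{number}:.*$", f"{number}:{int(time.time()) + 3600}", data, flags=re.MULTILINE)
with open("numbers.txt", "w") as f:
    f.write(data)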
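If the goal is to add 3600 to the value already stored, as described in the question, rather than to write the current time, the substitution can take a callback. This is a minimal sketch of that variant: the name add_to_value is invented here, and it uses Decimal instead of float as an assumption, so that the digits after the decimal point are not altered by binary rounding.

```python
import re
from decimal import Decimal

def add_to_value(data, number, increment):
    """Hypothetical helper: return `data` with `increment` added to the value of `number`."""
    # Anchor on both ends so only the exact key on its own line is matched
    pattern = re.compile(rf"^({re.escape(str(number))}):([\d.]+)$", re.MULTILINE)

    def bump(match):
        # group(1) is the key, group(2) the stored value
        return f"{match.group(1)}:{Decimal(match.group(2)) + increment}"

    # count=1 stops after the first match, since keys are unique
    return pattern.sub(bump, data, count=1)
```

The function works on the string read from the file, so it slots in between the read and the write in the code above.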
Rather than having to run multiple loops, we can do this in a single loop, as follows:
number = 407597693
numbers = ''
with open('numbers.txt', "r") as inputfile:
    file = inputfile.read()
prefix = str(number) + ':'
if file.find(prefix) != -1:
    for line in file.splitlines():
        if line.startswith(prefix):
            # matching line: add 3600 to the value on the right
            key, value = line.split(':')
            numbers += key + ':' + str(float(value) + 3600) + '\n'
        else:
            numbers += line + '\n'
    # only rewrite the file when the number was actually present
    with open('numbers.txt', 'w') as updatedFile:
        updatedFile.write(numbers)
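The single-loop idea can also be written so that the file is streamed line by line instead of being built up as one string. The sketch below is only one possible arrangement, not part of the answer above: the name update_line_streaming is hypothetical, and writing to a temporary file that is then swapped in with os.replace is an added assumption, chosen so that numbers.txt is never left half-written.

```python
import os
import tempfile

def update_line_streaming(path, number, new_value):
    """Hypothetical helper: replace the value stored for `number` in a single pass.

    Returns True if a matching line was found and the file was replaced.
    """
    prefix = f"{number}:"
    found = False
    # Keep the temporary file in the same directory so os.replace stays on one filesystem
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r") as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as dst:
        for line in src:
            if line.startswith(prefix):
                line = f"{prefix}{new_value}\n"
                found = True
            dst.write(line)
    if found:
        os.replace(dst.name, path)  # swap the updated copy into place
    else:
        os.remove(dst.name)  # nothing changed, discard the copy
    return found
```

For example, update_line_streaming("numbers.txt", 407597693, "1604725926.2426915") would rewrite only that line's value and leave every other line as it was.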
Hopefully this helps.