I have a 100 GB text file containing 100 billion digits of pi, and I need a fast way to check whether a given 21-digit number occurs anywhere in it. The whole file is a single line, with no line breaks. I have this function, which uses a large buffer (about 500 MB) to load parts of the file and check whether the number is there:
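import os

def fnd(s):
    # Return the byte offset of the digit string s in the file, or -1 if absent.
    path = "pi_dec_1t_01.txt"
    pattern = s.encode("ascii")
    start = 2                   # skip the leading "3."
    bsize = 536870912           # read buffer size in bytes
    overlap = len(pattern) - 1  # bytes re-read so a match spanning two buffers is not missed
    fsize = os.path.getsize(path)
    with open(path, 'rb') as f:  # binary mode: seek/tell are plain byte offsets
        f.seek(start)
        while True:
            if overlap <= f.tell() < fsize:
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if not buffer:
                return -1
            pos = buffer.find(pattern)
            if pos >= 0:
                return f.tell() - (len(buffer) - pos)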
It is fast enough when I search for only one of these numbers, but I need to search for up to 2 billion of them (stopping at the first one found), which would literally take centuries. Is there a time-efficient way to do this? I am open to using another language or platform.
You could look at this package, which implements the Aho-Corasick algorithm for matching many patterns in a single pass over the text, and read up on the algorithm: https://pyahocorasick.readthedocs.io/en/latest/
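To give a feel for the idea, here is a minimal pure-Python sketch of Aho-Corasick. It does not use the package's API. The names `build_automaton` and `search_chunks` are made up for illustration. The automaton state is carried from one chunk to the next, so no overlap between buffers is needed. Pure Python like this would be far too slow and memory-hungry for a 100 GB file and billions of patterns; it only shows why one pass can test all patterns at once.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick tables (goto, fail, out) for a list of strings."""
    goto = [{}]   # goto[state][char] -> next state (the trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end when this state is reached
    for p in patterns:
        state = 0
        for ch in p:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(p)
    # Breadth-first pass to fill in failure links, shallowest states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search_chunks(chunks, automaton):
    """Yield (start_position, pattern) for every match across the chunks."""
    goto, fail, out = automaton
    state = 0
    pos = 0  # index of the current character in the whole stream
    for chunk in chunks:
        for ch in chunk:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for p in out[state]:
                yield pos - len(p) + 1, p
            pos += 1

# Usage on a short digit string split into arbitrary chunks.
digits = "1415926535897932384626433832795028841971"
patterns = ["5358979", "8462643", "9999999"]
chunks = [digits[:10], digits[10:25], digits[25:]]
hits = list(search_chunks(chunks, build_automaton(patterns)))
print(hits)
```

The scan cost depends on the length of the text rather than on the number of patterns, which is what makes this approach attractive compared with running a separate search per number.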