How to search if a string is in a very large file in Python

I have a 100 GB text file containing 100 billion digits of pi, and I need a fast way to check whether a 21-digit number appears anywhere in it. Note that the whole file is a single line, with no line breaks. I have this function, which uses a large buffer (512 MiB) to load the file piece by piece and check whether the number is there:

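import os

def fnd(s):
    """Return the byte offset of digit string s in the file, or -1 if absent."""
    path = "pi_dec_1t_01.txt"
    start = 2                   # skip the leading "3."
    bsize = 536870912           # 512 MiB read buffer
    needle = s.encode("ascii")  # the file is opened in binary mode, so search for bytes
    overlap = len(needle) - 1   # bytes re-read so a match spanning two reads is not missed
    fsize = os.path.getsize(path)
    # Binary mode keeps tell()/seek() arithmetic well defined and skips text decoding.
    with open(path, "rb") as f:
        f.seek(start)
        while True:
            # Step back by the overlap before every read except at the very start or at EOF.
            if overlap <= f.tell() < fsize:
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if not buffer:
                return -1       # reached end of file without a match
            pos = buffer.find(needle)
            if pos >= 0:
                # Convert the position inside the buffer to an absolute file offset.
                return f.tell() - (len(buffer) - pos)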

This is fast enough when I search for only one of these numbers, but I need to search for up to 2 billion of them (until I find one), which would literally take centuries. Is there a time-efficient way to do this? I am open to using another language or platform.

You could look at the pyahocorasick package and read up on the Aho-Corasick algorithm it implements, which searches for many patterns in a single pass over the text: https://pyahocorasick.readthedocs.io/en/latest/
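
To illustrate the single-pass idea without any third-party dependency, here is a minimal sketch that uses only the standard library. It is not the pyahocorasick API and not Aho-Corasick itself: because every number in the question has the same length (21 digits), it swaps in a simpler technique, sliding a fixed-width window over the data and testing each window for membership in a set. The function name find_any and the chunk_size parameter are hypothetical, chosen for this example. The pure-Python inner loop is far too slow for 100 billion positions, and a Python set holding 2 billion entries would need a great deal of memory, so treat this as a picture of the approach rather than a ready solution.

```python
import io

def find_any(stream, patterns, chunk_size=1 << 20):
    """Scan a binary stream once; return (offset, pattern) for the first
    position at which any pattern occurs, or None if none occurs.

    Assumption for this sketch: all patterns are bytes of the same length.
    Offsets are relative to the stream position when the call started.
    """
    wanted = set(patterns)
    if not wanted:
        return None
    width = len(next(iter(wanted)))
    if any(len(p) != width for p in wanted):
        raise ValueError("all patterns must have the same length")

    tail = b""   # last width-1 bytes of the previous data, kept for overlap
    base = 0     # stream offset of the first byte of tail + chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None
        data = tail + chunk
        # Every window of `width` bytes is checked against the set exactly once.
        for i in range(len(data) - width + 1):
            window = data[i:i + width]
            if window in wanted:
                return base + i, window
        # Keep the trailing width-1 bytes so windows spanning two reads are seen.
        keep = width - 1
        new_tail = data[-keep:] if keep else b""
        base += len(data) - len(new_tail)
        tail = new_tail

# Small in-memory demonstration; a real file would be opened with open(path, "rb").
digits = io.BytesIO(b"1415926535897932384626433832795")
result = find_any(digits, [b"000", b"535", b"979"], chunk_size=4)
```

For the file in the question, the same function could be called on a handle opened in binary mode after f.seek(2), adding 2 to the returned offset. The cost of one pass no longer grows with the number of patterns, which is the property the Aho-Corasick suggestion is after.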
