Efficiently search for many different strings in a large file

I am trying to find a fast way to search for strings in a file. I don't have just one string to find: I have a list of 1900 strings to find in a 150 MB file. So basically I am opening the file and looping 1900 times to find all occurrences of each string. Here are some attributes of my search.

  1. The file to be searched is a 150 MB text file.
  2. I need to find all occurrences of 1900 strings in the file, which means I am looping over the entire file 1900 times.
  3. It's not a simple search; I have to use regex to search for the strings.
  4. In a few cases, I need the line above and the line below the one where I found the search string, so I need to use file.readlines(), not file.read().
  5. In a few cases I also have to replace the matched string with a new string.

First I am trying to find the best way to search in the file. My code is taking too long, and I am not sure this is the best way to do it:

# searchstrings is a list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()

This code does the job, but it's extremely slow. It also doesn't give me the option of getting the line above or below the one where the search string is found.

The code I am using to replace the strings is below. It is also extremely slow. Here I am using regex.

import re

file = open("mytextfile.txt", "r")
file_data = file.read()
# searchstrings is a list of 1900 strings
# replacestrings is a list of the 1900 corresponding replacement strings
for i in range(len(searchstrings)):
    src_str = re.compile(searchstrings[i], re.IGNORECASE)
    file_data = src_str.sub(replacestrings[i], file_data)
file.close()

I know the performance of the code depends on computing power as well; however, I just want to know the best way to write this code so that it runs at optimum speed on the given hardware. I would also like to know how to time the program's execution.

A few observations.

For idiomatic Python, you usually want

for string in searchstrings:
    ...

instead of

for i in range(len(searchstrings)):
    searchstrings[i]

and with open(filename) as f: ... instead of open()/close(). The with statement closes the file automatically.
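
Putting both idioms together, a minimal sketch of the search loop might look like the following. The find_with_context helper and its parameter names are hypothetical, not from the question. It assumes plain substring matching and reads the file once, keeping one line of context on either side of each match.

```python
from itertools import chain


def find_with_context(path, searchstrings):
    """Yield (line_number, previous_line, line, next_line) for each matching line.

    Hypothetical helper: assumes plain substring matching and a single pass.
    """
    with open(path) as f:  # closed automatically, even if an error occurs
        previous = None
        current = None
        # A trailing None sentinel lets the last real line be examined too.
        for number, following in enumerate(chain(f, [None])):
            if current is not None and any(s in current for s in searchstrings):
                # `number` is the 1-based line number of `current`.
                yield number, previous, current, following
            previous, current = current, following
```

Because the generator looks one line ahead, it never needs file.readlines(), so the whole 150 MB file does not have to sit in a list. This sketch still tests every search string against every line; it tidies the loop and adds context, but on its own it is not expected to be much faster.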

When you want to replace any of several strings with a regex, you can do

re.sub('|'.join(YOUR_STRINGS), replacement, text)

instead of looping over them individually, because | is the regex symbol for "or". If the strings are literal text rather than patterns, pass each one through re.escape() first.
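
The question pairs each search string with its own replacement, which a single replacement argument cannot express. One way to keep the single pass is to give re.sub a callback. This is a sketch under stated assumptions: the search strings are literal ASCII text (hence re.escape), matching is case-insensitive as in the question, and replace_all is a hypothetical name.

```python
import re


def replace_all(text, searchstrings, replacestrings):
    """Replace every search string with its paired replacement in one pass.

    Assumes literal, ASCII search strings and case-insensitive matching.
    """
    # Lower-case keys so the lookup agrees with the case-insensitive match.
    mapping = {s.lower(): r for s, r in zip(searchstrings, replacestrings)}
    # Longest first, so "foobar" wins over "foo" at the same position.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        "|".join(re.escape(s) for s in alternatives), re.IGNORECASE
    )
    # The callback returns the replacement for whichever alternative matched.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)
```

Note that this is not strictly equivalent to the original loop: replacing in one pass means the output of one replacement is never rescanned for other search strings.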

For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language and is often much faster, especially for loop-heavy pure-Python code.
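
The question also asks how to time the program. To compare two approaches, or the same script under CPython and PyPy, a small wrapper around time.perf_counter() is usually enough. The timed helper below is a hypothetical name used for illustration.

```python
import time


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, total, seconds = timed(sum, range(1_000_000)) returns the sum along with the wall-clock time the call took. For very short snippets, the standard timeit module repeats the call many times and gives steadier numbers.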

On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool such as Ag or ripgrep, which have already been optimized for this job, possibly calling it through subprocess.run() if you're working in Python.
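
As a rough sketch of the subprocess route, the following assumes ripgrep is installed and available as rg, and that the search patterns have been written one per line to a file. The function names and the patterns-file approach are illustrative assumptions, not something the question specifies.

```python
import subprocess


def build_rg_command(patterns_file, target, context=1):
    """Build an rg command line (assumes ripgrep is installed as `rg`)."""
    return [
        "rg",
        "--ignore-case",            # comparable to re.IGNORECASE in the question
        "--line-number",
        "--context", str(context),  # lines shown above and below each match
        "--file", patterns_file,    # file with one pattern per line
        target,
    ]


def search_with_rg(patterns_file, target, context=1):
    """Run rg and return its stdout as text."""
    completed = subprocess.run(
        build_rg_command(patterns_file, target, context),
        capture_output=True,
        text=True,
    )
    # rg exits with 0 on a match, 1 when nothing matched, 2 on an error.
    if completed.returncode > 1:
        raise RuntimeError(completed.stderr)
    return completed.stdout
```

The --context option covers the "line above and line below" requirement without any extra bookkeeping in Python. Replacement would still need a separate step.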

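I like Unix commands; they are fun, fast and efficient. Here is a minimal grep-like filter in Python that reads stdin and takes the regex as its first argument: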

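import re, sys
# Compile the regex given as the first command-line argument once
pattern = re.compile(sys.argv[1])
# In Python 3, map() is lazy and would write nothing, so use writelines()
sys.stdout.writelines(line for line in sys.stdin if pattern.search(line))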
