I am trying to find a fast way to search for strings in a file. I don't have just one string to find: I have a list of 1900 strings to look for in a 150MB file. So basically I open the file and loop 1900 times to find all occurrences of each string in it. Here are some of the attributes of my search.
First, I am trying to find the best way to search the file. My code is taking too long, and I am not sure this is the best way to do it:
# searchstrings is a list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()
This code does the job, but it's extremely slow. It also gives me no way to get the line above or below the one where the search string is found.
The other piece of code, which I use to replace the strings, is below. It is also extremely slow. Here I am using regex.
import re

file = open("mytextfile.txt", "r")
file_data = file.read()
# searchstrings is a list of 1900 strings
# replacestrings is a list of the 1900 corresponding replacement strings
for i in range(len(searchstrings)):
    src_str = re.compile(searchstrings[i], re.IGNORECASE)
    file_data = src_str.sub(replacestrings[i], file_data)
file.close()
I know the performance of the code depends on computing power as well; however, I just want to know the best way to write this code so that it runs at optimum speed on the given hardware. I would also like to know how to time the program execution.
A few observations.
For idiomatic Python, you usually want
for string in searchstrings:
    ...
instead of
for i in range(len(searchstrings)):
    searchstrings[i]
and with open(filename) as f: ... instead of open()/close(). The with statement will close the file automatically.
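As a rough sketch of how those two idioms fit together, the search loop from the question could look like the following. The function names and the before parameter are made up for illustration, and the deque is just one way to keep the line above a match. This is mainly tidier rather than faster: it still checks every string against every line. time.perf_counter() is one standard way to time the run.

```python
import time
from collections import deque


def find_matching_lines(path, searchstrings, before=1):
    """Yield (previous_lines, line) for each line containing any search string.

    `before` is a hypothetical knob for how many preceding lines to keep.
    """
    previous = deque(maxlen=before)  # holds at most the last `before` lines
    with open(path, "r") as f:  # the file is closed automatically
        for line in f:
            # any() stops at the first search string found in the line
            if any(s in line for s in searchstrings):
                yield list(previous), line
            previous.append(line)


def timed_search(path, searchstrings):
    """Run the search and return (matches, elapsed_seconds)."""
    start = time.perf_counter()
    matches = list(find_matching_lines(path, searchstrings))
    return matches, time.perf_counter() - start
```

Calling timed_search("mytextfile.txt", searchstrings) would return the matches together with the wall-clock time the search took.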
When you want to replace any of several strings with a regex, you can do
re.sub('|'.join(YOUR_STRINGS), replacement, text)
instead of looping over them all individually, because | is the regex symbol for "or". If the strings are meant literally rather than as patterns, pass each one through re.escape() first.
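Since the question has a different replacement for each search string, one way to extend the joined-pattern idea is to pass a function to sub() that looks up the matched text. The sketch below assumes the search strings are non-empty plain literals, not regexes, and that no replacement should itself be searched again, which differs from the sequential loop in the question. build_replacer is a made-up name.

```python
import re


def build_replacer(searchstrings, replacestrings):
    """Return a function that applies all replacements in one pass over the text."""
    # Look up replacements by lower-cased match because the pattern ignores case.
    mapping = {s.lower(): r for s, r in zip(searchstrings, replacestrings)}
    # Try longer strings first so "category" is not cut short by "cat".
    ordered = sorted(searchstrings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered), re.IGNORECASE)

    def replace_all(text):
        # Leave the match unchanged if unusual case folding misses the lookup.
        return pattern.sub(
            lambda m: mapping.get(m.group(0).lower(), m.group(0)), text
        )

    return replace_all
```

With replace_all = build_replacer(searchstrings, replacestrings), the whole file content is scanned once by replace_all(file_data) instead of 1900 times.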
For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language and is often much faster.
On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool like Ag or RipGrep, which have already been optimized for this job, possibly through the subprocess.run() function if you're working in Python.
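A minimal sketch of the subprocess.run() route, assuming ripgrep is installed and available as rg on the PATH. The helper names are invented for illustration. The search strings are written to a temporary pattern file and treated as fixed strings, and --context asks for the lines above and below each match.

```python
import os
import shutil
import subprocess
import tempfile


def build_rg_command(pattern_file, target, context=1):
    """Build the ripgrep argument list (hypothetical helper)."""
    return [
        "rg",
        "--fixed-strings",          # treat patterns as literal strings
        "--context", str(context),  # lines to show around each match
        "--file", pattern_file,     # read one pattern per line from this file
        target,
    ]


def search_with_ripgrep(searchstrings, target, context=1):
    """Return ripgrep's output for all search strings at once."""
    if shutil.which("rg") is None:
        raise RuntimeError("ripgrep (rg) was not found on PATH")
    with tempfile.TemporaryDirectory() as tmpdir:
        pattern_file = os.path.join(tmpdir, "patterns.txt")
        with open(pattern_file, "w") as f:
            f.write("\n".join(searchstrings) + "\n")
        result = subprocess.run(
            build_rg_command(pattern_file, target, context),
            capture_output=True,
            text=True,
        )
    # ripgrep exits with 1 when nothing matched; larger codes signal an error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

search_with_ripgrep(searchstrings, "mytextfile.txt") would then hand the whole search to the external tool in a single call.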
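I like Unix commands; they are fun, fast and efficient.
import re, sys
# Print every stdin line that matches the regex given as the first argument.
# (A bare map() is lazy in Python 3 and would never call sys.stdout.write.)
sys.stdout.writelines(line for line in sys.stdin if re.search(sys.argv[1], line))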