
Best way to search multiple files for keywords efficiently in python 3.x?

Sorry if this has been asked before, but I couldn't find a solution to my problem.

I have around 500 text files, each around 5-6 kB in size. I need to search every file and check if a particular keyword is present in it, and print the details of every file in which the keyword is present.

I can do this using

for files in glob.glob("*"):
      and then search for the keyword inside the file

I'm sure this isn't the most efficient way to do this. What better way is there?

If you want all *.c files in your directory which include the stdio.h file, you could do

grep "stdio\.h" *.c

(note - edited to respond to @Wooble's comment.)

The result might look like this

myfile.c: #include <stdio.h>
thatFile.c: #include <stdio.h>

etc.

If you want to see "context" (e.g. the line before and after), use the -C flag:

grep -C1 "(void)" *.c

result:

scanline.c-
scanline.c:int main(void){
scanline.c-  double sum=0;
--
tour.c-
tour.c:int main(void) {
tour.c-int *bitMap;

etc.

I think this should work well for you.

Again, addressing @Wooble's other point: if you really want to do this with Python, you could use

import subprocess

p = subprocess.Popen('grep stdio *.c', shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in p.stdout:
    print(line.decode(), end='')
retval = p.wait()

Now you have access to the output "in Python" and can do clever things with the lines as you see fit.
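On Python 3 you could also use subprocess.run instead of Popen; a minimal sketch of the same call (it still assumes grep is available on the PATH):

import subprocess

# text=True decodes stdout to str; stderr is folded into stdout as before
result = subprocess.run('grep stdio *.c', shell=True,
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                        text=True)
for line in result.stdout.splitlines():
    print(line)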

grep isn't always an option. If you're writing a python script to be used in a work environment, and that environment happens to be primarily Windows, then you're biting off dependency management for your team when you tell them they need to have grep installed. That's no good.

I haven't found anything faster than glob for searching the filesystem, but there are ways to speed up searching through your files. For example, if you know your files have a lot of short lines (as JSON or XML files often do), you can skip any line that is shorter than your shortest keyword.
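A minimal sketch of that idea in pure Python (the keyword and file pattern here are illustrative, and it assumes plain-text files):

import glob

keyword = "stdio"            # illustrative keyword
min_len = len(keyword)

for filename in glob.glob("*.txt"):
    with open(filename, encoding="utf-8", errors="ignore") as f:
        for line in f:
            # a line shorter than the keyword cannot contain it
            if len(line) < min_len:
                continue
            if keyword in line:
                print(filename, line.rstrip(), sep=": ")
                break        # one hit is enough to report the file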

The regex library in Python is fairly slow as well. In my experience it is much faster, sometimes by an order of magnitude, to check each line with a plain substring test such as keyword in line (which effectively compares line[i : i + len(keyword)] == keyword at each offset, but does so in C) than to run a regex on each line.
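If you want to check that claim on your own data, a small timeit sketch along these lines (the sample line and keyword are made up) compares a plain substring test against an equivalent pre-compiled regex:

import re
import timeit

line = "int main(void) { return 0; }"
keyword = "void"
pattern = re.compile(re.escape(keyword))

# plain substring membership test
plain = timeit.timeit(lambda: keyword in line, number=1_000_000)
# equivalent search through the re module
rex = timeit.timeit(lambda: pattern.search(line) is not None, number=1_000_000)

print(f"substring: {plain:.3f}s  regex: {rex:.3f}s")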

I've been doing quite a bit of searching on the filesystem lately, and for a data set of 500 GB my searches started at about 8 hours; I managed to get them down to about 3 hours using simple techniques like these. Tuning takes some time because you are tailoring the strategy to your use case, but you can squeeze a lot of speed out of Python if you do.
