What's the fastest way to recursively search for files in python?

I need to recursively search a directory tree and build a list of file paths whose file names contain a certain string. This is how I'm currently doing it:

import os
from glob import iglob

for i in iglob(starting_directory + '/**/*', recursive=True):
    if filemask in os.path.basename(i):  # check the file name only, ignoring directories that contain the filemask
        filelist.append(i)

This works, but when crawling a large directory tree, it's woefully slow (~10 minutes). We're on Windows, so doing an external call to the unix find command isn't an option. My understanding is that glob is faster than os.walk.

Is there a faster way of doing this?

Maybe not the answer you were hoping for, but I think these timings are useful. Run on a directory with 15,424 directories totalling 102,799 files (of which 3059 are .py files).

Python 3.6:

import os
import glob

def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                pys.append(file)
    return pys

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys

# I also tried pathlib.Path.glob but it was slow and error prone, sadly

%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using GNU find (4.6.0) on cygwin (4.6.0-1)

$ time find . -name '*.py' > /dev/null

real    0m8.827s
user    0m1.482s
sys     0m7.284s

Seems like os.walk is as good as you can get.
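
Applied to the original question, the os.walk approach might look something like the sketch below. The name find_files and its parameters are made up for illustration. It tests the substring against the file name only, so a directory whose name happens to contain the mask does not pull in every file beneath it.

```python
import os

def find_files(starting_directory, filemask):
    """Return full paths of files whose name contains filemask (hypothetical helper)."""
    filelist = []
    for dirpath, dirnames, filenames in os.walk(starting_directory):
        for name in filenames:
            # Compare against the file name only, not the whole path.
            if filemask in name:
                filelist.append(os.path.join(dirpath, name))
    return filelist
```

Usage would be along the lines of filelist = find_files(starting_directory, filemask). I have not timed this variant on the tree above, so treat it as a starting point rather than a measured result.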

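Since Python 3.5, os.walk() uses os.scandir() internally, which is the fastest way to list a directory. Calling os.scandir() directly also gives you DirEntry objects, which can be used for other purposes as well; below, for example, I print each file's modification time. The code below implements a recursive search using os.scandir():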

import os
import time

def scantree(path):
    """Recursively yield DirEntry objects for given directory."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

for entry in scantree('/home/'):
    if entry.is_file():
        print(entry.path, time.ctime(entry.stat().st_mtime))
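
To tie this back to the question, the same recursion can filter on the file name directly. The following is only a sketch: find_files_scandir and its parameters are hypothetical names, and I have not benchmarked it against os.walk (which already uses scandir, so the difference may be small). It checks entry.name first, which needs no extra system call, and only then confirms that the entry is a file.

```python
import os

def find_files_scandir(path, filemask):
    """Yield paths of files under path whose name contains filemask (hypothetical helper)."""
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Recurse into real subdirectories; symlinked directories are not followed.
                yield from find_files_scandir(entry.path, filemask)
            elif filemask in entry.name and entry.is_file():
                yield entry.path
```

Because it is a generator, wrap it in list() if you need the whole list at once, e.g. filelist = list(find_files_scandir(starting_directory, filemask)).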
