What is the fastest way to recursively search for files in Python?

Question

I need to generate a list of files whose paths contain a certain string by searching recursively. This is how I do it at the moment:

import os
from glob import iglob
filelist = []
for i in iglob(os.path.join(starting_directory, '**', '*'), recursive=True):
    if filemask in os.path.basename(i):  # match on the last path component only, not on parent directories
        filelist.append(i)

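This works, but it is very slow (about 10 minutes) when crawling a large directory tree. We are on Windows, so calling out to the Unix find command is not an option. My understanding is that glob is faster than os.walk.

Is there a faster way?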

Answer 1

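Maybe not the answer you were hoping for, but I think these timings are useful. They were run on a directory tree containing 15,424 directories and 102,799 files in total (of which 3,059 are .py files).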

Python 3.6：

import os
import glob

def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                pys.append(file)
    return pys

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys

# I also tried pathlib.Path.glob but it was slow and error prone, sadly

%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
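
The %timeit lines above are IPython magics. Outside IPython, a roughly comparable measurement can be sketched with the standard timeit module. The helper name best_time and its default repeat counts are assumptions made for illustration; they are not part of the original benchmark.

```python
import timeit

def best_time(func, repeat=3, number=1):
    """Return the best wall-clock time in seconds for a single call of func."""
    # timeit.repeat returns one total time per repetition;
    # each total covers `number` calls of func.
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number
```

For example, best_time(walk) would time the walk() function defined above. This reports the minimum over the repetitions rather than the mean and standard deviation that %timeit prints, so the figures are not directly interchangeable.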

Using GNU find 4.6.0 (4.6.0-1) on Cygwin:

$ time find . -name '*.py' > /dev/null

real    0m8.827s
user    0m1.482s
sys     0m7.284s

It looks like os.walk is about as good as you can get.
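
Applied to the filter from the question, an os.walk version might look like the sketch below. The function name find_files is hypothetical; starting_directory and filemask are the names used in the question. It has not been timed against the tree described above.

```python
import os

def find_files(starting_directory, filemask):
    """Return paths of files under starting_directory whose name contains filemask."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(starting_directory):
        for name in filenames:
            # Only file names are tested, so a directory whose name
            # contains the mask does not produce a match by itself.
            if filemask in name:
                matches.append(os.path.join(dirpath, name))
    return matches
```

Because os.walk already separates directory names from file names, no path splitting is needed to look at the last component.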

Answer 2

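os.walk() uses scandir, which is the fastest option, and with scandir we get a file object (a DirEntry) that can be used for many other purposes; below, for example, I use it to get the modification time. The code below implements a recursive search using os.scandir():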

import os
import time

def scantree(path):
    """Recursively yield DirEntry objects for given directory."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

for entry in scantree('/home/'):
    if entry.is_file():
        print(entry.path, time.ctime(entry.stat().st_mtime))
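
One way to adapt this scandir approach to the filter from the question is sketched below. The name scan_matching is hypothetical, and skipping unreadable directories is an assumption about the desired behaviour, not something the answer specifies.

```python
import os

def scan_matching(path, filemask):
    """Yield DirEntry objects for files under path whose name contains filemask."""
    try:
        # The with-statement closes the scandir iterator promptly (Python 3.6+).
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    yield from scan_matching(entry.path, filemask)
                elif entry.is_file() and filemask in entry.name:
                    yield entry
    except PermissionError:
        # Assumption: directories that cannot be read are skipped
        # instead of aborting the whole scan.
        return
```

A list of paths like the one built in the question could then be produced with [entry.path for entry in scan_matching(starting_directory, filemask)].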

Answer 1: score 11, accepted, 2018-06-20 14:44:37

Answer 2: score 0, 2022-12-04 14:49:46