
How to use glob() to find files recursively?

This is what I have:

glob(os.path.join('src','*.c'))

but I want to search the subfolders of src. Something like this would work:

glob(os.path.join('src','*.c'))
glob(os.path.join('src','*','*.c'))
glob(os.path.join('src','*','*','*.c'))
glob(os.path.join('src','*','*','*','*.c'))

But this is obviously limited and clunky.

pathlib.Path.rglob

Use pathlib.Path.rglob from the pathlib module, which was introduced in Python 3.5.

from pathlib import Path

for path in Path('src').rglob('*.c'):
    print(path.name)

If you don't want to use pathlib, you can use glob.glob('**/*.c'), but don't forget to pass the recursive keyword parameter; note that it can take an inordinate amount of time on large directories.

For cases where you need to match files beginning with a dot (.), such as files in the current directory or hidden files on Unix-based systems, use the os.walk solution below.
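As a small, hedged illustration of that caveat (the hidden file below is hypothetical):

import glob

# Suppose src/ contains main.c and a hidden file src/.hidden.c.
# glob's wildcards do not match names starting with a dot, so the hidden
# file is silently skipped here, while the os.walk approach below sees it.
print(glob.glob('src/**/*.c', recursive=True))   # ['src/main.c']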

os.walk

For older Python versions, use os.walk to recursively walk a directory and fnmatch.filter to match against a simple expression:

import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk('src'):
    for filename in fnmatch.filter(filenames, '*.c'):
        matches.append(os.path.join(root, filename))

Similar to other solutions, but using fnmatch.fnmatch instead of glob, since os.walk has already listed the filenames:

import os, fnmatch


def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename


for filename in find_files('src', '*.c'):
    print('Found C source:', filename)

Also, using a generator allows you to process each file as it is found, instead of finding all the files and then processing them.

I've modified the glob module to support ** for recursive globbing, e.g.:

>>> import glob2
>>> all_c_files = glob2.glob('src/**/*.c')

https://github.com/miracle2k/python-glob2/

Useful when you want to provide your users with the ability to use the ** syntax, so os.walk() alone is not good enough.

For Python >= 3.5 you can use ** with recursive=True:

import glob
for f in glob.glob('/path/**/*.c', recursive=True):
    print(f)



If recursive is True, the pattern ** will match any files and zero or more directories and subdirectories. If the pattern is followed by an os.sep, only directories and subdirectories match.


Note:

Python 3.6 seems to default to recursive=True when using **, so it can be omitted.
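As a small illustration of the directory-only behaviour described in the quote above (the src layout is hypothetical):

import glob

# With recursive=True, '**' matches files and directories at any depth;
# a pattern that ends with a separator matches directories only.
everything = glob.glob('src/**', recursive=True)    # every file and directory under src (including src itself)
dirs_only = glob.glob('src/**/', recursive=True)    # directories only (note the trailing separator)

print(everything)
print(dirs_only)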


Starting with Python 3.4, one can use the glob() method of one of the Path classes in the new pathlib module, which supports ** wildcards. For example:

from pathlib import Path

for file_path in Path('src').glob('**/*.c'):
    print(file_path) # do whatever you need with these files

Update: Starting with Python 3.5, the same syntax is also supported by glob.glob().

import os
import fnmatch


def recursive_glob(treeroot, pattern):
    results = []
    for base, dirs, files in os.walk(treeroot):
        goodfiles = fnmatch.filter(files, pattern)
        results.extend(os.path.join(base, f) for f in goodfiles)
    return results

fnmatch gives you exactly the same patterns as glob, so this is really an excellent replacement for glob.glob, with very close semantics. An iterative version (e.g., a generator), IOW a replacement for glob.iglob, is a trivial adaptation (just yield the intermediate results as you go, instead of extending a single results list to return at the end).
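For concreteness, a minimal sketch of that generator variant (the function name is my own):

import fnmatch
import os

def recursive_iglob(treeroot, pattern):
    # Same walk as above, but yield each match as it is found
    # instead of collecting everything into one list.
    for base, dirs, files in os.walk(treeroot):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(base, f)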

You'll want to use os.walk to collect filenames that match your criteria. For example:

import os
cfiles = []
for root, dirs, files in os.walk('src'):
  for file in files:
    if file.endswith('.c'):
      cfiles.append(os.path.join(root, file))

Here's a solution with nested list comprehensions, os.walk, and simple suffix matching instead of glob:

import os
cfiles = [os.path.join(root, filename)
          for root, dirnames, filenames in os.walk('src')
          for filename in filenames if filename.endswith('.c')]

It can be compressed to a one-liner:

import os;cfiles=[os.path.join(r,f) for r,d,fs in os.walk('src') for f in fs if f.endswith('.c')]

or generalized as a function:

import os

def recursive_glob(rootdir='.', suffix=''):
    return [os.path.join(looproot, filename)
            for looproot, _, filenames in os.walk(rootdir)
            for filename in filenames if filename.endswith(suffix)]

cfiles = recursive_glob('src', '.c')

If you do need full glob-style patterns, you can follow Alex's and Bruno's example and use fnmatch:

import fnmatch
import os

def recursive_glob(rootdir='.', pattern='*'):
    return [os.path.join(looproot, filename)
            for looproot, _, filenames in os.walk(rootdir)
            for filename in filenames
            if fnmatch.fnmatch(filename, pattern)]

cfiles = recursive_glob('src', '*.c')

Consider pathlib.Path.rglob().

This is like calling Path.glob() with "**/" added in front of the given relative pattern:

import pathlib


for p in pathlib.Path("src").rglob("*.c"):
    print(p)

See also @taleinat's related post here and a similar post elsewhere.

import os, glob

for each in glob.glob('path/**/*.c', recursive=True):
    print(f'Name with path: {each} \nName without path: {os.path.basename(each)}')
  1. glob.glob('*.c'): matches all files ending in .c in the current directory
  2. glob.glob('*/*.c'): matches all files ending in .c in the immediate subdirectories only, but not in the current directory
  3. glob.glob('**/*.c'): same as 2 (without recursive=True, ** behaves like a single *)
  4. glob.glob('*.c', recursive=True): same as 1
  5. glob.glob('*/*.c', recursive=True): same as 2
  6. glob.glob('**/*.c', recursive=True): matches all files ending in .c in the current directory and in all subdirectories (see the sketch after this list)
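A minimal sketch illustrating these cases, assuming a hypothetical layout with a.c in the current directory, sub/b.c one level down, and sub/deep/c.c two levels down:

import glob

print(glob.glob('*.c'))                      # ['a.c']
print(glob.glob('**/*.c'))                   # ['sub/b.c'] -- without recursive=True, ** acts like *
print(glob.glob('**/*.c', recursive=True))   # a.c, sub/b.c and sub/deep/c.c (order may vary)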

Recently I had to recover my pictures with the extension .jpg. I ran photorec and recovered 4579 directories with 2.2 million files inside, having a tremendous variety of extensions. With the script below I was able to select the 50133 files having the .jpg extension within minutes:

#!/usr/bin/env python2.7

import glob
import shutil
import os

src_dir = "/home/mustafa/Masaüstü/yedek"
dst_dir = "/home/mustafa/Genel/media"
for mediafile in glob.iglob(os.path.join(src_dir, "*", "*.jpg")): #"*" is for subdirectory
    shutil.copy(mediafile, dst_dir)

Based on other answers, this is my current working implementation, which retrieves nested XML files in a root directory:

import glob
import os

files = []
for root, dirnames, filenames in os.walk(myDir):  # myDir is the root directory to search
    files.extend(glob.glob(root + "/*.xml"))

I'm really having fun with Python :)

Johan and Bruno provide excellent solutions for the minimal requirement as stated. I have just released Formic, which implements Ant FileSet and Globs and can handle this and more complicated scenarios. An implementation of your requirement is:

import formic
fileset = formic.FileSet(include="/src/**/*.c")
for file_name in fileset.qualified_files():
    print file_name

For Python 3.5 and later:

import glob

#file_names_array = glob.glob('path/*.c', recursive=True)
#above works for files directly at path/ as guided by NeStack

#updated version
file_names_array = glob.glob('path/**/*.c', recursive=True)

Further, you might need:

for full_path_in_src in file_names_array:
    print(full_path_in_src)  # will be like 'abc/xyz.c'
    # the full system path would be like 'path till src/abc/xyz.c'

In case this may interest anyone, I've profiled the top three proposed methods. I have about ~500K files in the globbed folder (in total), and 2K files that match the desired pattern.

Here's the (very basic) code:

import glob
import json
import fnmatch
import os
from pathlib import Path
from time import time


def find_files_iglob():
    return glob.iglob("./data/**/data.json", recursive=True)


def find_files_oswalk():
    for root, dirnames, filenames in os.walk('data'):
        for filename in fnmatch.filter(filenames, 'data.json'):
            yield os.path.join(root, filename)

def find_files_rglob():
    return Path('data').rglob('data.json')

t0 = time()
for f in find_files_oswalk(): pass    
t1 = time()
for f in find_files_rglob(): pass
t2 = time()
for f in find_files_iglob(): pass 
t3 = time()
print(t1-t0, t2-t1, t3-t2)

And the results I got were:
os_walk: ~3.6 sec
rglob: ~14.5 sec
iglob: ~16.9 sec

The platform: Ubuntu 16.04, x86_64 (Core i7).

Or with a list comprehension:

>>> import os
>>> root = r"c:\User\xtofl"
>>> binfiles = [os.path.join(base, f)
...             for base, _, files in os.walk(root)
...             for f in files if f.endswith(".jpg")]

Another way to do it, using just the glob module. Just seed the rglob function with a starting base directory and a pattern to match, and it will return a list of matching file names.

import glob
import os

def _getDirs(base):
    return [x for x in glob.iglob(os.path.join(base, '*')) if os.path.isdir(x)]

def rglob(base, pattern):
    results = []
    results.extend(glob.glob(os.path.join(base, pattern)))
    for d in _getDirs(base):
        # each entry returned by _getDirs already includes the base path
        results.extend(rglob(d, pattern))
    return results
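A short usage sketch, assuming a src tree like the one in the question:

c_files = rglob('src', '*.c')
for name in c_files:
    print(name)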

Just made this; it will print files and directories hierarchically.

But I didn't use fnmatch or walk.

#!/usr/bin/python

import os,glob,sys

def dirlist(path, c=1):
    for i in glob.glob(os.path.join(path, "*")):
        if os.path.isfile(i):
            filepath, filename = os.path.split(i)
            print('----' * c + filename)
        elif os.path.isdir(i):
            dirname = os.path.basename(i)
            print('----' * c + dirname)
            c += 1
            dirlist(i, c)
            c -= 1


path = os.path.normpath(sys.argv[1])
print(os.path.basename(path))
dirlist(path)

This one uses fnmatch or a regular expression:

import fnmatch, os

def filepaths(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            try:
                matched = pattern.match(basename)
            except AttributeError:
                matched = fnmatch.fnmatch(basename, pattern)
            if matched:
                yield os.path.join(root, basename)

# usage
if __name__ == '__main__':
    from pprint import pprint as pp
    import re
    path = r'/Users/hipertracker/app/myapp'
    pp([x for x in filepaths(path, re.compile(r'.*\.py$'))])
    pp([x for x in filepaths(path, '*.py')])

In addition to the suggested answers, you can do this with some lazy generation and list comprehension magic:

import os, glob, itertools

results = itertools.chain.from_iterable(
    glob.iglob(os.path.join(root, '*.c'))
    for root, dirs, files in os.walk('src'))

for f in results: print(f)

Besides fitting in one line and avoiding unnecessary lists in memory, this also has the nice side effect that you can use it in a way similar to the ** operator; for example, you could use os.path.join(root, 'some/path/*.c') in order to get all .c files in all subdirectories of src that have this structure.
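A hedged sketch of that variant; the 'include/*.h' pattern is only illustrative:

import glob
import itertools
import os

# Collect headers that live in an include/ directory at any depth below src.
results = itertools.chain.from_iterable(
    glob.iglob(os.path.join(root, 'include', '*.h'))
    for root, dirs, files in os.walk('src'))

for f in results:
    print(f)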

This is working code for Python 2.7. As part of my DevOps work, I was required to write a script which would move the config files marked with live-appName.properties to appName.properties. There could be other extension files as well, like live-appName.xml.

Below is working code for this, which finds the files in the given directories (nested levels) and then renames (moves) them to the required filename:

import fnmatch
import os
import shutil

def flipProperties(searchDir):
    print "Flipping properties to point to live DB"
    for root, dirnames, filenames in os.walk(searchDir):
        for filename in fnmatch.filter(filenames, 'live-*.*'):
            targetFileName = os.path.join(root, filename.split("live-")[1])
            print "File " + os.path.join(root, filename) + " will be moved to " + targetFileName
            shutil.move(os.path.join(root, filename), targetFileName)

This function is called from a main script:

flipProperties(searchDir)

Hope this helps someone struggling with similar issues.

Simplified version of Johan Dahlin's answer, without fnmatch.

import os

matches = []
for root, dirnames, filenames in os.walk('src'):
  matches += [os.path.join(root, f) for f in filenames if f[-2:] == '.c']

Here is my solution using list comprehension to search for multiple file extensions recursively in a directory and all subdirectories:

import os, glob

def _globrec(path, *exts):
    r""" Glob recursively a directory and all subdirectories for multiple file extensions.
    Note: glob is case-insensitive, i.e. for '\*.jpg' you will get files ending
    with .jpg and .JPG

    Parameters
    ----------
    path : str
        A directory name
    exts : tuple
        File extensions to glob for

    Returns
    -------
    files : list
        List of files matching extensions in exts in path and subfolders

    """
    dirs = [a[0] for a in os.walk(path)]
    f_filter = [d + e for d in dirs for e in exts]
    return [f for files in [glob.iglob(files) for files in f_filter] for f in files]

my_pictures = _globrec(r'C:\Temp', r'\*.jpg', r'\*.bmp', r'\*.png', r'\*.gif')
for f in my_pictures:
    print(f)

If the files are on a remote file system or inside an archive, you can use an implementation of the fsspec AbstractFileSystem class. For example, to list all the files in a zipfile:

from fsspec.implementations.zip import ZipFileSystem
fs = ZipFileSystem("/tmp/test.zip")
fs.glob("/**")  # equivalent: fs.find("/")

or to list all the files in a publicly available S3 bucket:

from s3fs import S3FileSystem
fs_s3 = S3FileSystem(anon=True)
fs_s3.glob("noaa-goes16/ABI-L1b-RadF/2020/045/**")  # or use fs_s3.find

You can also use it for a local filesystem, which may be interesting if your implementation should be filesystem-agnostic:

from fsspec.implementations.local import LocalFileSystem
fs = LocalFileSystem()
fs.glob("/tmp/test/**")

Other implementations include Google Cloud, GitHub, SFTP/SSH, Dropbox, and Azure. For details, see the fsspec API documentation.

import sys, os, glob

dir_list = ["c:\\books\\heap"]

while len(dir_list) > 0:
    cur_dir = dir_list[0]
    del dir_list[0]
    list_of_files = glob.glob(cur_dir+'\\*')
    for book in list_of_files:
        if os.path.isfile(book):
            print(book)
        else:
            dir_list.append(book)

I modified the top answer in this posting, and recently created this script which will loop through all files in a given directory (searchdir) and the sub-directories under it, and print filename, rootdir, modified/creation date, and size.

Hope this helps someone... and they can walk the directory and get file info.

import time
import fnmatch
import os

def fileinfo(file):
    filename = os.path.basename(file)
    rootdir = os.path.dirname(file)
    lastmod = time.ctime(os.path.getmtime(file))
    creation = time.ctime(os.path.getctime(file))
    filesize = os.path.getsize(file)

    print "%s**\t%s\t%s\t%s\t%s" % (rootdir, filename, lastmod, creation, filesize)

searchdir = r'D:\Your\Directory\Root'
matches = []

for root, dirnames, filenames in os.walk(searchdir):
    ##  for filename in fnmatch.filter(filenames, '*.c'):
    for filename in filenames:
        ##      matches.append(os.path.join(root, filename))
        ##print matches
        fileinfo(os.path.join(root, filename))

Here is a solution that will match the pattern against the full path and not just the base filename.

It uses fnmatch.translate to convert a glob-style pattern into a regular expression, which is then matched against the full path of each file found while walking the directory.

re.IGNORECASE is optional, but desirable on Windows since the file system itself is not case-sensitive. (I didn't bother compiling the regex because the docs indicate it should be cached internally.)

import fnmatch
import os
import re

def findfiles(dir, pattern):
    patternregex = fnmatch.translate(pattern)
    for root, dirs, files in os.walk(dir):
        for basename in files:
            filename = os.path.join(root, basename)
            if re.search(patternregex, filename, re.IGNORECASE):
                yield filename
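A brief usage sketch; the directory name in the pattern is hypothetical and only there to show that the whole path is matched:

for path in findfiles('src', '*drivers*/*.c'):
    print(path)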

I needed a solution for Python 2.x that works fast on large directories.
I ended up with this:

import subprocess
foundfiles= subprocess.check_output("ls src/*.c src/**/*.c", shell=True)
for foundfile in foundfiles.splitlines():
    print foundfile

Note that you might need some exception handling in case ls doesn't find any matching file.
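A hedged sketch of that exception handling; check_output raises CalledProcessError when ls exits with a non-zero status (for example, when nothing matches):

import subprocess

try:
    foundfiles = subprocess.check_output("ls src/*.c src/**/*.c", shell=True)
except subprocess.CalledProcessError:
    foundfiles = b""
# note: whether '**' recurses here depends on the shell (bash needs 'shopt -s globstar')

for foundfile in foundfiles.splitlines():
    print(foundfile)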
