“find . -regex …” in Python or How to find files whose whole name (path + name) matches a regular expression?

I would like to find files whose whole name (relative, although absolute is nice too) matches a given regular expression (i.e., like the glob module, but for regex matches instead of shell wildcard matches). Using find, one would do, for example:

find . -regex './foo/\w+/bar/[0-9]+-\w+\.dat'

Of course, I could use find via os.system(...) or os.exec*(...), but I'm looking for a pure Python solution. The following code, which combines os.walk(...) with regular expressions from the re module, is an easy Python solution. (It's not robust and misses many (not-so-corner-ish) corner cases, but it is good enough for my single-use purpose: locating specific data files for a one-time database insertion.)

import os
import re

def find(regex, top='.'):
    matcher = re.compile(regex)
    for dirpath, dirnames, filenames in os.walk(top):
        for f in filenames:
            f = os.path.relpath(os.path.join(dirpath, f), top)
            if matcher.match(f):
                yield f

if __name__=="__main__":
    top = "."
    # Raw string so the backslashes reach the regex engine; \. matches a literal dot.
    regex = r"foo/\w+/bar/\d+-\w+\.dat"
    for f in find(regex, top):
        print(f)

But this is inefficient. Subtrees whose contents cannot match the regex (e.g., ./foo/\w+/baz/, to continue the example from above) are unnecessarily walked. Ideally, these subtrees should be pruned from the walk; any sub-directory whose path name is not a partial match for the regex should not be traversed. (I would guess that GNU find implements such an optimization, but I have not confirmed this through tests or source-code perusal.)

Does anyone know of a Python implementation of a robust regex-based find, ideally with subtree-pruning optimization? I'm hoping that I'm just missing a method in the os.path module or some third-party module.

From help(os.walk):

When topdown is true, the caller can modify the dirnames list in-place (eg, via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search...

So once a subdirectory (listed in dirnames) is determined to be inadmissible, it should be deleted from dirnames. This will produce the subtree pruning you are looking for. (Just be sure to del items from dirnames from the tail end first, so you don't change the index of the remaining items to be deleted.)

import os
import re

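def prune(regex,top='.'):
    sep = os.path.sep
    matcher = re.compile(regex)
    # Assumed reconstruction: split the regex into one piece per directory level.
    pieces = regex.split(sep)
    # partial_matchers[i] matches a relative path that is i+1 levels deep.
    partial_matchers = [re.compile(sep.join(pieces[:i+1]))
                        for i in range(len(pieces))]
    for root, dirs, files in os.walk(top,topdown=True):
        for i in reversed(range(len(dirs))):
            dirname=os.path.relpath(os.path.join(root,dirs[i]), top)
            dirlevel = dirname.count(sep)  # assumed: depth of dirname below top
            # print(dirname,dirlevel,sep.join(pieces[:dirlevel+1]))
            # Directories deeper than the regex has levels cannot hold a match.
            if (dirlevel >= len(partial_matchers)
                    or not partial_matchers[dirlevel].match(dirname)):
                print('pruning {0}'.format(dirname))
                del dirs[i]

        for filename in files:
            # Match against the path relative to top, not the bare file name.
            filename = os.path.relpath(os.path.join(root,filename), top)
            # print('checking {0}'.format(filename))
            if matcher.match(filename):
                print(filename)  # assumed: report each match

if __name__=='__main__':
    prune(r'foo/\w+/bar/\d+-\w+\.dat')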

Running the script with a directory structure like this:

~/test% tree .
|-- foo
|   `-- baz
|       |-- bad
|       |   |-- bad1.txt
|       |   `-- badbad
|       |       `-- bad2.txt
|       `-- bar
|           |-- 1-good.dat
|           `-- 2-good.dat
`-- tmp
    |-- 000.png
    |-- 001.png
    `-- output.gif


pruning tmp
pruning foo/baz/bad

If you uncomment the "checking" print statement, it is clear the pruned directories are not walked.
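
The same pruning idea can also be written as a generator that uses slice assignment on dirnames instead of deleting by index. What follows is only a sketch under stated assumptions, not the code from the answer above: the name find_pruned is hypothetical, the regex is assumed to use '/' between path components, and no single component pattern may itself contain '/' or a top-level alternation. It needs Python 3.4 or later for fullmatch.

```python
import os
import re


def find_pruned(regex, top='.'):
    """Yield paths relative to `top` that fully match `regex`, pruning while walking.

    Assumption: `regex` separates path components with '/', and no component
    pattern contains '/' itself.
    """
    pieces = regex.split('/')
    full = re.compile(regex)
    # prefixes[i] matches a relative directory path that has i+1 components.
    prefixes = [re.compile('/'.join(pieces[:i + 1]))
                for i in range(len(pieces) - 1)]

    for root, dirs, files in os.walk(top, topdown=True):
        rel_root = os.path.relpath(root, top).replace(os.sep, '/')
        base = [] if rel_root == '.' else rel_root.split('/')
        depth = len(base)

        def keep(name):
            # A directory deeper than the pattern can never contain a match.
            if depth >= len(prefixes):
                return False
            candidate = '/'.join(base + [name])
            return prefixes[depth].fullmatch(candidate) is not None

        # Slice assignment changes the very list that os.walk recurses into.
        dirs[:] = [d for d in dirs if keep(d)]

        for name in files:
            rel = '/'.join(base + [name])
            if full.fullmatch(rel):
                yield rel
```

Calling list(find_pruned(r'foo/\w+/bar/\d+-\w+\.dat', '.')) on the test tree above should never descend into tmp or foo/baz/bad. Using fullmatch rather than match also means that a directory named foobar is not mistaken for foo.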

I wrote a function select_walk() to search for and select files in a tree of directories.

In the following example, the files searched for have the extensions .dat, .rtf or .jpeg and sit in directories whose paths match the following regex pattern:

J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)

Note the presence of a conditional elementary pattern:

(?(1)TURI\1\d*|MONO\d+)

with group references (1) and \1 to the number-matching group (\d+) in the elementary pattern b[ae]r(\d+)?

1 )

Here's code that creates the directory tree used as the example:

(take care, it first deletes directories 'foo\','fooo\','froooo\','faooo\' before creating them)

import os
from shutil import rmtree

top = 'J:\\'

for x in ('foo\\','fooo\\','froooo\\','faooo\\'):
    if os.path.isdir(top + x):
        rmtree(top + x)

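# Entries after the first one are reconstructed from the tree listed below.
li = [('foo\\',('basil\\','poto%\\','tamata\\')),
      ('foo\\basil\\',('ber89\\','ber300\\')),
      ('foo\\basil\\ber89\\',('TURI850\\','TURI1023\\')),
      ('foo\\poto%\\',('ocean\\','earth\\')),
      ('foo\\tamata\\',('vahine\\',)),
      ('fooo\\',('york#\\','plain\\','atlantis\\')),
      ('fooo\\york#\\',('noto\\','nata\\')),
      ('fooo\\plain\\',('zx13ao\\','ws89rt\\','bar999\\')),
      ('fooo\\plain\\bar999\\',('TURI99905\\','TURI2227\\','MONO2\\')),
      ('fooo\\plain\\bar999\\TURI99905\\',('AERIAL\\','minidisc\\')),
      ('fooo\\plain\\bar999\\TURI99905\\AERIAL\\',('bumbum\\','corean\\')),
      ('fooo\\atlantis\\',('atlABC\\','atlDEFG\\')),
      ('fooo\\atlantis\\atlABC\\',('atlantis_sound\\','atlantis_image\\')),
      ('froooo\\',('one_dir\\','another_dir\\')),
      ('froooo\\one_dir\\',('bar25\\','ber\\')),
      ('froooo\\one_dir\\bar25\\',('TURI2501\\','TURI2502\\','TURI4813\\','MONO8\\')),
      ('froooo\\one_dir\\ber\\',('TURI30\\','TURI\\','MONO532\\')),
      ('froooo\\another_dir\\',('notseen\\','notseen2\\')),
      ('faooo\\',('somolo-\\','samala+\\'))]
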
for rep,several in li:
    #print top + rep
    if os.path.isdir(top + rep) == False:
        os.mkdir(top + rep)

    for name in several:
        #print top + rep + name
        os.mkdir(top + rep + name)

for filepath in (top + 'foo\\kalaomi.xls',
                 top + 'foo\\basil\\ber89\\TURI850\\quetzal.jpeg',
                 top + 'foo\\basil\\ber89\\TURI850\\tehoi.txt',
                 top + 'foo\\poto%\\curcuma in poto%.txt',
                 top + 'foo\\poto%\\ocean\\file in ocean.rtf',
                 top + 'foo\\tamata\\vahine\\tahiti.jpeg',
                 top + 'fooo\\york#\\yorkshire.jpeg',
                 top + 'fooo\\plain\\bar999\\TURI99905\\galileo.jpeg',
                 top + 'fooo\\plain\\bar999\\TURI99905\\polynesia.dat',
                 top + 'fooo\\plain\\bar999\\TURI99905\\concrete.txt',
                 top + 'fooo\\plain\\bar999\\TURI2227\\Monroe.jpeg',
                 top + 'fooo\\plain\\bar999\\MONO2\\elastic.jpeg',
                 top + 'froooo\\one_dir\\photo in one_dir.jpeg',
                 top + 'froooo\\one_dir\\tabula.xls',
                 top + 'froooo\\one_dir\\bar25\\TURI2501\\matallelo.jpeg',
                 top + 'froooo\\one_dir\\bar25\\TURI2501\\italy.dat',
                 top + 'froooo\\one_dir\\bar25\\TURI2501\\beretta.xls',
                 top + 'froooo\\one_dir\\bar25\\TURI2501\\turi2501_ser.rtf',
                 top + 'froooo\\one_dir\\bar25\\TURI4813\\boaf_inTURI4813.jpeg',
                 top + 'froooo\\one_dir\\bar25\\TURI4813\\troui_in_TURI4813.txt',
                 top + 'froooo\\one_dir\\bar25\\MONO8\\in_mono8.dat',
                 top + 'froooo\\one_dir\\bar25\\MONO8\\in_mono8.rtf',
                 top + 'froooo\\one_dir\\bar25\\MONO8\\in_mono8.xls',
                 top + 'froooo\\one_dir\\bar25\\TURI2502\\adamante.jpeg',
                 top + 'froooo\\one_dir\\bar25\\TURI2502\\egyptic.txt',
                 top + 'froooo\\one_dir\\bar25\\TURI2502\\urubu.rtf',
                 top + 'froooo\\one_dir\\ber\\MONO532\\bacillus.jpeg',
                 top + 'froooo\\one_dir\\ber\\MONO532\\blueberry.dat',
                 top + 'froooo\\one_dir\\ber\\MONO532\\Perfume.doc',
                 top + 'faooo\\samala+\\kfaz.dat',
                 top + 'faooo\\somolo-\\ytek.rtf',
                 top + 'faooo\\123.txt',
                 top + 'faooo\\458.rtf',):
    with open(filepath,'w') as f:
        pass  # only create an empty file

This code creates the following tree:

|--foo
|   |--basil
|      |--ber89
|         |--TURI850
|            |--file quetzal.jpeg
|            |--file tehoi.txt
|         |--TURI1023
|      |--ber300
|   |--poto%
|      |--ocean
|         |--file in ocean.rtf
|      |--earth
|      |--file curcuma in poto%.txt
|   |--tamata
|      |--vahine
|         |--file tahiti.jpeg
|   |--file kalaomi.xls
|--fooo
|  |--york#
|     |--noto
|     |--nata
|     |--file yorkshire.jpeg
|  |--plain
|     |--zx13ao
|     |--ws89rt
|     |--bar999
|        |--TURI99905
|           |--AERIAL
|              |--bumbum
|              |--corean
|           |--minidisc
|           |--file galileo.jpeg
|           |--file polynesia.dat
|           |--file concrete.txt
|        |--TURI2227
|           |--file Monroe.jpeg
|        |--MONO2
|           |--file elastic.jpeg
|  |--atlantis
|     |--atlABC
|        |--atlantis_sound
|        |--atlantis_image
|     |--atlDEFG
|--froooo
|  |--one_dir
|     |--bar25
|        |--TURI2501
|           |--file matallelo.jpeg
|           |--file italy.dat
|           |--file beretta.xls
|           |--file turi2501_ser.rtf
|        |--TURI2502
|           |--file adamante.jpeg
|           |--file egyptic.txt
|           |--file urubu.rtf
|        |--TURI4813
|           |--file boaf_inTURI4813.jpeg
|           |--file troui_in_TURI4813.txt
|        |--MONO8
|           |--file in_mono8.dat
|           |--file in_mono8.rtf
|           |--file in_mono8.xls
|     |--ber
|        |--TURI30
|        |--TURI
|        |--MONO532
|           |--file bacillus.jpeg
|           |--file blueberry.dat
|           |--file Perfume.doc
|     |--file photo in one_dir.jpeg
|     |--file tabula.xls
|  |--another_dir
|     |--notseen
|     |--notseen2
|--faooo
|  |--somolo-
|     |--file ytek.rtf
|  |--samala+
|     |--file kfaz.dat
|  |--file 123.txt
|  |--file 458.rtf

The regex pattern that matches the files is:

J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)

and the directories selectively explored to search for this kind of files will be the following ones:



2 )

As a preliminary demonstration, here's code showing how the part of select_walk() that builds the regexes works. These regexes are what allow the walk to explore only selected directories and to return only selected files:

import re

def compute_regexes(pat_file, displ = True):
    from os import sep

    splitted_pat = re.split(r'\\\\' if sep=='\\' else '/', pat_file)

    pat_parent_dir = (r'\\' if sep=='\\' else '/').join(splitted_pat[0:-1])

    if displ:
        print ('IN FUNCTION compute_regexes() :'
               '\n\npat_file== %s'
               '\n\nsplitted_pat :\n%s'
               '\n\npat_parent_dir== %s\n') \
              % (pat_file , '\n'.join(splitted_pat) , pat_parent_dir)

    dgr = {}
    for i,el in enumerate(splitted_pat):
        if re.search('\(.*?\)',el):
            dgr[len(dgr)+1] = i
    if displ:
        print 'dgr :'
        print '\n'.join('group(%s) is in splitted_pat[%s]' % (g,i)
                        for g,i in dgr.iteritems())

    def repl(mat, dgr = dgr):
        the = int(mat.group(1) if mat.group(1) else mat.group(2))
        return str(the + dgr[the])

    for i,el in enumerate(splitted_pat):
        splitted_pat[i] = re.sub(r'(?<=\(\?\()(\d+)(?=\))|(?<=\\)(\d+)',repl,el)

    pat_dirs = ''
    for x in splitted_pat[-2:0:-1]:
        pat_dirs = r'(?=\\|\Z)(\\%s%s)?' % (x,pat_dirs)
    pat_dirs = splitted_pat[0] + pat_dirs
    if displ:
        print '\npat_dirs==',pat_dirs

    return (re.compile(pat_file), re.compile(pat_dirs), re.compile(pat_parent_dir) )

pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
regx_file, regx_dirs, regx_parent_dir = compute_regexes(pat_file)

print '\n\nEXAMPLES with regx_file :\n'
print 'pat_file==',pat_file
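for filepath in ('J:\\fooo\\basil\\ber92\\TURI9258\\beru.rtf  ',
                 'J:\\froooooo\\ki_ki\\bar\\MONO47\\madrid.jpeg  '):
    print filepath,bool(regx_file.match(filepath))

print '\n\nEXAMPLES with regx_dirs :\n'
# The paths after the first one are restored from the output listed below.
for path in ('J:\\fooo',
             'J:\\fooo\\basil',
             'J:\\fooo\\basil\\ber92',
             'J:\\fooo\\basil\\ber92\\TURI777',
             'J:\\fooo\\basil\\ber92\\TURI9258',
             'J:\\froooooo',
             'J:\\froooooo\\ki_ki',
             'J:\\froooooo\\ki_ki\\bar',
             'J:\\froooooo\\ki=ki\\bar',
             'J:\\froooooo\\ki_ki\\bar\\MONO47'):
    # A directory is acceptable only if regx_dirs consumes the whole path.
    print path,("   : ~~ this dir's name is OK ~~" if path==regx_dirs.match(path).group()
                else "   : ## this dir's name doesn't match ##")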

The function compute_regexes() first splits the original pat_file pattern into elements, each meant to match the name of one directory level in a path.

Then it computes:

  • a regex pattern pat_dirs that matches the successive levels of the directories leading to a wanted file

  • a regex pattern pat_parent_dir that matches the direct parent directory of a wanted file


The processing that involves dgr and the function repl() is a refinement: it lets compute_regexes() handle group references (that is, special sequences such as \1 and \2, and conditionals such as (?(1)...)) by renumbering them, so that they still point to the right groups after the extra parentheses added to build pat_dirs have shifted the group numbers.
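
One way to see why the renumbering is needed is to look at what happens when it is avoided. The sketch below is an alternative, not the method used in this answer: the helper names nested_dirs_pattern and is_candidate_dir are hypothetical, the added parentheses are non-capturing (?:...) so existing group numbers never shift, and re.fullmatch (Python 3.4 or later) takes the place of both the lookaheads and the comparison against group().

```python
import re


def nested_dirs_pattern(parts, sep=r'\\'):
    """Build one regex matching any leading run of path components.

    `parts` holds one regex fragment per path level and `sep` is the regex
    for the separator. The groups added here are non-capturing, so group
    references inside the fragments keep their original numbers.
    """
    pattern = ''
    for part in reversed(parts[1:]):
        pattern = '(?:%s%s%s)?' % (sep, part, pattern)
    return parts[0] + pattern


pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
# Split on the escaped separator and drop the file-name element.
parts = re.split(r'\\\\', pat_file)[:-1]
regx_dirs = re.compile(nested_dirs_pattern(parts))


def is_candidate_dir(path):
    """True if `path` could still lead to a file matching pat_file."""
    return regx_dirs.fullmatch(path) is not None
```

With this variant, is_candidate_dir('J:\\fooo\\basil\\ber92\\TURI9258') is true and is_candidate_dir('J:\\fooo\\basil\\ber92\\TURI777') is false, which agrees with the regx_dirs examples in this answer. The trade-off is that the nesting levels themselves can no longer be read back as groups.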

Result of this code:

IN FUNCTION compute_regexes() :

pat_file== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)

splitted_pat :
J:
f[ruv]?o+
\w+
b[ae]r(\d+)?
(?(1)TURI\1\d*|MONO\d+)
\w+\.(dat|rtf|jpeg)

pat_parent_dir== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)

dgr :
group(1) is in splitted_pat[3]
group(2) is in splitted_pat[4]
group(3) is in splitted_pat[5]

pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\b[ae]r(\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?

EXAMPLES with regx_file :

pat_file== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)
J:\fooo\basil\ber92\TURI9258\beru.rtf   True
J:\froooooo\ki_ki\bar\MONO47\madrid.jpeg   True

EXAMPLES with regx_dirs :

J:\fooo    : ~~ this dir's name is OK ~~
J:\fooo\basil    : ~~ this dir's name is OK ~~
J:\fooo\basil\ber92    : ~~ this dir's name is OK ~~
J:\fooo\basil\ber92\TURI777    : ## this dir's name doesn't match ##
J:\fooo\basil\ber92\TURI9258    : ~~ this dir's name is OK ~~
J:\froooooo    : ~~ this dir's name is OK ~~
J:\froooooo\ki_ki    : ~~ this dir's name is OK ~~
J:\froooooo\ki_ki\bar    : ~~ this dir's name is OK ~~
J:\froooooo\ki=ki\bar    : ## this dir's name doesn't match ##
J:\froooooo\ki_ki\bar\MONO47    : ~~ this dir's name is OK ~~



3 )

Finally, here's the function select_walk(), which does the job of searching a tree for files whose paths match a given regex:
it yields triples (dirpath, dirnames, filenames) like those produced by os.walk(), but only for directories that are direct parents of wanted files, and with the filenames list reduced to the names matching pat_file.

Of course, during the iteration, select_walk() doesn't explore directories whose names rule out any match with the key regex pattern pat_file.

import os
import re

def select_walk(pat_file,start_dir):

    from os import sep

    splitted_pat = re.split(r'\\\\' if sep=='\\' else '/', pat_file)

    pat_parent_dir = (r'\\' if sep=='\\' else '/').join(splitted_pat[0:-1])

    dgr = {}
    for i,el in enumerate(splitted_pat):
        if re.search('\(.*?\)',el):
            dgr[len(dgr)+1] = i

    def repl(mat, dgr = dgr):
        the = int(mat.group(1) if mat.group(1) else mat.group(2))
        return str(the + dgr[the])

    for i,el in enumerate(splitted_pat):
        splitted_pat[i] = re.sub(r'(?<=\(\?\()(\d+)(?=\))|(?<=\\)(\d+)',repl,el)

    pat_dirs = ''
    for x in splitted_pat[-2:0:-1]:
        pat_dirs = r'(?=\\|\Z)(\\%s%s)?' % (x,pat_dirs)
    pat_dirs = splitted_pat[0] + pat_dirs
    print 'pat_dirs==',pat_dirs

    regx_file = re.compile(pat_file)
    regx_dirs = re.compile(pat_dirs)
    regx_parent_dir = re.compile(pat_parent_dir)

    start_dir = start_dir.rstrip(sep) + sep
    print '\nstart_dir == '+start_dir

    for dirpath,dirnames,filenames in os.walk(start_dir):

        dirpath = dirpath.rstrip(sep)
        print '\n'.join(('explored dirpath : %s    is_direct_parent: %s' \
                         % (dirpath,('NO','YES')[bool(regx_parent_dir.match(dirpath))]),
                         '           dirnames  : %s' % dirnames,
                         '          filenames  : %s' % filenames))

        if regx_parent_dir.match(dirpath):
            filenames[:] = [filename for filename in filenames
                            if regx_file.match(dirpath + sep + filename)]
            dirnames[:] = []
            print '\n'.join(('           dirnames  : not to be explored ' ,
                             '  yielded filenames  : %s\n' % filenames)) 
            yield (dirpath,dirnames,filenames)
        else:
            # Not a direct parent: keep only the sub-directories that can still lead to a match.
            dirnames[:] = [dirname for dirname in dirnames
                           if regx_dirs.match(dirpath + sep + dirname).group()==dirpath + sep + dirname]
            print '\n'.join(('dirnames to explore  : %s ' % dirnames,
                             '          filenames  : not to be yielded\n')) 

pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
print '\n\nSELECTED (dirpath, dirnames, filenames) :\n' + '\n'.join(map(repr, select_walk(pat_file,'J:\\')))


pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\b[ae]r(\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?

start_dir == J:\
explored dirpath : J:    is_direct_parent: NO
           dirnames  : ['Amazon', 'faooo', 'Favorites', 'foo', 'fooo', 'froooo', 'Python', 'RECYCLER', 'System Volume Information']
          filenames  : ['image00.pfm', 'rep.py']
dirnames to explore  : ['foo', 'fooo', 'froooo'] 
          filenames  : not to be yielded

explored dirpath : J:\foo    is_direct_parent: NO
           dirnames  : ['basil', 'poto%', 'tamata']
          filenames  : ['kalaomi.xls']
dirnames to explore  : ['basil', 'tamata'] 
          filenames  : not to be yielded

explored dirpath : J:\foo\basil    is_direct_parent: NO
           dirnames  : ['ber300', 'ber89']
          filenames  : []
dirnames to explore  : ['ber300', 'ber89'] 
          filenames  : not to be yielded

explored dirpath : J:\foo\basil\ber300    is_direct_parent: NO
           dirnames  : []
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\foo\basil\ber89    is_direct_parent: NO
           dirnames  : ['TURI1023', 'TURI850']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\foo\tamata    is_direct_parent: NO
           dirnames  : ['vahine']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\fooo    is_direct_parent: NO
           dirnames  : ['atlantis', 'plain', 'york#']
          filenames  : []
dirnames to explore  : ['atlantis', 'plain'] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\atlantis    is_direct_parent: NO
           dirnames  : ['atlABC', 'atlDEFG']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\plain    is_direct_parent: NO
           dirnames  : ['bar999', 'ws89rt', 'zx13ao']
          filenames  : []
dirnames to explore  : ['bar999'] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\plain\bar999    is_direct_parent: NO
           dirnames  : ['MONO2', 'TURI2227', 'TURI99905']
          filenames  : []
dirnames to explore  : ['TURI99905'] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\plain\bar999\TURI99905    is_direct_parent: YES
           dirnames  : ['AERIAL', 'minidisc']
          filenames  : ['concrete.txt', 'galileo.jpeg', 'polynesia.dat']
           dirnames  : not to be explored 
  yielded filenames  : ['galileo.jpeg', 'polynesia.dat']

explored dirpath : J:\froooo    is_direct_parent: NO
           dirnames  : ['another_dir', 'one_dir']
          filenames  : []
dirnames to explore  : ['another_dir', 'one_dir'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\another_dir    is_direct_parent: NO
           dirnames  : ['notseen', 'notseen2']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir    is_direct_parent: NO
           dirnames  : ['bar25', 'ber']
          filenames  : ['photo in one_dir.jpeg', 'tabula.xls']
dirnames to explore  : ['bar25', 'ber'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir\bar25    is_direct_parent: NO
           dirnames  : ['MONO8', 'TURI2501', 'TURI2502', 'TURI4813']
          filenames  : []
dirnames to explore  : ['TURI2501', 'TURI2502'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir\bar25\TURI2501    is_direct_parent: YES
           dirnames  : []
          filenames  : ['beretta.xls', 'italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']
           dirnames  : not to be explored 
  yielded filenames  : ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']

explored dirpath : J:\froooo\one_dir\bar25\TURI2502    is_direct_parent: YES
           dirnames  : []
          filenames  : ['adamante.jpeg', 'egyptic.txt', 'urubu.rtf']
           dirnames  : not to be explored 
  yielded filenames  : ['adamante.jpeg', 'urubu.rtf']

explored dirpath : J:\froooo\one_dir\ber    is_direct_parent: NO
           dirnames  : ['MONO532', 'TURI', 'TURI30']
          filenames  : []
dirnames to explore  : ['MONO532'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir\ber\MONO532    is_direct_parent: YES
           dirnames  : []
          filenames  : ['bacillus.jpeg', 'blueberry.dat', 'Perfume.doc']
           dirnames  : not to be explored 
  yielded filenames  : ['bacillus.jpeg', 'blueberry.dat']

SELECTED (dirpath, dirnames, filenames) :
('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat'])
('J:\\froooo\\one_dir\\bar25\\TURI2501', [], ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf'])
('J:\\froooo\\one_dir\\bar25\\TURI2502', [], ['adamante.jpeg', 'urubu.rtf'])
('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat'])
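
Since the original question asks for whole file names, the yielded triples can be flattened into full paths. This is a minimal sketch: iter_full_paths is a hypothetical helper that is not part of select_walk(), and it defaults to ntpath.join only so that the Windows-style paths of this example join the same way on any platform. The sample triples are copied from the SELECTED output above.

```python
import ntpath


def iter_full_paths(triples, join=ntpath.join):
    """Flatten (dirpath, dirnames, filenames) triples into full file paths."""
    for dirpath, _dirnames, filenames in triples:
        for name in filenames:
            yield join(dirpath, name)


# Two of the triples shown in the SELECTED output above.
selected = [
    ('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat']),
    ('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat']),
]
paths = list(iter_full_paths(selected))
```

In real use the argument would be the generator itself, as in iter_full_paths(select_walk(pat_file, 'J:\\')), with join=os.path.join on the platform where the walk runs.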

