
Improve efficiency of Python os.walk + regular expression algorithm

I'm using os.walk to select files from a specific folder which match a regular expression.

import os
import re

for dirpath, dirs, files in os.walk(str(basedir)):
    files[:] = [f for f in files if re.match(regex, os.path.join(dirpath, f))]
    print(dirpath, dirs, files)

But this has to process every file and folder under basedir, which is quite time-consuming. I'm looking for a way to reuse the regular expression used for the files to filter out unwanted directories at each step of the walk, or a way to match only part of the regex.

For example, in a structure like

/data/2013/07/19/file.dat

and using, e.g., the following regular expression

/data/(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat)

I'd like to find all .dat files without having to look into, e.g., /data/2012.

If, for example, you want only the files in /data/2013/07/19 to be processed, just start os.walk() with /data/2013/07/19 as the top directory. This is similar to Tommi Komulainen's suggestion, but you needn't modify the loop code.
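As a rough illustration of this suggestion, the sketch below wraps the question's loop in a hypothetical helper, matching_files (the name and the visited-directory counter are not from the thread), so the same regex can be run from a broad top and from a narrow one. It assumes POSIX-style "/" separators, as in the question's regex.

```python
import os
import re


def matching_files(top, regex):
    """Walk `top` with the question's loop; return (sorted matches, directories visited)."""
    pattern = re.compile(regex)
    matches = []
    visited = 0
    for dirpath, dirs, files in os.walk(top):
        visited += 1  # count every directory that os.walk hands us
        for f in files:
            full = os.path.join(dirpath, f)
            # Same test as in the question: match the regex against the full path
            if pattern.match(full):
                matches.append(full)
    return sorted(matches), visited
```

Calling matching_files("/data/2013/07/19", regex) instead of matching_files("/data", regex) should return the same files while visiting only the directories below the narrower top. The regex and the loop body stay unchanged.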

I stumbled upon this problem (it's pretty clear what the problem is, even if there's no actual question). Since no one has answered it, I guess an answer might still be useful, even if quite late.

You need to split the original regex into one segment per path component, so you can filter the intermediate directories inside the loop. Filter the directories first, and then match the files.

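import os
import re

regex_parts = regex.split("/")
del regex_parts[0]  # regex starts with "/", so [0] is "" and isn't needed

for base, dirs, files in os.walk(root):
    # The depth of base below root picks the regex segment for its subdirectories,
    # so sibling directories at the same level are all tested against the same segment.
    rel = os.path.relpath(base, root)
    depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
    if depth < len(regex_parts) - 1:
        # Prune in place: os.walk only descends into the directories left in dirs
        dirs[:] = [d for d in dirs if re.match(regex_parts[depth], d)]
        continue

    files[:] = [f for f in files if re.match(regex, os.path.join(base, f))]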

Since you are matching files (the last part of the path), there's no reason to do the actual match until you have filtered out as many directories as possible. The len check is there so directories that might match the last part don't get clobbered. This could possibly be made more efficient, but it worked for me (I had a similar problem just today).
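To try the idea on a throwaway tree, the pruning can be packaged as a generator. This is only a sketch under a few assumptions that go beyond the answer: the helper name walk_pruned is hypothetical, the regex is given relative to root rather than as an absolute path, no "/" appears inside a group or character class, and each segment must match a whole name (re.fullmatch, Python 3.4+) instead of being tested with re.match against the full path.

```python
import os
import re


def walk_pruned(root, regex):
    """Yield files under `root` selected by a "/"-separated regex relative to `root`.

    Every segment except the last is matched against directory names at the
    corresponding depth; the last segment is matched against file names.
    """
    parts = regex.strip("/").split("/")
    dir_patterns = [re.compile(p) for p in parts[:-1]]
    file_pattern = re.compile(parts[-1])

    for base, dirs, files in os.walk(root):
        rel = os.path.relpath(base, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth < len(dir_patterns):
            # Above the file level: prune in place so os.walk skips the other branches
            dirs[:] = [d for d in dirs if dir_patterns[depth].fullmatch(d)]
            continue
        # File level reached: stop descending, then match the file names
        dirs[:] = []
        for f in files:
            if file_pattern.fullmatch(f):
                yield os.path.join(base, f)
```

For the layout in the question, list(walk_pruned("/data", r"(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat)")) would never descend into /data/2012. Matching segment by segment gives up the named groups of a single full-path match, so run the full regex on each yielded path if those groups are needed.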
