How do I search directories and find files that match a regex?

I recently started getting into Python, and I am having a hard time searching through directories and matching files against a regex that I have created.

Basically, I want it to scan all the directories inside another directory, find all the files that end with .zip, .rar, or .r01, and then run various commands depending on the file type.

import os, re

rootdir = "/mnt/externa/Torrents/completed"

for subdir, dirs, files in os.walk(rootdir):
    if re.search('(w?.zip)|(w?.rar)|(w?.r01)', files):
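        print "match: " . files

import os
import re

rootdir = "/mnt/externa/Torrents/completed"
# re.match anchors at the start of the name, so .* consumes the stem;
# the dot is escaped so that only real extensions match.
regex = re.compile(r'(.*\.zip$)|(.*\.rar$)|(.*\.r01$)')

# os.walk yields (directory, subdirectory names, file names) for every level
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if regex.match(file):
            print(file)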
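
When the extensions are fixed strings, a regex is not strictly needed. Here is a minimal sketch of the same walk using `str.endswith` with a tuple of suffixes. The helper name `find_archives` and the constant `ARCHIVE_SUFFIXES` are made up for illustration, and the lower-casing is an assumption about wanting case-insensitive matching.

```python
import os

# Extensions from the question, written in lower case.
ARCHIVE_SUFFIXES = (".zip", ".rar", ".r01")

def find_archives(rootdir):
    """Return full paths of files under rootdir that end in an archive suffix.

    find_archives is a hypothetical helper name, not part of any library.
    """
    matches = []
    for root, dirs, files in os.walk(rootdir):
        for name in files:
            # str.endswith accepts a tuple and is True if any suffix matches.
            # Lower-casing the name makes the check case-insensitive (assumption).
            if name.lower().endswith(ARCHIVE_SUFFIXES):
                matches.append(os.path.join(root, name))
    return matches
```

Calling `find_archives("/mnt/externa/Torrents/completed")` would return a list of full paths that can then be processed one by one.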

The code below answers the question in the following comment:

That worked really well. Is there a way to do one thing if the match is in regex group 1, another if it is in regex group 2, and so on? – nillenilsson

import os
import re

# One capture group per extension, so the matching group identifies the file type
regex = re.compile(r'(.*\.zip$)|(.*\.rar$)|(.*\.r01$)')

for root, dirs, files in os.walk("../Documents"):
    for file in files:
        res = regex.match(file)
        if res:
            # Exactly one group is set on a match; the others are None
            if res.group(1):
                print("ZIP", file)
            elif res.group(2):
                print("RAR", file)
            elif res.group(3):
                print("R01", file)

It might be possible to do this in a nicer way, but this works.
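
One possibly nicer variant is to name the groups and look up a handler by `Match.lastgroup`, which holds the name of the last capturing group that matched. This is only a sketch: `classify`, `command_for`, the `HANDLERS` table, and the `handle_*` functions are hypothetical placeholders for whatever commands should run per file type.

```python
import re

# One named group per archive type; only one alternative can match a given name.
ARCHIVE_RE = re.compile(r".*\.(?:(?P<zip>zip)|(?P<rar>rar)|(?P<r01>r01))$")

def classify(filename):
    """Return 'zip', 'rar', 'r01', or None if the name does not match."""
    m = ARCHIVE_RE.match(filename)
    # lastgroup is the name of the last capturing group that took part in the match
    return m.lastgroup if m else None

# Placeholder handlers (hypothetical); replace the bodies with real commands.
def handle_zip(path):
    return "ZIP " + path

def handle_rar(path):
    return "RAR " + path

def handle_r01(path):
    return "R01 " + path

HANDLERS = {"zip": handle_zip, "rar": handle_rar, "r01": handle_r01}

def command_for(filename):
    """Dispatch to the handler for the file type, or return None if no match."""
    kind = classify(filename)
    return HANDLERS[kind](filename) if kind else None
```

Adding a new file type then means adding one named group and one entry in the table, rather than another `if` branch.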

Given that you are a beginner, I would recommend using glob instead of a quickly written file-walking regex matcher.

Snippets of functions using glob and a file-walking regex matcher

The snippet below contains two file-search functions (one using glob and the other using a custom file-walking regex matcher). It also contains a "stopwatch" decorator to time the two functions.

import glob
import os
import re
import time
from datetime import timedelta

def stopwatch(method):
    """Decorator that prints how long each call to method takes."""
    def timed(*args, **kw):
        ts = time.perf_counter()
        result = method(*args, **kw)
        te = time.perf_counter()
        duration = timedelta(seconds=te - ts)
        print(f"{method.__name__}: {duration}")
        return result
    return timed

@stopwatch
def get_filepaths_with_oswalk(root_path: str, file_regex: str):
    # Walks root_path recursively and matches each file name against a regex
    files_paths = []
    pattern = re.compile(file_regex)
    for root, directories, files in os.walk(root_path):
        for file in files:
            if pattern.match(file):
                files_paths.append(os.path.join(root, file))
    return files_paths


@stopwatch
def get_filepaths_with_glob(root_path: str, file_regex: str):
    # Despite the parameter name, glob takes a shell-style wildcard pattern,
    # not a regex, and this call only looks in root_path itself (no recursion)
    return glob.glob(os.path.join(root_path, file_regex))

Comparing runtimes of the above functions

Using the above two functions to find 5076 files matching the pattern filename_*.csv in a directory called root_path (containing 66,948 files):

>>> glob_files = get_filepaths_with_glob(root_path, 'filename_*.csv')
get_filepaths_with_glob: 0:00:00.176400

>>> oswalk_files = get_filepaths_with_oswalk(root_path,'filename_(.*).csv')
get_filepaths_with_oswalk: 0:03:29.385379

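The glob method is much faster, and the code for it is shorter. Note that the two calls are not strictly equivalent: os.walk descends into every subdirectory, while this glob call only looks in root_path itself.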

For your case

For your case, you can probably use something like the following to get your *.zip, *.rar and *.r01 files:

files = []
for ext in ['*.zip', '*.rar', '*.r01']:
    files += get_filepaths_with_glob(root_path, ext) 
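
The question asks about files inside subdirectories, which a plain `glob.glob` call on `root_path` does not reach. Here is a minimal sketch of a recursive variant that uses the `**` wildcard together with `recursive=True`. The helper name `find_by_extensions` and its default extension list are assumptions for illustration.

```python
import glob
import os

def find_by_extensions(root_path, extensions=("zip", "rar", "r01")):
    """Recursively collect files under root_path with the given extensions.

    find_by_extensions is a hypothetical helper name used for illustration.
    """
    paths = []
    for ext in extensions:
        # With recursive=True, '**' matches zero or more nested directories,
        # so files directly in root_path are included as well.
        pattern = os.path.join(root_path, "**", "*." + ext)
        paths.extend(glob.glob(pattern, recursive=True))
    return sorted(paths)
```

On a very large tree a recursive glob has to visit every directory, so the speed advantage over `os.walk` may be smaller than in the timing shown earlier.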

Here's an alternative using glob from pathlib.

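from pathlib import Path

rootdir = "/mnt/externa/Torrents/completed"
for extension in 'zip rar r01'.split():
    # Path.glob is not recursive here; use rglob to descend into subdirectories
    for path in Path(rootdir).glob('*.' + extension):
        # path is a Path object, so convert it before concatenating
        print("match: " + str(path))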

I would do it this way:

import re
from pathlib import Path

def glob_re(path, regex="", glob_mask="**/*", inverse=False):
    p = Path(path)
    if inverse:
        res = [str(f) for f in p.glob(glob_mask) if not re.search(regex, str(f))]
    else:
        res = [str(f) for f in p.glob(glob_mask) if re.search(regex, str(f))]
    return res

NOTE: by default it recursively scans all subdirectories. To scan only the current directory, explicitly specify glob_mask="*".
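
A usage sketch for the extensions in the question, with `glob_re` repeated in condensed form so the example runs on its own. The pattern constant and the two wrapper names (`find_archives_recursive`, `find_archives_top_level`) are assumptions made for illustration.

```python
import re
from pathlib import Path

def glob_re(path, regex="", glob_mask="**/*", inverse=False):
    # Same behaviour as the function above, condensed: keep entries whose
    # path matches the regex, or those that do not match when inverse=True.
    p = Path(path)
    return [str(f) for f in p.glob(glob_mask)
            if bool(re.search(regex, str(f))) != inverse]

# Regex for the three extensions; the dot is escaped and $ anchors at the end.
ARCHIVE_PATTERN = r"\.(zip|rar|r01)$"

def find_archives_recursive(rootdir):
    """Search rootdir and all subdirectories (the default glob_mask)."""
    return glob_re(rootdir, ARCHIVE_PATTERN)

def find_archives_top_level(rootdir):
    """Search only the entries directly inside rootdir."""
    return glob_re(rootdir, ARCHIVE_PATTERN, glob_mask="*")
```

Because the regex is applied to the whole path string, and the glob yields directories as well as files, a directory whose name ends in `.zip` would also be returned.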


 