How do I search directories and find files that match a regex?
I recently started getting into Python, and I am having a hard time searching through directories and matching files based on a regex that I have created.
Basically, I want it to scan through all the directories in another directory, find all the files that end with .zip, .rar, or .r01, and then run various commands based on what file it is.
import os, re
rootdir = "/mnt/externa/Torrents/completed"
for subdir, dirs, files in os.walk(rootdir):
    if re.search('(w?.zip)|(w?.rar)|(w?.r01)', files):
        print "match: " . files
import os
import re
rootdir = "/mnt/externa/Torrents/completed"
regex = re.compile('(.*zip$)|(.*rar$)|(.*r01$)')
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if regex.match(file):
            print(file)
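One detail worth noting: the pattern above has no literal dot, so a name such as notazip would also match. A minimal sketch of a stricter variant follows, assuming you want the dot to be required and upper-case extensions to count as well. The names ARCHIVE_RE and find_archives are hypothetical and not part of the original answer.

```python
import os
import re

# The escaped dot rejects names like "notazip"; IGNORECASE accepts "B.RAR".
ARCHIVE_RE = re.compile(r"\.(zip|rar|r01)$", re.IGNORECASE)


def find_archives(rootdir):
    """Yield full paths of files under rootdir ending in .zip, .rar or .r01."""
    for root, dirs, files in os.walk(rootdir):
        for name in files:
            # search() scans the name; the $ anchor pins the match to the end.
            if ARCHIVE_RE.search(name):
                yield os.path.join(root, name)


if __name__ == "__main__":
    # Path taken from the question; os.walk yields nothing if it is missing.
    for path in find_archives("/mnt/externa/Torrents/completed"):
        print(path)
```

Returning the joined path rather than the bare file name makes it easier to pass the result to whatever command handles the archive.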
The code below answers the question in the following comment:
That worked really well. Is there a way to do one thing if the match is found on regex group 1, another if the match is found on regex group 2, and so on?
– nillenilsson
import os
import re
regex = re.compile('(.*zip$)|(.*rar$)|(.*r01$)')
for root, dirs, files in os.walk("../Documents"):
    for file in files:
        res = regex.match(file)
        if res:
            # Only the group for the alternative that matched is set.
            if res.group(1):
                print("ZIP", file)
            elif res.group(2):
                print("RAR", file)
            elif res.group(3):
                print("R01", file)
It might be possible to do this in a nicer way, but this works.
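One possible "nicer way", offered only as a sketch: give each alternative a named group and read the match object's lastgroup attribute to learn which one fired, then look up a handler in a dict. The handler functions and the HANDLERS and dispatch names are hypothetical placeholders for whatever commands you want to run.

```python
import re

# Same three alternatives as above, but each group carries a name.
ARCHIVE_RE = re.compile(r"(?P<zip>.*zip$)|(?P<rar>.*rar$)|(?P<r01>.*r01$)")


def handle_zip(filename):
    return f"ZIP {filename}"


def handle_rar(filename):
    return f"RAR {filename}"


def handle_r01(filename):
    return f"R01 {filename}"


# Maps a group name to the (hypothetical) action for that file type.
HANDLERS = {"zip": handle_zip, "rar": handle_rar, "r01": handle_r01}


def dispatch(filename):
    """Run the handler for the alternative that matched, or return None."""
    match = ARCHIVE_RE.match(filename)
    if match is None:
        return None
    # lastgroup is the name of the group that matched.
    return HANDLERS[match.lastgroup](filename)
```

Adding a new file type then means adding one named alternative and one dict entry, instead of another if branch.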
Given that you are a beginner, I would recommend using glob in place of a quickly written file-walking-regex matcher.
Snippet of functions using glob and a file-walking-regex matcher
The snippet below contains two file-searching functions (one using glob and the other using a custom file-walking-regex matcher). It also contains a "stopwatch" decorator to time the two functions.
import glob
import os
import re
import time
from datetime import timedelta

def stopwatch(method):
    # Decorator that prints how long each call to `method` takes.
    def timed(*args, **kw):
        ts = time.perf_counter()
        result = method(*args, **kw)
        te = time.perf_counter()
        duration = timedelta(seconds=te - ts)
        print(f"{method.__name__}: {duration}")
        return result
    return timed

@stopwatch
def get_filepaths_with_oswalk(root_path: str, file_regex: str):
    # Walks root_path and every subdirectory, matching each file name.
    files_paths = []
    pattern = re.compile(file_regex)
    for root, directories, files in os.walk(root_path):
        for file in files:
            if pattern.match(file):
                files_paths.append(os.path.join(root, file))
    return files_paths

@stopwatch
def get_filepaths_with_glob(root_path: str, file_regex: str):
    # file_regex is a glob pattern here (e.g. '*.csv'), not a regular expression.
    return glob.glob(os.path.join(root_path, file_regex))
Using the above two functions to find 5076 files matching the pattern filename_*.csv in a directory called root_path (containing 66,948 files):
>>> glob_files = get_filepaths_with_glob(root_path, 'filename_*.csv')
get_filepaths_with_glob: 0:00:00.176400
>>> oswalk_files = get_filepaths_with_oswalk(root_path,'filename_(.*).csv')
get_filepaths_with_oswalk: 0:03:29.385379
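The glob method is much faster, and the code for it is shorter. Keep in mind that the two calls do different amounts of work: glob.glob as used here only looks at entries directly inside root_path, while os.walk also descends into every subdirectory.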
For your case, you can probably use something like the following to get your *.zip, *.rar and *.r01 files:
files = []
for ext in ['*.zip', '*.rar', '*.r01']:
    files += get_filepaths_with_glob(root_path, ext)
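To also pick up archives in subdirectories, which is what the question asks for, glob.glob accepts a `**` component together with recursive=True (Python 3.5 and later). The following is a minimal sketch under that assumption; the function name find_by_extensions is hypothetical.

```python
import glob
import os


def find_by_extensions(root_path, extensions=("zip", "rar", "r01")):
    """Return paths under root_path, at any depth, with the given extensions."""
    matches = []
    for ext in extensions:
        # "**" matches zero or more directory levels when recursive=True.
        pattern = os.path.join(root_path, "**", f"*.{ext}")
        matches.extend(glob.glob(pattern, recursive=True))
    return matches
```

By default the glob module skips names that start with a dot, so hidden files and directories are not searched.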
Here's an alternative using glob, via pathlib's Path.glob.
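from pathlib import Path
rootdir = "/mnt/externa/Torrents/completed"
for extension in 'zip rar r01'.split():
    # glob() only looks directly inside rootdir; rglob() would also search subdirectories.
    for path in Path(rootdir).glob('*.' + extension):
        # path is a Path object, so convert it before concatenating.
        print("match: " + str(path))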
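A variation on the same idea, sketched under the assumption that subdirectories should be searched too: walk the tree once with Path.rglob and compare each file's suffix against a set, instead of globbing once per extension. The names ARCHIVE_SUFFIXES and iter_archives are hypothetical.

```python
from pathlib import Path

# Suffixes include the leading dot, matching what Path.suffix returns.
ARCHIVE_SUFFIXES = {".zip", ".rar", ".r01"}


def iter_archives(rootdir):
    """Yield Path objects for archive files anywhere under rootdir."""
    for path in Path(rootdir).rglob("*"):
        # Lower-casing the suffix makes the check case-insensitive.
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            yield path
```

Each yielded path can then be inspected with path.suffix to decide which command to run for it.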
I would do it this way:
import re
from pathlib import Path

def glob_re(path, regex="", glob_mask="**/*", inverse=False):
    p = Path(path)
    if inverse:
        res = [str(f) for f in p.glob(glob_mask) if not re.search(regex, str(f))]
    else:
        res = [str(f) for f in p.glob(glob_mask) if re.search(regex, str(f))]
    return res
NOTE: By default it recursively scans all subdirectories. If you want to scan only the current directory, explicitly specify glob_mask="*".
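A short usage sketch for the function above, run against a throwaway directory so it does not depend on any real path. glob_re is restated compactly so the block runs on its own; the demo function, the file names, and the regex are illustrative assumptions.

```python
import re
import tempfile
from pathlib import Path


def glob_re(path, regex="", glob_mask="**/*", inverse=False):
    # Compact restatement of the function above: keep paths whose string
    # form matches `regex`, or the non-matching ones when inverse=True.
    return [str(f) for f in Path(path).glob(glob_mask)
            if bool(re.search(regex, str(f))) != inverse]


def demo():
    """Build a tiny tree and return names found by three kinds of call."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sub").mkdir()
        for name in ("a.zip", "notes.txt", "sub/b.rar"):
            (root / name).touch()
        pattern = r"\.(zip|rar|r01)$"
        recursive = sorted(Path(p).name for p in glob_re(root, pattern))
        top_only = sorted(Path(p).name
                          for p in glob_re(root, pattern, glob_mask="*"))
        others = sorted(Path(p).name
                        for p in glob_re(root, pattern, inverse=True))
        return recursive, top_only, others
```

Note that Path.glob yields directories as well as files, so with inverse=True the result can include directory paths such as sub.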