Python, Pandas: Faster File Search than os.path?
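I have a pandas df whose filenames need to be searched for/matched in a directory tree.
I have been using the following, but it crashes on larger directory structures. I record whether or not each file turns up in 2 lists.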
found = []
missed = []
for target_file in df_files['Filename']:
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)
I've read that scandir is faster and can handle larger directory trees. If that is true, how could this be rewritten?
My attempt:
found = []
missed = []
for target_file in df_files['Filename']:
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)
This runs (quickly), but everything ends up in the "missed" list.
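A likely explanation, judging only from the snippet: os.scandir lists just the immediate entries of DIRECTORY_TREE and does not descend into subdirectories, name is an attribute of os.DirEntry rather than a method, and dirpath is never defined in that loop. Below is a minimal sketch of a recursive scandir pass that walks the tree once and checks every file against a set of targets. The helper names iter_files and find_targets and the demo directory are hypothetical, not part of the original post.

```python
import os
import tempfile


def iter_files(root):
    """Yield (directory, filename) for every regular file below root."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # Do not follow symlinked directories, to avoid cycles.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file():
                    yield current, entry.name  # .name is an attribute


def find_targets(root, targets):
    """Walk the tree once; return (found_paths, missed_names)."""
    wanted = set(targets)  # set membership is O(1) per lookup
    found, seen = [], set()
    for directory, name in iter_files(root):
        if name in wanted:
            found.append(os.path.join(directory, name))
            seen.add(name)
    # A name is missed only if it was found nowhere in the tree.
    missed = [t for t in targets if t not in seen]
    return found, missed


if __name__ == "__main__":
    # Hypothetical demo tree built in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "pkg", "sub"))
        open(os.path.join(tmp, "pkg", "sub", "__init__.py"), "w").close()
        found, missed = find_targets(tmp, ["__init__.py", "absent.txt"])
        print("Found: ", len(found), "Missed: ", len(missed))
```

With a df like the one in the question, the call would presumably be find_targets(DIRECTORY_TREE, df_files['Filename'].tolist()).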
Scan the directory only once and convert the result to a dataframe.
Example on my venv directory:
import pandas as pd
import pathlib
DIRECTORY_TREE = pathlib.Path('./venv').resolve()
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])
df_files = pd.DataFrame({'Filename': ['__init__.py']})
Now you can use merge to look up the filenames from df_files in df_path:
out = (df_files.merge(df_path, on='Filename', how='left')
               .value_counts('Filename').to_frame('Found'))
out['Missed'] = len(df_path) - out['Found']
print(out.reset_index())
# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
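If what is wanted is the two lists from the question (full paths that were found, filenames that were missed), one possible variation on the merge above is to pass indicator=True and split on the resulting _merge column. This is only a sketch: the small df_path and df_files below are made-up stand-ins for the real frames.

```python
import os

import pandas as pd

# Made-up stand-ins for the scanned tree and the filenames to look up.
df_path = pd.DataFrame({
    'Directory': ['/tmp/a', '/tmp/a/b', '/tmp/c'],
    'Filename': ['__init__.py', '__init__.py', 'setup.py'],
})
df_files = pd.DataFrame({'Filename': ['__init__.py', 'missing.txt']})

# indicator=True adds a '_merge' column: 'both' for matches,
# 'left_only' for filenames that appear nowhere in df_path.
merged = df_files.merge(df_path, on='Filename', how='left', indicator=True)
hit = merged['_merge'] == 'both'

# One full path per match; a filename may match in several directories.
found = [os.path.join(d, f)
         for d, f in zip(merged.loc[hit, 'Directory'],
                         merged.loc[hit, 'Filename'])]
missed = merged.loc[~hit, 'Filename'].unique().tolist()

print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
```

Because the left merge keeps one row for an unmatched filename, splitting on _merge avoids counting that row as a hit.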