How to recursively walk through directories and subdirectories and record top-level directory information
I have the following directory structure, in a directory named Python-Pathlib-Scan-Directory:
.
├── File_Extension_Review_20220704.ipynb
├── File_Extension_Review_SIMCARE_20220704.ipynb
├── Project1
│ ├── data_1.1.csv
│ ├── data_1.2.xlsx
│ ├── data_3.1.xlsx
│ └── info.txt
├── Project2
│ ├── data_2.1.csv
│ ├── data_2.2.xlsx
│ └── resources.docx
├── Project3
│ └── Info.txt
├── data_1.csv
├── data_2.csv
├── data_3.csv
├── output.csv
├── script_1.py
└── script_2.ipynb
3 directories, 16 files
I want to count the frequency of file types (extensions) within it using collections.Counter() and return this as a pandas DataFrame by passing in the results as a dict.
I have the following code that does this:
from pathlib import Path
import collections

import pandas as pd

dir_to_scan = Path("/Python-Pathlib-Scan-Directory")
all_files = []

# Iterate recursively using rglob()
for i in dir_to_scan.rglob('*.*'):
    if i.is_file():
        all_files.append(i.suffix)

# Count values, giving key:value pairs of extension and count
data = collections.Counter(all_files)
data

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={"index": "Extension", 0: "Count"})
df
Output:
Extension Count
.csv 6
.ipynb 3
.py 1
.txt 2
.xlsx 3
.docx 1
My issue is that this summarises across the whole tree, while I want it to summarise at each level (root directory, Project1 subdirectory, Project2 subdirectory, etc.) instead. Perhaps I could concat the results together into one DataFrame, with an extra column specifying the directory, and show counts so I can group by directory later; maybe using path.parent?
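One way to sketch the per-directory idea from the question (the paths here are throwaway temp-directory stand-ins, not the real project tree): tag each file with its i.parent while walking, then group by directory and extension:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small throwaway tree so the sketch is self-contained
root = Path(tempfile.mkdtemp())
(root / "Project1").mkdir()
(root / "data_1.csv").write_text("")
(root / "Project1" / "data_1.1.csv").write_text("")
(root / "Project1" / "info.txt").write_text("")

rows = []
for i in root.rglob("*.*"):
    if i.is_file():
        # i.parent is the directory that directly contains the file
        rows.append({"Directory": str(i.parent.relative_to(root)),
                     "Extension": i.suffix})

df = pd.DataFrame(rows)
# One count per (directory, extension) pair, ready for later group-bys
counts = df.groupby(["Directory", "Extension"]).size().reset_index(name="Count")
print(counts)
```

Files sitting directly in the root show up with Directory "." here; keeping the full path instead is just a matter of dropping the relative_to call.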
Any suggestions on the best way to approach this?
I'm also mindful that I might want to use something similar when concatenating files in given directories only, rather than walking through everything and concatenating all files together at once.
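For that related concatenation case, a minimal sketch (again using a throwaway temp directory with made-up CSV contents): a non-recursive glob("*.csv") stays within one directory, unlike rglob:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway directory with two small CSVs (hypothetical data)
d = Path(tempfile.mkdtemp())
(d / "data_1.csv").write_text("a,b\n1,2\n")
(d / "data_2.csv").write_text("a,b\n3,4\n")

# glob("*.csv") stays at this level; rglob("*.csv") would walk sub-directories too
frames = [pd.read_csv(p) for p in sorted(d.glob("*.csv"))]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```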
Using the Python standard library pathlib module and a recursive function, here is one way to do it:
from pathlib import Path

def scan(target, results=None):
    """Helper function that scans a directory
    and its sub-directories for file extensions.

    Args:
        target: target directory.
        results: dictionary to collect results. Defaults to None.

    Returns:
        dictionary whose keys are the scanned directories
        and whose values are the collected extensions.
    """
    if results is None:
        results = {}
    results[str(Path(target))] = []
    for item in Path(target).glob("*"):
        if not item.is_file():
            scan(item, results)
        else:
            suffix = item.suffix if item.suffix else "no_ext"
            results[str(Path(target))].append(suffix)
    return results
And so, given a fake directory which contains several sub-directories and files both with and without extensions:
from collections import Counter

import pandas as pd

results = scan(r"C:\fake_dir")

# Count values and instantiate the dataframe
df = pd.DataFrame(
    [dict(Counter(value)) for value in results.values()], index=results.keys()
).fillna(0)

# Sort columns ("no_ext", meaning "files without extension", appears last)
df = df.reindex(columns=sorted(df.columns))

print(df)
# Output
.docx .ini .jpeg .jpg .pdf \
C:\fake_dir 1.0 1.0 0.0 0.0 1.0
C:\fake_dir\fake_data 0.0 0.0 0.0 4.0 0.0
C:\fake_dir\fake_data\empty_dir 0.0 0.0 0.0 0.0 0.0
C:\fake_dir\fake_data\source_dir 0.0 0.0 1.0 2.0 0.0
C:\fake_dir\fake_data\source_dir\sub_dir 0.0 0.0 0.0 0.0 0.0
.png .raw .tif no_ext
C:\fake_dir 0.0 0.0 0.0 0.0
C:\fake_dir\fake_data 0.0 0.0 0.0 1.0
C:\fake_dir\fake_data\empty_dir 0.0 0.0 0.0 0.0
C:\fake_dir\fake_data\source_dir 1.0 0.0 2.0 0.0
C:\fake_dir\fake_data\source_dir\sub_dir 0.0 1.0 0.0 1.0
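If the long, group-by-friendly format the question asks about is preferred over this wide layout, the frame can be reshaped with DataFrame.melt. A sketch, with illustrative values standing in for the real counts:

```python
import pandas as pd

# Wide per-directory counts, as produced above (values here are illustrative)
df = pd.DataFrame(
    {".csv": [1.0, 1.0], ".txt": [0.0, 1.0]},
    index=[r"C:\fake_dir", r"C:\fake_dir\Project1"],
)

# Reshape to one row per (directory, extension), then drop zero counts
long_df = (
    df.rename_axis("Directory")
      .reset_index()
      .melt(id_vars="Directory", var_name="Extension", value_name="Count")
)
long_df = long_df[long_df["Count"] > 0].reset_index(drop=True)
print(long_df)
```

From here, a groupby("Directory") works directly, which matches the per-level summary the question is after.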