How to recursively walk through directories and subdirectories and record top-level directory information
I have the following directory structure, in a directory named Python-Pathlib-Scan-Directory:
.
├── File_Extension_Review_20220704.ipynb
├── File_Extension_Review_SIMCARE_20220704.ipynb
├── Project1
│ ├── data_1.1.csv
│ ├── data_1.2.xlsx
│ ├── data_3.1.xlsx
│ └── info.txt
├── Project2
│ ├── data_2.1.csv
│ ├── data_2.2.xlsx
│ └── resources.docx
├── Project3
│ └── Info.txt
├── data_1.csv
├── data_2.csv
├── data_3.csv
├── output.csv
├── script_1.py
└── script_2.ipynb
3 directories, 16 files
I want to count the frequency of file types (extensions) within it using collections.Counter() and return this as a pandas DataFrame by passing in the results as a dict.
I have the following code that does this:
from pathlib import Path
import collections

import pandas as pd

dir_to_scan = Path("/Python-Pathlib-Scan-Directory")
all_files = []

# Iterate recursively using rglob()
for i in dir_to_scan.rglob('*.*'):
    if i.is_file():
        all_files.append(i.suffix)

# Count values, giving key:value pairs of extension and count
data = collections.Counter(all_files)
data

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={"index": "Extension", 0: "Count"})
df
Output:
Extension Count
.csv 6
.ipynb 3
.py 1
.txt 2
.xlsx 3
.docx 1
My issue is that this summarises across the whole tree, while I want it to summarise at each level (root directory, Project1 subdirectory, Project2 subdirectory, etc.) instead. Perhaps I could concat the results together into one DataFrame, with an extra column specifying the directory, and show counts so I can group by directory later; maybe using path.parent?
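One way to sketch the per-directory idea from the question (the paths here are throwaway temp-directory stand-ins, not the real project tree): tag each file with its i.parent while walking, then group by directory and extension:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small throwaway tree so the sketch is self-contained
root = Path(tempfile.mkdtemp())
(root / "Project1").mkdir()
(root / "data_1.csv").write_text("")
(root / "Project1" / "data_1.1.csv").write_text("")
(root / "Project1" / "info.txt").write_text("")

rows = []
for i in root.rglob("*.*"):
    if i.is_file():
        # i.parent is the directory that directly contains the file
        rows.append({"Directory": str(i.parent.relative_to(root)),
                     "Extension": i.suffix})

df = pd.DataFrame(rows)
# One count per (directory, extension) pair, ready for later group-bys
counts = df.groupby(["Directory", "Extension"]).size().reset_index(name="Count")
print(counts)
```

Files sitting directly in the root show up with Directory "." here; keeping the full path instead is just a matter of dropping the relative_to call.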
Any suggestions on the best way to approach this?
I'm also mindful that I might want to use something similar when concatenating files in given directories only, rather than walking through everything and concatenating all files together at once.
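For that related concatenation case, a minimal sketch (again using a throwaway temp directory with made-up CSV contents): a non-recursive glob("*.csv") stays within one directory, unlike rglob:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway directory with two small CSVs (hypothetical data)
d = Path(tempfile.mkdtemp())
(d / "data_1.csv").write_text("a,b\n1,2\n")
(d / "data_2.csv").write_text("a,b\n3,4\n")

# glob("*.csv") stays at this level; rglob("*.csv") would walk sub-directories too
frames = [pd.read_csv(p) for p in sorted(d.glob("*.csv"))]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```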
Using the Python standard library pathlib module and a recursive function, here is one way to do it:
from pathlib import Path

def scan(target, results=None):
    """Helper function that scans a directory
    and its sub-directories for file extensions.

    Args:
        target: target directory.
        results: dictionary to collect results. Defaults to None.

    Returns:
        dictionary whose keys are the scanned directories
        and whose values are the collected extensions.
    """
    if results is None:
        results = {}
    results[str(Path(target))] = []
    for item in Path(target).glob("*"):
        if not item.is_file():
            scan(item, results)
        else:
            suffix = item.suffix if item.suffix else "no_ext"
            results[str(Path(target))].append(suffix)
    return results
And so, given a fake directory which contains several sub-directories and files both with and without extensions:
from collections import Counter

import pandas as pd

results = scan(r"C:\fake_dir")

# Count values and instantiate the dataframe
df = pd.DataFrame(
    [dict(Counter(value)) for value in results.values()], index=results.keys()
).fillna(0)

# Sort columns ("no_ext", meaning "files without extension", appears last)
df = df.reindex(columns=sorted(df.columns))

print(df)
# Output
.docx .ini .jpeg .jpg .pdf \
C:\fake_dir 1.0 1.0 0.0 0.0 1.0
C:\fake_dir\fake_data 0.0 0.0 0.0 4.0 0.0
C:\fake_dir\fake_data\empty_dir 0.0 0.0 0.0 0.0 0.0
C:\fake_dir\fake_data\source_dir 0.0 0.0 1.0 2.0 0.0
C:\fake_dir\fake_data\source_dir\sub_dir 0.0 0.0 0.0 0.0 0.0
.png .raw .tif no_ext
C:\fake_dir 0.0 0.0 0.0 0.0
C:\fake_dir\fake_data 0.0 0.0 0.0 1.0
C:\fake_dir\fake_data\empty_dir 0.0 0.0 0.0 0.0
C:\fake_dir\fake_data\source_dir 1.0 0.0 2.0 0.0
C:\fake_dir\fake_data\source_dir\sub_dir 0.0 1.0 0.0 1.0
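If the long, group-by-friendly format the question asks about is preferred over this wide layout, the frame can be reshaped with DataFrame.melt. A sketch, with illustrative values standing in for the real counts:

```python
import pandas as pd

# Wide per-directory counts, as produced above (values here are illustrative)
df = pd.DataFrame(
    {".csv": [1.0, 1.0], ".txt": [0.0, 1.0]},
    index=[r"C:\fake_dir", r"C:\fake_dir\Project1"],
)

# Reshape to one row per (directory, extension), then drop zero counts
long_df = (
    df.rename_axis("Directory")
      .reset_index()
      .melt(id_vars="Directory", var_name="Extension", value_name="Count")
)
long_df = long_df[long_df["Count"] > 0].reset_index(drop=True)
print(long_df)
```

From here, a groupby("Directory") works directly, which matches the per-level summary the question is after.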