How to recursively walk through directories and subdirectories and record top level directory information

I have the following directory structure, in a directory named Python-Pathlib-Scan-Directory:

.
├── File_Extension_Review_20220704.ipynb
├── File_Extension_Review_SIMCARE_20220704.ipynb
├── Project1
│   ├── data_1.1.csv
│   ├── data_1.2.xlsx
│   ├── data_3.1.xlsx
│   └── info.txt
├── Project2
│   ├── data_2.1.csv
│   ├── data_2.2.xlsx
│   └── resources.docx
├── Project3
│   └── Info.txt
├── data_1.csv
├── data_2.csv
├── data_3.csv
├── output.csv
├── script_1.py
└── script_2.ipynb

3 directories, 16 files

I want to count the frequency of file types (extensions) within it using collections.Counter and return the result as a pandas DataFrame by passing the counts in as a dict.

I have the following code that does this:

import collections
from pathlib import Path

import pandas as pd

dir_to_scan = Path("/Python-Pathlib-Scan-Directory")

all_files = []
# Iterate recursively using rglob(); '*.*' matches only names that contain a dot
for i in dir_to_scan.rglob('*.*'):
    if i.is_file():
        all_files.append(i.suffix)

# Count values and return key:value pairs denoting extension and count
data = collections.Counter(all_files)
data

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={"index":"Extension", 0:"Count"})
df

Output:

Extension   Count
.csv        6
.ipynb      3
.py         1
.txt        2
.xlsx       3
.docx       1


My issue is that this summarises across the whole directory tree, whereas I want it to summarise at each level (root directory, Project1 subdirectory, Project2 subdirectory, etc.). I could then concatenate the results into one df with an extra column specifying the directory alongside the counts, so that I can group by it later. Perhaps using path.parent?

Any suggestions on the best way to approach this?

I am also mindful that I may want to use something similar when concatenating only the files in given directories, rather than walking through everything and concatenating all files together at once.

Using the Python standard library pathlib module and a recursive function, here is one way to do it:

from pathlib import Path

def scan(target, results=None):
    """Helper function that scans a directory
    and its sub-directories for file extensions.

    Args:
        target: target directory.
        results: dictionary to collect results. Defaults to None.

    Returns:
        dictionary whose keys are the scanned directories
        and values are the collected extensions.

    """
    if results is None:
        results = {}
    results[str(Path(target))] = []
    for item in Path(target).glob("*"):
        if not item.is_file():
            scan(item, results)
        else:
            suffix = item.suffix if item.suffix else "no_ext"
            results[str(Path(target))].append(suffix)
    return results
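
For comparison, the same mapping can probably be built without explicit recursion by letting os.walk do the traversal. This is only a sketch: the name scan_with_walk is hypothetical and not part of the answer above, and symbolic links may be treated differently from the recursive version because os.walk does not follow links to directories by default.

```python
import os
from pathlib import Path


def scan_with_walk(target):
    """Map each directory under target to the list of file suffixes it holds.

    Hypothetical alternative to scan(): same shape of result,
    but the traversal is delegated to os.walk.
    """
    results = {}
    for dirpath, _dirnames, filenames in os.walk(target):
        # Files without a suffix are recorded as "no_ext", as in scan()
        results[str(Path(dirpath))] = [
            Path(name).suffix or "no_ext" for name in filenames
        ]
    return results
```

Directories that contain no files still get a key with an empty list, so the result should drop into the DataFrame construction used with scan() unchanged.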

So, given a fake directory that contains several sub-directories and files with and without extensions:

from collections import Counter

import pandas as pd

results = scan(r"C:\fake_dir")

# Count values and instantiate dataframe
df = pd.DataFrame(
    [dict(Counter(value)) for value in results.values()], index=results.keys()
).fillna(0)

# Sort columns ("no_ext" meaning "Files without extension" appears last)
df = df.reindex(columns=sorted(df.columns))
print(df)
# Output
                                          .docx  .ini  .jpeg  .jpg  .pdf  \
C:\fake_dir                                 1.0   1.0    0.0   0.0   1.0   
C:\fake_dir\fake_data                       0.0   0.0    0.0   4.0   0.0   
C:\fake_dir\fake_data\empty_dir             0.0   0.0    0.0   0.0   0.0   
C:\fake_dir\fake_data\source_dir            0.0   0.0    1.0   2.0   0.0   
C:\fake_dir\fake_data\source_dir\sub_dir    0.0   0.0    0.0   0.0   0.0   

                                          .png  .raw  .tif  no_ext  
C:\fake_dir                                0.0   0.0   0.0     0.0  
C:\fake_dir\fake_data                      0.0   0.0   0.0     1.0  
C:\fake_dir\fake_data\empty_dir            0.0   0.0   0.0     0.0  
C:\fake_dir\fake_data\source_dir           1.0   0.0   2.0     0.0  
C:\fake_dir\fake_data\source_dir\sub_dir   0.0   1.0   0.0     1.0
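
The question also asks for an extra column naming the directory so that the counts can be grouped later. A minimal sketch of that long layout, assuming rglob() plus path.parent as the question suggests, could look like the following. The function name count_extensions_by_directory and the column names Directory, Extension and Count are choices made for this illustration.

```python
from collections import Counter
from pathlib import Path

import pandas as pd


def count_extensions_by_directory(root):
    """Return one row per (directory, extension) pair found under root."""
    # Key each file by its parent directory and its suffix;
    # files without a suffix are labelled "no_ext"
    counts = Counter(
        (str(path.parent), path.suffix or "no_ext")
        for path in Path(root).rglob("*")
        if path.is_file()
    )
    rows = [
        {"Directory": directory, "Extension": extension, "Count": count}
        for (directory, extension), count in counts.items()
    ]
    df = pd.DataFrame(rows, columns=["Directory", "Extension", "Count"])
    return df.sort_values(["Directory", "Extension"]).reset_index(drop=True)
```

With this shape, df.groupby("Extension")["Count"].sum() should give the whole-tree totals from the question, while df.pivot_table(index="Directory", columns="Extension", values="Count", fill_value=0) gives a wide table similar to the one above. Unlike scan(), directories that contain no files do not appear in the result.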

