

Cleaning up a list of folders, only keeping the top-level folder from each group of folders

I have just started programming in Python and hope some of you with more experience can give me a hint about how to optimize the code below.

What I am trying to do is go through a list of folders and build a new list containing only the top-level folder from each group of folders.

After some struggle I wrote the code below, which does the job but scales terribly when used on lists containing thousands of folders.

Any ideas on how to optimize this routine are most welcome.

from collections import Counter

folderlist = ["c:\\temp\\data\\1122 AA",
              "c:\\temp\\data\\1122 AA\\Div",
              "c:\\temp\\data\\1122 AA\\Div\\Etc",
              "c:\\temp\\data\\1122 AA\\Div\\Etc2",
              "c:\\temp\\server1\\div\\2244_BB",
              "c:\\temp\\server1\\div\\2244_BB\\pp",
              "c:\\temp\\server1\\div\\2244_BB\\der\\dedd",
              "c:\\temp\\server1\\div\\2244_BB\\defwe23d\\23ded",
              "c:\\temp\\123456789-BB",
              "c:\\temp\\123456789-BB\\pp",
              "c:\\temp\\123456789-BB\\der\\dee32d",
              "c:\\temp\\data\\123456789-BB\\ded\\ve_23"]

l2 = folderlist.copy()
ind = []
indexes_to_be_deleted = []

# Compare every path with every path (quadratic in the list length) and
# record the index of each path x that contains el as a substring.
for el in l2:
    for idx, x in enumerate(l2):
        if el in x:
            ind.append(idx)

# Every path matches itself once, so a count above 1 means the path
# also contains some other path from the list.
counts = Counter(ind)

for index, count in counts.most_common():
    if count > 1:
        indexes_to_be_deleted.append(index)

# Delete from the end so the remaining indexes stay valid.
for i in sorted(indexes_to_be_deleted, reverse=True):
    del folderlist[i]

Output:
c:\\temp\\data\\1122 AA\\
c:\\temp\\server1\\div\\2244_BB\\
c:\\temp\\123456789-BB\\

The output is as expected: only the top-level folder from each group of folders. However, I hope some of you have an idea of how to make the routine faster.

I would suggest adding to a new list rather than removing items:

topFolders = []
for name in folderlist:  # use sorted(folderlist) if they are not already in order
    if topFolders and name.startswith(topFolders[-1] + "\\"):
        continue  # name is inside the last top-level folder that was kept
    topFolders.append(name)

You can assign it back to the original list if necessary:

folderlist = topFolders
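
One edge case worth checking with the sorted variant: a plain string sort can place a sibling such as "c:\\a b" between "c:\\a" and "c:\\a\\c", because a space sorts before a backslash, and the loop then compares "c:\\a\\c" against the wrong folder. A minimal sketch that sorts by path components instead, assuming "\\" is the only separator and that names should match case-sensitively (the function name `top_level_folders` and the sample paths are made up for illustration):

```python
def top_level_folders(paths, sep="\\"):
    """Return the paths that are not inside any other path in `paths`."""
    # Sort by the list of path components rather than by the raw string,
    # so that everything below a folder comes directly after that folder.
    # set() drops exact duplicates first.
    ordered = sorted(set(paths), key=lambda p: p.split(sep))
    kept = []
    for path in ordered:
        if kept and path.startswith(kept[-1] + sep):
            continue  # inside the most recently kept folder
        kept.append(path)
    return kept


# Hypothetical input where a plain string sort gives the wrong grouping.
folders = ["c:\\a\\c", "c:\\a b", "c:\\a", "c:\\a b\\d"]
for folder in top_level_folders(folders):
    print(folder)
```

The only change from the loop above is the sort key; the cost is still dominated by the sort.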

I thought I would post my somewhat over-engineered, recursive, tree-based solution, since (a) I wrote it before seeing (and upvoting) Alain T.'s answer, and (b) I think it should be asymptotically faster for unsorted input than sorting the list (O(n) vs. O(n log n)), though for mere thousands of paths sorting may well be faster than all this hashing etc.

from collections import defaultdict

def new_node():
    return defaultdict(new_node)

def insert_into_tree(tree, full_path, split_path):
    top_dir, *rest_of_path = split_path

    if isinstance(tree[top_dir], str):
        # A shorter path is already in the tree! Throw this path away.
        return None

    if not rest_of_path:
        # Store the full path at this leaf.
        tree[top_dir] = full_path
        return full_path

    return insert_into_tree(tree[top_dir], full_path, rest_of_path)

def get_shortest_paths(tree, paths):
    for dir_name, child in tree.items():
        if isinstance(child, str):
            paths.append(child)
        else:
            get_shortest_paths(child, paths)

folder_list = [ ... ]  # the list of folder paths goes here
folder_tree = new_node()

for full_path in folder_list:
    insert_into_tree(folder_tree, full_path, full_path.split("\\"))

shortest_paths = []
get_shortest_paths(folder_tree, shortest_paths)

print(shortest_paths)
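
For comparison, the same hashing idea can be sketched without recursion. This is a different technique from the tree above, not a rewrite of it: put all paths in a set, then drop any path that has a proper ancestor in that set. It assumes "\\" is the only separator and case-sensitive names; the function name `top_level_folders_by_lookup` and the sample paths are made up for illustration.

```python
def top_level_folders_by_lookup(paths, sep="\\"):
    """Keep each path that has no proper ancestor in `paths`."""
    known = set(paths)
    kept = []
    for path in dict.fromkeys(paths):  # drop duplicates, keep input order
        parts = path.split(sep)
        # Rebuild every proper ancestor ("c:", "c:\\temp", ...) and look it up.
        has_ancestor = any(
            sep.join(parts[:i]) in known for i in range(1, len(parts))
        )
        if not has_ancestor:
            kept.append(path)
    return kept


# Hypothetical, deliberately unsorted input.
folders = ["c:\\temp\\x\\y", "c:\\temp\\x", "c:\\other\\z", "c:\\temp\\x\\y\\w"]
for folder in top_level_folders_by_lookup(folders):
    print(folder)
```

No sorting is needed, and the work per path depends on its depth rather than on the length of the list.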

