Clean up a list of folders, keeping only the top-level folder from each group of folders

Question

I have just started programming in Python and hope some of you more experienced programmers could give me a hint about how to optimize the code below.

What I am trying to do is go through a list of folders and make a new list containing only the top-level folder from each group of folders.

I have struggled and written the code below, which does the job but scales terribly when used on lists containing thousands of folders.

Any ideas on how to optimize this routine are most welcome.

from collections import Counter

folderlist = [
    "c:\\temp\\data\\1122 AA",
    "c:\\temp\\data\\1122 AA\\Div",
    "c:\\temp\\data\\1122 AA\\Div\\Etc",
    "c:\\temp\\data\\1122 AA\\Div\\Etc2",
    "c:\\temp\\server1\\div\\2244_BB",
    "c:\\temp\\server1\\div\\2244_BB\\pp",
    "c:\\temp\\server1\\div\\2244_BB\\der\\dedd",
    "c:\\temp\\server1\\div\\2244_BB\\defwe23d\\23ded",
    "c:\\temp\\123456789-BB",
    "c:\\temp\\123456789-BB\\pp",
    "c:\\temp\\123456789-BB\\der\\dee32d",
    "c:\\temp\\data\\123456789-BB\\ded\\ve_23",
]

l2 = folderlist.copy()
ind = []
indexes_to_be_deleted = []

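# Record the index of every entry that contains some entry as a substring.
for el in l2:
    for idx, x in enumerate(l2):
        if el in x:
            ind.append(idx)

counts = Counter(ind)

# Every entry matches itself once; a count above 1 means it also contains
# another entry, i.e. it is a subfolder.
for idx, count in counts.most_common():
    if count > 1:
        indexes_to_be_deleted.append(idx)

# Delete from the end so the remaining indexes stay valid.
for i in sorted(indexes_to_be_deleted, reverse=True):
    del folderlist[i]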

Output:
c:\\temp\\data\\1122 AA\\
c:\\temp\\server1\\div\\2244_BB\\
c:\\temp\\123456789-BB\\

The output is as expected: only the top-level folder from each group of folders. However, I hope some of you have an idea how to make the routine faster.

Answer 1

I would suggest adding to a new list rather than removing items:

topFolders = []
for name in folderlist:  # sorted(folderlist) if they are not already in order
    # Skip anything that lies below the most recently kept folder.
    if topFolders and name.startswith(topFolders[-1] + "\\"):
        continue
    topFolders.append(name)

You can assign it back to the original list if necessary:

folderlist = topFolders
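
For input that is not already grouped, one caveat may be worth noting: a plain string sort can place a sibling such as "1122 AA-x" between "1122 AA" and "1122 AA\\Div", because "-" orders before "\\". The sketch below is one possible way around that, sorting by path components instead. The function name `top_level_folders` and the `sep` parameter are made up for illustration, and it assumes paths have no trailing separator and use consistent letter case.

```python
def top_level_folders(paths, sep="\\"):
    """Return only the paths that have no ancestor in `paths`.

    Hypothetical helper, not part of the answer above.
    """
    kept = []
    # Sort by path components rather than by raw string, so that every
    # descendant follows its ancestor directly. set() drops exact duplicates.
    for name in sorted(set(paths), key=lambda p: p.split(sep)):
        if kept and name.startswith(kept[-1] + sep):
            continue  # `name` lies below the most recently kept folder
        kept.append(name)
    return kept


folders = [
    "c:\\temp\\data\\1122 AA\\Div",
    "c:\\temp\\data\\1122 AA-x",
    "c:\\temp\\data\\1122 AA",
    "c:\\temp\\123456789-BB\\pp",
    "c:\\temp\\123456789-BB",
]
print(top_level_folders(folders))
```

The loop itself is the same as above; only the sort key differs, so the cost is still dominated by the sort.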

Answer 2

I thought I would post my somewhat over-engineered, recursive, tree-based solution since (a) I wrote it before seeing (and upvoting) Alain T.'s answer, and (b) I think it should be asymptotically faster for unsorted input (O(n) vs O(n log n)) than sorting the list, though for mere thousands of paths sorting may well be faster than all this hashing etc.

from collections import defaultdict

def new_node():
    return defaultdict(new_node)

def insert_into_tree(tree, full_path, split_path):
    top_dir, *rest_of_path = split_path

    if isinstance(tree[top_dir], str):
        # A shorter path is already in the tree! Throw this path away.
        return None

    if not rest_of_path:
        # Store the full path at this leaf.
        tree[top_dir] = full_path
        return full_path

    return insert_into_tree(tree[top_dir], full_path, rest_of_path)

def get_shortest_paths(tree, paths):
    for dir_name, child in tree.items():
        if isinstance(child, str):
            paths.append(child)
        else:
            get_shortest_paths(child, paths)

folder_list = [ ... ]
folder_tree = new_node()

for full_path in folder_list:
    insert_into_tree(folder_tree, full_path, full_path.split("\\"))

shortest_paths = []
get_shortest_paths(folder_tree, shortest_paths)

print(shortest_paths)
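
The same tree idea can also be written without recursion. The following is only a sketch under stated assumptions, not a replacement for the code above: `shortest_paths_iterative`, the `sep` parameter, and the `END` marker are names invented here, and the result order is not guaranteed, so the usage sorts it before printing.

```python
def shortest_paths_iterative(paths, sep="\\"):
    """Keep each path that has no ancestor in `paths`, using a dict-based trie.

    Hypothetical helper; works for input in any order.
    """
    END = object()  # marker key: a listed folder ends at this node
    root = {}
    for path in paths:
        node = root
        for part in path.split(sep):
            if END in node:
                break  # an ancestor is already stored; ignore this path
            node = node.setdefault(part, {})
        else:
            # No ancestor was met: this path replaces anything stored below it.
            node.clear()
            node[END] = path

    # Walk the trie with an explicit stack and collect the stored paths.
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        if END in node:
            result.append(node[END])
        else:
            stack.extend(node.values())
    return result


folders = [
    "c:\\temp\\data\\1122 AA\\Div\\Etc",
    "c:\\temp\\data\\1122 AA",
    "c:\\temp\\123456789-BB\\pp",
    "c:\\temp\\123456789-BB",
    "c:\\temp\\data\\1122 AA\\Div",
]
print(sorted(shortest_paths_iterative(folders)))
```

Clearing a node when a shorter path arrives is what makes the insertion order irrelevant: a longer path stored earlier is simply discarded.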

2 answers

Answer 1: score 2, accepted, 2019-02-12 23:44:56
Answer 2: score 0, 2019-02-13 00:30:48
