Cleaning up a list of folders, only keeping the top-level folder from each group of folders
I have just started programming in Python and hope some of you with more experience can give me a hint on how to optimize the code below.
What I am trying to do is go through a list of folders and build a new list containing only the top-level folder from each group of folders.
I have struggled and written the code below, which does the job but scales terribly when used on lists containing thousands of folders.
Any ideas on how to optimize this routine are most welcome.
folderlist = [
    "c:\\temp\\data\\1122 AA",
    "c:\\temp\\data\\1122 AA\\Div",
    "c:\\temp\\data\\1122 AA\\Div\\Etc",
    "c:\\temp\\data\\1122 AA\\Div\\Etc2",
    "c:\\temp\\server1\\div\\2244_BB",
    "c:\\temp\\server1\\div\\2244_BB\\pp",
    "c:\\temp\\server1\\div\\2244_BB\\der\\dedd",
    "c:\\temp\\server1\\div\\2244_BB\\defwe23d\\23ded",
    "c:\\temp\\123456789-BB",
    "c:\\temp\\123456789-BB\\pp",
    "c:\\temp\\123456789-BB\\der\\dee32d",
    "c:\\temp\\data\\123456789-BB\\ded\\ve_23",
]
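from collections import Counter

l2 = folderlist.copy()
ind = []
indexes_to_be_deleted = []
# For every pair of entries, record the index of each entry that contains
# another entry as a substring (every entry matches itself exactly once).
for el in l2:
    for idx, x in enumerate(l2):
        if el in x:
            ind.append(idx)
counts = Counter(ind)
# An index seen more than once contains some other entry, so it is a subfolder.
for l, count in counts.most_common():
    if count > 1:
        indexes_to_be_deleted.append(l)
# Delete from the end so the remaining indexes stay valid.
for i in sorted(indexes_to_be_deleted, reverse=True):
    del folderlist[i]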
Output:
c:\\temp\\data\\1122 AA\\
c:\\temp\\server1\\div\\2244_BB\\
c:\\temp\\123456789-BB\\
The output is as expected: only the top-level folder from each group of folders. However, I hope some of you have an idea of how to make the routine faster.
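A side note on the `el in x` test in the code above: it is a substring check, not a path-prefix check, so a folder can be treated as a child of an unrelated sibling whose name merely starts the same way. A minimal sketch of the difference, using made-up folder names and a hypothetical helper `is_inside` (neither comes from the question):

```python
def is_inside(child: str, parent: str, sep: str = "\\") -> bool:
    """Return True if `child` is a strict descendant of `parent`."""
    # Appending the separator stops "c:\temp\1" from matching "c:\temp\12".
    return child.startswith(parent.rstrip(sep) + sep)


parent = "c:\\temp\\1"           # hypothetical example paths
sibling = "c:\\temp\\12"
child = "c:\\temp\\1\\logs"

print(parent in sibling)             # substring test: True (a false positive)
print(is_inside(sibling, parent))    # prefix test: False
print(is_inside(child, parent))      # prefix test: True
```

The comparison here is case-sensitive, which may or may not match how the folder names were collected.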
I would suggest adding to a new list rather than removing items:
topFolders = []
for name in folderlist:  # use sorted(folderlist) if they are not already in order
    if topFolders and name.startswith(topFolders[-1] + "\\"):
        continue
    topFolders.append(name)
You can assign it back to the original list if necessary:
folderlist = topFolders
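The loop above relies on every subfolder appearing directly after its parent. A plain string sort does not always guarantee that, because characters such as space, `-` and digits sort before the backslash; a sibling like `...\\1122 AA-old` would land between `...\\1122 AA` and `...\\1122 AA\\Div`. A hedged sketch that sorts by path components instead (the function name `top_level_folders` and the `-old` sample path are assumptions for illustration, and the comparison is case-sensitive):

```python
def top_level_folders(folders, sep="\\"):
    """Keep only folders that have no ancestor elsewhere in `folders`."""
    top = []
    # Sorting by component lists puts each folder directly before its descendants.
    for name in sorted(folders, key=lambda p: p.split(sep)):
        if top and name.startswith(top[-1] + sep):
            continue  # already covered by the last kept folder
        top.append(name)
    return top


sample = [
    "c:\\temp\\data\\1122 AA\\Div",
    "c:\\temp\\data\\1122 AA-old",   # hypothetical sibling, not in the question
    "c:\\temp\\data\\1122 AA",
    "c:\\temp\\123456789-BB\\pp",
    "c:\\temp\\123456789-BB",
]
print(top_level_folders(sample))
```

With this sample, the result keeps `1122 AA`, `1122 AA-old` and `123456789-BB`, and drops `Div` and `pp`.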
I thought I would post my somewhat over-engineered, recursive, tree-based solution since (a) I wrote it before seeing (and upvoting) Alain T.'s answer, and (b) I think it should be asymptotically faster for unsorted input (O(n) vs O(n log n)) than sorting the list, though for mere thousands of paths sorting may well be faster than all this hashing etc.
from collections import defaultdict

def new_node():
    return defaultdict(new_node)

def insert_into_tree(tree, full_path, split_path):
    top_dir, *rest_of_path = split_path
    if isinstance(tree[top_dir], str):
        # A shorter path is already in the tree! Throw this path away.
        return None
    if not rest_of_path:
        # Store the full path at this leaf.
        tree[top_dir] = full_path
        return full_path
    return insert_into_tree(tree[top_dir], full_path, rest_of_path)

def get_shortest_paths(tree, paths):
    for dir_name, child in tree.items():
        if isinstance(child, str):
            paths.append(child)
        else:
            get_shortest_paths(child, paths)

folder_list = [ ... ]
folder_tree = new_node()
for full_path in folder_list:
    insert_into_tree(folder_tree, full_path, full_path.split("\\"))
shortest_paths = []
get_shortest_paths(folder_tree, shortest_paths)
print(shortest_paths)
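For comparison, the same "drop anything that has an ancestor in the list" idea can be sketched without recursion by putting every path in a set and testing each path's ancestors for membership. This is a different technique from the tree above, not a rewrite of it; the function name `shortest_paths_via_set` is an assumption, and the sample data is a subset of the list in the question.

```python
def shortest_paths_via_set(folders, sep="\\"):
    """Return the folders that have no proper ancestor in `folders`."""
    known = set(folders)
    result = []
    for path in known:
        parts = path.split(sep)
        # Rebuild each proper ancestor ("c:", then "c:" + sep + "temp", ...)
        # and look it up in the set.
        has_ancestor = any(
            sep.join(parts[:i]) in known for i in range(1, len(parts))
        )
        if not has_ancestor:
            result.append(path)
    return result


folders = [
    "c:\\temp\\data\\1122 AA",
    "c:\\temp\\data\\1122 AA\\Div",
    "c:\\temp\\server1\\div\\2244_BB",
    "c:\\temp\\server1\\div\\2244_BB\\der\\dedd",
]
# Set iteration order is arbitrary, so sort before printing.
print(sorted(shortest_paths_via_set(folders)))
```

The work per path depends on how deep it is rather than on how long the list is, so like the tree it avoids sorting; whether it is actually faster for a given list would need measuring.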