Python: Group files from different directories with same basename
I am struggling with a particularly frustrating task.

I have a set of thousands of files in one directory, say /path/to/file#####.txt. In other directories I have (probably the same number of) files with the same base-names but different suffixes, e.g. /diff/path/to/file#####.txt.foo.
I am trying to group these files together so that I have a list of lists such as
[['/path/to/file#####.txt', '/diff/path/to/file#####.txt.foo',
'/another/path/to/file#####.txt.bar'], ...]
It is likely, but not guaranteed, that there is a corresponding file in each subfolder. In other words, '/path/to/file#####.txt' may exist but '/diff/path/to/file#####.txt.foo' might not, so I need to skip that base-name when this occurs.
My purpose here is to create a file list for synchronized data loading.

How can I efficiently do this?
I ended up with a solution that's fairly efficient, but it does not appear to be the most elegant.

Basically, I first find all file basenames and form a list of lists of all possible sets, e.g.
groups = [['/path/to/file00000.txt', '/diff/path/to/file00000.txt.foo',
'/another/path/to/file00000.txt.bar'],
['/path/to/file00001.txt', '/diff/path/to/file00001.txt.foo',
'/another/path/to/file00001.txt.bar'], ...]
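Building those candidate groups is cheap once the directories and suffixes are known; roughly something like this (the directory/suffix pairs and the count of 100000 base-names are placeholders for my actual layout):

import os.path as op

# Placeholder directory/suffix pairs and basename count.
dir_suffix = [('/path/to', '.txt'),
              ('/diff/path/to', '.txt.foo'),
              ('/another/path/to', '.txt.bar')]

groups = [[op.join(d, 'file%05d%s' % (i, suf)) for d, suf in dir_suffix]
          for i in range(100000)]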
Then I check for the existence of the file with a given basename in each directory using os.path.exists(), like
del_idx = []
for i in xrange(len(groups)):
    for j in xrange(len(groups[i])):
        if not os.path.exists(groups[i][j]):
            del_idx.append(i)
            break  # Because if one doesn't exist, no need to check others
Now that I have a list of indices that are "bad", I just loop through it in reverse to delete them.
for i in xrange(len(del_idx)-1, -1, -1):
    groups.pop(del_idx[i])
This works fine in my case where I only have 3-tuples, but if there are a significant number of paths in each tuple, this would probably break down.

For ~260k files the all-groups construction took ~12 sec, the existence check took ~35 sec, and the deletion took ~12 sec. This is fairly reasonable, but, again, this algorithm is O(m*n) for m files and groups of size n, so it's not ideal if group sizes get large.
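Incidentally, the existence check and the reverse deletion can be folded into a single filtering pass, which avoids the index bookkeeping and the pop() calls entirely (a sketch of what I mean):

# Keep only groups in which every path exists; equivalent to the
# del_idx / pop() passes above.
groups = [g for g in groups if all(os.path.exists(p) for p in g)]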
My proposed solution:
import glob
import os.path as op
from collections import defaultdict

def variable_part(filename, base, ext):
    # Strip the fixed prefix (e.g. 'file') and the dot plus suffix (e.g. 'txt.foo'),
    # keeping only the variable part, e.g. '00000'.
    return filename[len(base):-len(ext)-1]

def func(dirs):
    base = 'file'
    files = defaultdict(list)
    for d in dirs:
        local_files = glob.glob(op.join(d, '*'))
        # Suffix used in this directory (without the leading dot), taken from its first file.
        local_ext = '.'.join(local_files[0].split('.')[1:])
        for f in local_files:
            files[variable_part(op.basename(f), base, local_ext)].append(f)
    return list(files.values())
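If only complete groups are wanted (i.e. base-names missing from one or more directories are skipped, as the question requires), the result can be filtered on group length; the directory list below is just a placeholder:

dirs = ['/path/to', '/diff/path/to', '/another/path/to']  # placeholder paths
groups = func(dirs)
# Keep only base-names present in every directory.
complete = [g for g in groups if len(g) == len(dirs)]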
I haven't profiled it, but my feeling is that it's close to optimal: each filename is processed once, and after the first directory any access to files should already be amortised. Some additional optimisation is definitely possible, especially in the handling of strings.
If the variable part is just an integer from 0 to M-1, and you have N directories, it may be optimal to have a series of M lists X_k of length N; each X_k[i] is set to 1 or 0 according to whether the k-th file exists in the i-th directory. Only then do you produce the final filename list, removing the need for deletions (which, as you may have noticed, is an expensive operation on a list).
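A rough sketch of that idea, assuming the variable part is a zero-padded integer from 0 to M-1, a fixed 'file' prefix, and one known suffix per directory (the directory/suffix pairs and the helper name group_by_index are made up for illustration):

import glob
import os.path as op

def group_by_index(dir_suffix, M):
    # exists[k][i] is True iff the file with index k is present in the i-th directory.
    exists = [[False] * len(dir_suffix) for _ in range(M)]
    names = [[None] * len(dir_suffix) for _ in range(M)]
    for i, (d, suf) in enumerate(dir_suffix):
        for f in glob.glob(op.join(d, 'file*' + suf)):
            k = int(op.basename(f)[len('file'):-len(suf)])
            exists[k][i] = True
            names[k][i] = f
    # Build the final list directly, skipping incomplete groups -- no deletions needed.
    return [names[k] for k in range(M) if all(exists[k])]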
In any case, the minimum complexity of this algorithm is N*M; there is no way to avoid going into each directory and checking all the files. Those 35 sec might be optimized with a single system call per directory to get all the filenames, and then working in memory, but that does not change the overall complexity, i.e. how the algorithm scales.
Edit: I was kinda curious about this, so I made a test. Indeed, working on the filenames retrieved by glob appears to be faster than checking each file for existence (at least on my Mac's HFS+ filesystem, on an SSD).
In [0]: def x():
...: return [os.path.exists('test1/file%06d.txt.gz' % i) for i in range(10000)]
...:
In [1]: def y():
...: ff = glob.glob('test1/*')
...: res = [False]*10000
...: for s in ff:
...: res[int(s[10:16])] = True
...: return res
...:
In [2]: %timeit x()
10 loops, best of 3: 71.2 ms per loop
In [3]: %timeit y()
10 loops, best of 3: 32.6 ms per loop