Python: Group files from different directories with same basename
I am struggling with a particularly frustrating task.

I have a set of thousands of files in one directory, say /path/to/file#####.txt. In other directories I have (probably the same number of) files with the same base-names but different suffixes, e.g. /diff/path/to/file#####.txt.foo.
I am trying to group these files together so that I have a list of lists such as
[['/path/to/file#####.txt', '/diff/path/to/file#####.txt.foo',
'/another/path/to/file#####.txt.bar'], ...]
It is likely, but not guaranteed, that there is a corresponding file in each subfolder. In other words, '/path/to/file#####.txt' may exist but '/diff/path/to/file#####.txt.foo' might not, so I need to skip that base-name when this occurs.
My purpose here is to create a file list for synchronized data loading.

How can I efficiently do this?
I ended up with a solution that's fairly efficient, but it does not appear to be the most elegant.

Basically, I first find all file basenames and form a list of lists of all possible sets, e.g.
groups = [['/path/to/file00000.txt', '/diff/path/to/file00000.txt.foo',
'/another/path/to/file00000.txt.bar'],
['/path/to/file00001.txt', '/diff/path/to/file00001.txt.foo',
'/another/path/to/file00001.txt.bar'], ...]
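Building those candidate groups is cheap once the directories and suffixes are known; roughly something like this (the directory/suffix pairs and the count of 100000 base-names are placeholders for my actual layout):

import os.path as op

# Placeholder directory/suffix pairs and basename count.
dir_suffix = [('/path/to', '.txt'),
              ('/diff/path/to', '.txt.foo'),
              ('/another/path/to', '.txt.bar')]

groups = [[op.join(d, 'file%05d%s' % (i, suf)) for d, suf in dir_suffix]
          for i in range(100000)]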
Then I check for the existence of the file with a given basename in each directory using os.path.exists(), like
del_idx = []
for i in xrange(len(groups)):
    for j in xrange(len(groups[i])):
        if not os.path.exists(groups[i][j]):
            del_idx.append(i)
            break  # Because if one doesn't exist, no need to check others
Now that I have a list of indices that are "bad", I just loop through it in reverse to delete them.
for i in xrange(len(del_idx)-1, -1, -1):
    groups.pop(del_idx[i])
This works fine in my case where I only have 3-tuples, but if there are a significant number of paths in each tuple, this would probably break down.

For ~260k files the all-groups construction took ~12 sec, the existence check took ~35 sec, and the deletion took ~12 sec. This is fairly reasonable, but, again, this algorithm is O(m*n) for m files and groups of size n, so it's not ideal if group sizes get large.
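Incidentally, the existence check and the reverse deletion can be folded into a single filtering pass, which avoids the index bookkeeping and the pop() calls entirely (a sketch of what I mean):

# Keep only groups in which every path exists; equivalent to the
# del_idx / pop() passes above.
groups = [g for g in groups if all(os.path.exists(p) for p in g)]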
My proposed solution:
import glob
import os.path as op
from collections import defaultdict

def variable_part(filename, base, ext):
    # Strip the fixed prefix (e.g. 'file') and the dot plus suffix (e.g. 'txt.foo'),
    # keeping only the variable part, e.g. '00000'.
    return filename[len(base):-len(ext)-1]

def func(dirs):
    base = 'file'
    files = defaultdict(list)
    for d in dirs:
        local_files = glob.glob(op.join(d, '*'))
        # Suffix used in this directory (without the leading dot), taken from its first file.
        local_ext = '.'.join(local_files[0].split('.')[1:])
        for f in local_files:
            files[variable_part(op.basename(f), base, local_ext)].append(f)
    return list(files.values())
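If only complete groups are wanted (i.e. base-names missing from one or more directories are skipped, as the question requires), the result can be filtered on group length; the directory list below is just a placeholder:

dirs = ['/path/to', '/diff/path/to', '/another/path/to']  # placeholder paths
groups = func(dirs)
# Keep only base-names present in every directory.
complete = [g for g in groups if len(g) == len(dirs)]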
I haven't profiled it, but my feeling is that it's close to optimal: each filename is processed once, and after the first directory any access to files should already be amortised. Some additional optimisation is definitely possible, especially in the handling of strings.
If the variable part is just an integer from 0 to M-1, and you have N directories, it may be optimal to have a series of M lists X_k of length N; each X_k[i] is set to 1 or 0 according to whether the k-th file exists in the i-th directory. Only then do you produce the final filename list, removing the need for deletions (which, as you may have noticed, is an expensive operation on a list).
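A rough sketch of that idea, assuming the variable part is a zero-padded integer from 0 to M-1, a fixed 'file' prefix, and one known suffix per directory (the directory/suffix pairs and the helper name group_by_index are made up for illustration):

import glob
import os.path as op

def group_by_index(dir_suffix, M):
    # exists[k][i] is True iff the file with index k is present in the i-th directory.
    exists = [[False] * len(dir_suffix) for _ in range(M)]
    names = [[None] * len(dir_suffix) for _ in range(M)]
    for i, (d, suf) in enumerate(dir_suffix):
        for f in glob.glob(op.join(d, 'file*' + suf)):
            k = int(op.basename(f)[len('file'):-len(suf)])
            exists[k][i] = True
            names[k][i] = f
    # Build the final list directly, skipping incomplete groups -- no deletions needed.
    return [names[k] for k in range(M) if all(exists[k])]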
In any case, the minimum complexity of this algorithm is N*M; there is no way to avoid going into each directory and checking all the files. Those 35 sec might be optimized with a single system call per directory to get all the filenames, and then working in memory, but that does not change the overall complexity, i.e. how the algorithm scales.
Edit: I was kinda curious about this, so I made a test. Indeed, working on the filenames retrieved by glob appears to be faster than checking each file for existence (at least on my Mac's HFS+ filesystem, on an SSD).
In [0]: def x():
...: return [os.path.exists('test1/file%06d.txt.gz' % i) for i in range(10000)]
...:
In [1]: def y():
...: ff = glob.glob('test1/*')
...: res = [False]*10000
...: for s in ff:
...: res[int(s[10:16])] = True
...: return res
...:
In [2]: %timeit x()
10 loops, best of 3: 71.2 ms per loop
In [3]: %timeit y()
10 loops, best of 3: 32.6 ms per loop