Python: Group files from different directories with same basename

I am struggling with a particularly frustrating task.

I have a set of thousands of files in one directory, say /path/to/file#####.txt. In other directories I have (probably the same number of) files with the same base-names but different suffixes, e.g. /diff/path/to/file#####.txt.foo.

I am trying to group these files together so that I have a list of lists like:

[['/path/to/file#####.txt', '/diff/path/to/file#####.txt.foo', 
  '/another/path/to/file#####.txt.bar'], ...]

It is likely, but not guaranteed, that there is a corresponding file in each subfolder. In other words, '/path/to/file#####.txt' may exist but '/diff/path/to/file#####.txt.foo' might not, so I need to skip that base-name when this occurs.

My purpose here is to create a file list for synchronized data loading.

How can I efficiently do this?

I ended up with a solution that's fairly efficient, but it does not appear to be the most elegant.

Basically, I first find all file basenames and form a list of lists of all possible sets, e.g.

groups = [['/path/to/file00000.txt', '/diff/path/to/file00000.txt.foo', 
  '/another/path/to/file00000.txt.bar'],
  ['/path/to/file00001.txt', '/diff/path/to/file00001.txt.foo', 
  '/another/path/to/file00001.txt.bar'], ...]
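
For example, a minimal sketch of how this construction step might look, assuming the directories, suffixes, and file count below (all hypothetical):

import os.path as op

# Hypothetical layout: each directory appends its own extra suffix
dirs_and_suffixes = [('/path/to', ''),
                     ('/diff/path/to', '.foo'),
                     ('/another/path/to', '.bar')]
num_files = 260000  # roughly the file count mentioned below

groups = [[op.join(d, 'file%05d.txt%s' % (i, sfx))
           for d, sfx in dirs_and_suffixes]
          for i in range(num_files)]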

Then I check for the existence of the file with a given basename in each directory using os.path.exists(), like:

import os.path

del_idx = []
for i in xrange(len(groups)):
    for j in xrange(len(groups[i])):
        if not os.path.exists(groups[i][j]):
            del_idx.append(i)
            break  # If one file doesn't exist, no need to check the others

Now that I have a list of indices that are "bad", I just loop through it in reverse to delete them.

for i in xrange(len(del_idx)-1,-1,-1):
    groups.pop(del_idx[i])
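
An alternative sketch: since pop() on a list shifts every later element, a one-pass rebuild that filters out the bad indices avoids the repeated shifting:

bad = set(del_idx)
groups = [g for i, g in enumerate(groups) if i not in bad]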

This works fine in my case where I only have 3-tuples, but if there were a significant number of paths in each tuple, this would probably break down.

For ~260k files the all-groups construction took ~12 sec, the existence check took ~35 sec, and the deletion took ~12 sec. This is fairly reasonable, but, again, this algorithm is O(m*n) for m files and groups of size n, so it's not ideal if group sizes get large.

My proposed solution:

import glob
import os.path as op
from collections import defaultdict

def variable_part(fname, base, ext):
    # Extract the varying middle of the basename,
    # e.g. '00000' from 'file00000.txt'
    return fname[len(base):-len(ext)-1]

def func(dirs):
    base = 'file'
    files = defaultdict(list)
    for d in dirs:
        local_files = glob.glob(op.join(d, '*'))
        # Infer this directory's suffix from its first file,
        # e.g. 'txt.foo' from 'file00000.txt.foo'
        local_ext = '.'.join(op.basename(local_files[0]).split('.')[1:])
        for f in local_files:
            files[variable_part(op.basename(f), base, local_ext)].append(f)
    # Keep only the groups that have a file in every directory
    return [g for g in files.values() if len(g) == len(dirs)]

I haven't profiled it; my feeling, however, is that it's close to optimal: each filename is processed once, and after the first directory any access to files should already be amortised. Some additional optimisation is definitely possible, especially in the handling of strings.
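
For example, it might be invoked like this (the paths here are placeholders):

dirs = ['/path/to', '/diff/path/to', '/another/path/to']
groups = func(dirs)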

If the variable part is just integers from 0 to M-1, and you have N directories, it may be optimal to keep a series of M lists X_k of length N; each X_k[i] is set to 1 or 0 according to whether the file file<k>.xxx exists in the i-th directory. Only then do you produce the final filename list, removing the need for deletions (which, as you may have noticed, is an expensive operation for a list).
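
A minimal sketch of that idea, with hypothetical values for M, the directories, and the suffixes:

import os.path as op

M = 100000  # hypothetical number of base names
dirs = ['/path/to', '/diff/path/to', '/another/path/to']
suffixes = ['.txt', '.txt.foo', '.txt.bar']

# X[k][i] is 1 iff base name 'file<k>' exists (with its suffix) in the i-th directory
X = [[1 if op.exists(op.join(d, 'file%05d%s' % (k, sfx))) else 0
      for d, sfx in zip(dirs, suffixes)]
     for k in range(M)]

# Build the final list directly; no deletions needed
groups = [[op.join(d, 'file%05d%s' % (k, sfx))
           for d, sfx in zip(dirs, suffixes)]
          for k in range(M) if all(X[k])]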

In any case, the minimum complexity for this algorithm is N*M; there is no way to avoid going into each directory and checking all the files. Those 35 sec may be optimized by using a single system call per directory to retrieve its whole listing and then working in memory, but that does not change the overall complexity, i.e. how the algorithm scales.
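
A sketch of that optimization, again with placeholder paths: list each directory once, then do the per-file checks against an in-memory set:

import os

dirs = ['/path/to', '/diff/path/to', '/another/path/to']
suffixes = ['.txt', '.txt.foo', '.txt.bar']

# One listdir() call per directory instead of one exists() call per file
names = [set(os.listdir(d)) for d in dirs]

def group_exists(k):
    # True iff base name 'file<k>' is present in every directory
    return all('file%05d%s' % (k, sfx) in s
               for s, sfx in zip(names, suffixes))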

Edit: I was kinda curious about this, so I made a test. Indeed, working on the filenames retrieved by glob seems faster than checking each file for existence (at least on my Mac HFS+ filesystem, on an SSD).

In [0]: def x():
     ...:     return [os.path.exists('test1/file%06d.txt.gz' % i) for i in range(10000)]
     ...:

In [1]: def y():
     ...:     ff = glob.glob('test1/*')
     ...:     res = [False]*10000
     ...:     for s in ff:
     ...:         res[int(s[10:16])] = True
     ...:     return res
     ...:

In [2]: %timeit x()
10 loops, best of 3: 71.2 ms per loop

In [3]: %timeit y()
10 loops, best of 3: 32.6 ms per loop
