
Compare files in two directories with Python to find files that are in one directory but not the other, agnostic to subdirectory structure

Trying to compare our current project media server (dir1) with a backup (dir2) to see which documents were deleted. Both are Windows directories. Many of the files have been shuffled around into new sub-directories but are not missing. Because the directory structure has changed, using recursion and filecmp.dircmp as in this post won't work: Recursively compare two directories to ensure they have the same files and subdirectories

The other consideration is that different files can have the same file name, so the comparison will need to check file size, modification date, etc. to determine whether two files are the same.

What I want, in pseudocode:

def find_missing_files(currentDir, backup):
    <does stuff>
    return <List of Files in backup that are not in currentDir>

What I have:

def build_file_list(someDir, fileList = []):
    for root, dirs, files in os.walk(someDir):
        if files:
            for file in files:
                filePath = os.path.join(root, file)
                if filePath not in fileList:
                    fileList.append(filePath)
    return fileList

def cmp_file_lists(dir1, dir2):
    dir1List = build_file_list(dir1)
    dir2List = build_file_list(dir2)

    for dir2file in dir2List:
        for dir1file in dir1List:
            if filecmp.cmp(dir1file, dir2file):
                dir1List.remove(dir1file)
                dir2List.remove(dir2file)
                break
    return (dir1List, dir2List)

EDIT: in the above code I am having an issue where dir2List.remove(dir2file) throws an error saying dir2file is not in dir2List because (it appears) somehow dir2List and dir1List are the same object. Dunno how that is happening.

I don't know if this could be done more easily with filecmp.dircmp and I am just missing it, or if this is the best approach to achieve what I am looking for. ...Or should I take each file from dir2 and use os.walk to look for it in dir1?
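On the EDIT: the shared-list symptom is consistent with Python's mutable default argument behaviour. The default `fileList = []` is created once, when the function is defined, so both calls to build_file_list append to and return the same list object. Below is a minimal sketch of the effect and of the usual `None` sentinel fix. `build_file_list_shared` is a hypothetical name used only for contrast, and the temporary directories are stand-ins for dir1 and dir2.

```python
import os
import tempfile


def build_file_list_shared(some_dir, file_list=[]):
    # The default list is created once, at function definition time,
    # so every call that omits file_list appends to the same object.
    for root, _dirs, files in os.walk(some_dir):
        for name in files:
            file_list.append(os.path.join(root, name))
    return file_list


def build_file_list(some_dir, file_list=None):
    # Using None as a sentinel gives each call its own fresh list.
    if file_list is None:
        file_list = []
    for root, _dirs, files in os.walk(some_dir):
        for name in files:
            file_list.append(os.path.join(root, name))
    return file_list


with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    # One empty file per directory is enough to show the effect.
    open(os.path.join(d1, "a.txt"), "w").close()
    open(os.path.join(d2, "b.txt"), "w").close()

    # Both calls hand back the very same list object.
    shared_same = build_file_list_shared(d1) is build_file_list_shared(d2)
    # With the sentinel, each call returns a distinct list.
    fresh_same = build_file_list(d1) is build_file_list(d2)

print(shared_same, fresh_same)
```

Separately, removing items from dir2List while iterating over it makes the loop skip elements; iterating over a copy (for example `list(dir2List)`) avoids that.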

May I suggest an alternative? Using pathlib and its rglob method, everything is much easier (if you really are agnostic about subdirectories):

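from pathlib import Path

def cmp_file_lists(dir1, dir2):
    # Collect bare file names (directories skipped), ignoring where each file sits in the tree
    dir1_filenames = {f.name for f in Path(dir1).rglob('*') if f.is_file()}
    dir2_filenames = {f.name for f in Path(dir2).rglob('*') if f.is_file()}
    files_in_dir1_but_not_dir2 = dir1_filenames - dir2_filenames
    files_in_dir2_but_not_dir1 = dir2_filenames - dir1_filenames
    # Return the two differences rather than the full name sets
    return files_in_dir1_but_not_dir2, files_in_dir2_but_not_dir1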
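One caveat with comparing names only: the question says different files can share a file name, and a file that was moved and also renamed would be reported as missing. Here is a rough sketch of a content-based variant, assuming that "the same file" means identical bytes (modification times are ignored because a backup copy may not preserve them). `file_digest` and the size pre-filter are illustrative choices, not part of the answer above; `find_missing_files` follows the signature from the question's pseudocode.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    # Hash the file in chunks so large media files are not read into memory at once.
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_missing_files(current_dir, backup_dir):
    """Return files under backup_dir whose content appears nowhere under current_dir."""
    # Group the current files by size; a backup file whose size never
    # occurs in current_dir cannot have a match, so it needs no hashing.
    current_by_size = {}
    for p in Path(current_dir).rglob("*"):
        if p.is_file():
            current_by_size.setdefault(p.stat().st_size, []).append(p)

    digests_by_size = {}  # filled lazily, only for sizes that are actually needed
    missing = []
    for p in sorted(Path(backup_dir).rglob("*")):
        if not p.is_file():
            continue
        size = p.stat().st_size
        if size not in current_by_size:
            missing.append(p)
            continue
        if size not in digests_by_size:
            digests_by_size[size] = {file_digest(c) for c in current_by_size[size]}
        if file_digest(p) not in digests_by_size[size]:
            missing.append(p)
    return missing
```

Calling `find_missing_files(dir1, dir2)` with dir1 as the live server and dir2 as the backup returns the backup paths that have no byte-identical counterpart anywhere under dir1, regardless of file name or sub-directory.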


 