Compare files in two directories with Python to find files that are in one directory but not the other, agnostic to subdirectory structure

I am trying to compare our current project media server (dir1) with a backup (dir2) to see which documents were deleted. Both are Windows directories. Many of the files have been shuffled around into new subdirectories but are not missing. Because the directory structure has changed, using recursion and filecmp.dircmp as described in this post won't work: Recursively compare two directories to ensure they have the same files and subdirectories

The other consideration is that different files may have the same file name, so the comparison will need to check file size, modification date, etc. to determine whether two files are the same.

What I want (pseudocode):

def find_missing_files(currentDir, backup):
    <does stuff>
    return <List of Files in backup that are not in currentDir>

What I have:

import os
import filecmp

def build_file_list(someDir, fileList = []):
    for root, dirs, files in os.walk(someDir):
        if files:
            for file in files:
                filePath = os.path.join(root, file)
                if filePath not in fileList:
                    fileList.append(filePath)
    return fileList

def cmp_file_lists(dir1, dir2):
    dir1List = build_file_list(dir1)
    dir2List = build_file_list(dir2)

    for dir2file in dir2List:
        for dir1file in dir1List:
            if filecmp.cmp(dir1file, dir2file):
                dir1List.remove(dir1file)
                dir2List.remove(dir2file)
                break
    return (dir1List, dir2List)

EDIT: In the above code I am having an issue where dir2List.remove(dir2file) throws an error saying that dir2file is not in dir2List because (it appears) somehow both dir2List and dir1List are the same object. I don't know how that is happening.
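
The symptom described in the EDIT is consistent with Python's mutable default argument behaviour: the default `fileList = []` is created once, when the function is defined, so every call that omits the argument appends to the same list object. A minimal sketch of one common way around it, using a `None` default (the names mirror the code above; the function is otherwise unchanged):

```python
import os

def build_file_list(someDir, fileList=None):
    # Create a fresh list on every call instead of sharing one default list
    if fileList is None:
        fileList = []
    for root, dirs, files in os.walk(someDir):
        for file in files:
            filePath = os.path.join(root, file)
            if filePath not in fileList:
                fileList.append(filePath)
    return fileList
```

Separately, cmp_file_lists removes items from dir2List while the outer loop is iterating over it, which can cause elements to be skipped; iterating over a copy such as list(dir2List) avoids that.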

I don't know whether this could be done more easily with filecmp.dircmp and I am just missing it, or whether this is the best approach to achieve what I am looking for. Or should I take each file from dir2 and use os.walk to look for it in dir1?

May I suggest an alternative? Using pathlib and its rglob method, everything is much easier (if you really are agnostic about subdirectories):

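from pathlib import Path

def cmp_file_lists(dir1, dir2):
    # Collect the names of all files (not directories) anywhere under each tree
    dir1_filenames = {f.name for f in Path(dir1).rglob('*') if f.is_file()}
    dir2_filenames = {f.name for f in Path(dir2).rglob('*') if f.is_file()}
    files_in_dir1_but_not_dir2 = dir1_filenames - dir2_filenames
    files_in_dir2_but_not_dir1 = dir2_filenames - dir1_filenames
    # Return the two differences rather than the full name sets
    return files_in_dir1_but_not_dir2, files_in_dir2_but_not_dir1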
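
The comparison above looks at names only, so two different files that share a name count as the same file. Since the question asks for size and modification date to be considered as well, here is a rough sketch that keys each file on (name, size, modification time). The helper `file_key` is hypothetical, and the sketch assumes the backup preserved modification times; if it did not, the mtime would have to be dropped from the key.

```python
from pathlib import Path

def file_key(path):
    # Assumed identity of a file: name, size in bytes, whole-second mtime
    st = path.stat()
    return (path.name, st.st_size, int(st.st_mtime))

def find_missing_files(currentDir, backup):
    # Keys of every file that still exists somewhere under currentDir
    current_keys = {file_key(p) for p in Path(currentDir).rglob('*') if p.is_file()}
    # Files in the backup whose key has no match anywhere in currentDir
    return [p for p in Path(backup).rglob('*')
            if p.is_file() and file_key(p) not in current_keys]
```

Using a set of keys keeps the lookup cheap compared with calling filecmp.cmp on every pair of files. Candidates that match on the key could still be confirmed with filecmp.cmp(a, b, shallow=False) if a byte-for-byte check is wanted.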
