Diff two folders using Python - both having the same set of subfolders and file structure

I am trying to write a Python function that compares two folders with exactly the same subdirectory structure and file lists.

The folders contain .c, .h, .txt, .cat, .sys, and .pdb files, but the main focus is on C source and header files.

The function diff(folder1, folder2) should do the following:

  1. Print newly added driver files (.c and .h only) in folder2.

  2. Print driver files deleted from folder2 as compared to folder1.

(This can be done with a two-level for loop that stores the os.walk results in two lists and then subtracts them.)

  3. My challenge: diff() should compare the files in all subdirectories, and as soon as it finds even one modified .c or .h file in folder2, it should end the diff routine and return FLAG = 1.

For example:

folder1\foo\bar\1.c    folder2\foo\bar\1.c    --> same

folder1\foo\car\2.c    folder2\foo\car\2.c    --> same

folder1\foo\dar\13.c   folder2\foo\dar\13.c   --> different -> return flag=1

folder1\foo\far\211.c  folder2\foo\far\211.c  --> not compared

I have tried using os.walk(path) to do this, storing all the files in two separate lists, but I find it extremely long and complicated for the number of files in this location.

Also, if there is a way to ignore Perforce headers, comments, and extra spacing in the comparison, it would improve my script.

Any advice is greatly appreciated.

Here's a rough outline of what I would do:

  • Create a set() of folder1 file names, folder1_set.
  • os.walk() is your friend: load up folder1 and start walking the files.
  • For each file, open it. Modify the path to point to the second folder, open the second file, and check for equality.
  • Add the file name to folder1_set.
  • Finally, keep track of files that could be missing from folder2. Using os.walk() on folder2, build another set() of file names, folder2_set.
  • folder1_set - folder2_set gives you the items in folder1 that are not in folder2, and folder2_set - folder1_set gives you the opposite.
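
The outline above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name list_c_files, the constant C_EXTS, and the choice of filecmp.cmp for the equality check are assumptions made for illustration, and the set of extensions is taken from the question.

```python
import filecmp
import os

C_EXTS = {".c", ".h"}  # extensions of interest, per the question


def list_c_files(root):
    """Return the set of .c/.h file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in C_EXTS:
                full_path = os.path.join(dirpath, name)
                found.add(os.path.relpath(full_path, root))
    return found


def diff(folder1, folder2):
    """Print added/deleted C files; return 1 at the first modified file, else 0."""
    folder1_set = list_c_files(folder1)
    folder2_set = list_c_files(folder2)
    print("Added in folder2:", sorted(folder2_set - folder1_set))
    print("Deleted from folder2:", sorted(folder1_set - folder2_set))
    # Only files present on both sides can be "modified".
    for rel_path in sorted(folder1_set & folder2_set):
        path1 = os.path.join(folder1, rel_path)
        path2 = os.path.join(folder2, rel_path)
        # shallow=False compares file contents rather than os.stat() data.
        if not filecmp.cmp(path1, path2, shallow=False):
            return 1  # stop at the first difference
    return 0
```

Because the added and deleted lists are computed from the two sets before any file contents are read, the early return does not affect them.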

Edit 1

  • To handle comments, read the file line by line, call .strip() on each line to remove whitespace at either end, then check for /* at the start of the stripped line.
  • For single-line comments, also check whether */ is present at the end of the line and, if so, exclude the line.
  • For multi-line comments, ignore all lines until you come across */ at the end of a line.
  • This also has the benefit of clearing out empty lines (a line containing only whitespace strips to an empty string).
  • Check out os.path.splitext to easily find file extensions while walking.
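
One way to sketch this normalization is shown below. Note that it swaps the line-by-line scan described above for a regular expression, which also catches comments that start in the middle of a line. The function names are hypothetical, and the sketch assumes that comment markers never appear inside string literals. Perforce keyword headers are only ignored if they sit inside comments, which is an assumption about the files in question.

```python
import re

# Matches /* ... */ block comments (possibly spanning lines) and // line comments.
# Assumption: comment markers never occur inside string literals.
_COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def normalize_c_source(text):
    """Return the non-empty lines of text with comments removed and spacing collapsed."""
    without_comments = _COMMENT_RE.sub(" ", text)
    # " ".join(line.split()) strips both ends and collapses inner runs of whitespace.
    lines = (" ".join(line.split()) for line in without_comments.splitlines())
    return [line for line in lines if line]


def same_ignoring_comments(path1, path2):
    """Return True if two C files are equal after normalization."""
    with open(path1, errors="replace") as f1, open(path2, errors="replace") as f2:
        return normalize_c_source(f1.read()) == normalize_c_source(f2.read())
```

A call such as same_ignoring_comments(path1, path2) could stand in for a byte-for-byte comparison when walking the two folders.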

You should be able to accomplish this using Python's filecmp library.
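
As a minimal illustration of the library, filecmp.cmpfiles compares a list of file names across two directories and returns three lists: matches, mismatches, and names that could not be compared. By default filecmp compares os.stat() signatures (file type, size, modification time) first; passing shallow=False forces a comparison of the file contents. The wrapper name compare_common_files below is hypothetical.

```python
import filecmp


def compare_common_files(dir1, dir2, names):
    """Compare the named files in dir1 and dir2 by content.

    Returns (match, mismatch, errors), each a list of file names.
    """
    # shallow=False reads the files instead of trusting os.stat() data alone.
    return filecmp.cmpfiles(dir1, dir2, names, shallow=False)
```

A non-empty mismatch list means at least one of the named files was modified; names that exist on only one side end up in errors.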

EDITED ANSWER

Addressing additional comments by @DennisNinj

Thanks. Is there any way to include only .c and .h files in the comparison? I have more than 20 types of files and over 1000 files in each folder. – Dennis Ninj

@DennisNinj Yes, it's possible, just a bit more tricky. The currently published version of filecmp.dircmp does not support wildcard or regex matching for its "ignore" and "hide" filters. (There's been a patch submitted to support wildcards in dircmp.) So it means you have to do the filtering manually.

Here is an updated example that gets you closer to what you're looking to accomplish. ATTENTION: Because execution must stop as soon as a differing C or header file is found, you may not get the listing of every added or deleted driver in the compared directories, since the function may not have had a chance to traverse all subdirectories.

ccodedircomparison.py

import re
from filecmp import dircmp


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("FLAG = 1")


def diffs_found(dcmp):
    # Only C source (.c) and header (.h) files are of interest.
    c_files_regex = re.compile(r".*\.[ch]$")
    # Files present only in the left directory were deleted from the right one.
    deleted_drivers = [name for name in dcmp.left_only if c_files_regex.match(name)]
    if deleted_drivers:
        print("Drivers deleted from {dirname}: [{deleted_drivers_list}]".format(dirname=dcmp.right, deleted_drivers_list=', '.join(deleted_drivers)))
    # Files present only in the right directory were added to it.
    added_drivers = [name for name in dcmp.right_only if c_files_regex.match(name)]
    if added_drivers:
        print("Drivers added to {dirname}: [{added_drivers_list}]".format(dirname=dcmp.right, added_drivers_list=', '.join(added_drivers)))
    # Files present on both sides whose contents differ.
    differing_c_files = [name for name in dcmp.diff_files if c_files_regex.match(name)]
    if differing_c_files:
        print("C files whose content differs ({dirname}): [{differing_c_files}]".format(dirname=dcmp.right, differing_c_files=', '.join(differing_c_files)))
        return True
    # Recurse into every common subdirectory, stopping at the first one
    # that reports a difference.
    for sub_dcmp in dcmp.subdirs.values():
        if diffs_found(sub_dcmp):
            return True
    return False

if __name__ == '__main__':
    main()

Example output

(.virtualenvs)macbook:dircomparison joeyoung$ python ccodedircomparison.py 
Drivers deleted from /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [thisismissingfromtest2.c]
Drivers added to /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [addedfile1.h]
C files whose content differs (/Users/joeyoung/web/stackoverflow/dircomparison/test2/support): [samefilenamedifftext1.h, samefilename1.c]
FLAG = 1
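
The caveat about incomplete listings can be avoided by separating the two concerns: first collect added and deleted files across the whole tree, then look for the first modified file. The following is a sketch under that assumption; the helper names walk_dcmp, added_and_deleted, and first_modified are hypothetical and not part of filecmp. It relies only on the left_only, right_only, diff_files, and subdirs attributes of dircmp, so like the code above it only descends into subdirectories that exist on both sides.

```python
import os
import re
from filecmp import dircmp

C_FILE_RE = re.compile(r".*\.[ch]$")


def walk_dcmp(dcmp):
    """Yield dcmp itself and every nested subdirectory comparison."""
    yield dcmp
    for sub_dcmp in dcmp.subdirs.values():
        for nested in walk_dcmp(sub_dcmp):
            yield nested


def added_and_deleted(dcmp):
    """Return (added, deleted) lists of C file paths for the whole tree."""
    added, deleted = [], []
    for node in walk_dcmp(dcmp):
        added += [os.path.join(node.right, f) for f in node.right_only if C_FILE_RE.match(f)]
        deleted += [os.path.join(node.left, f) for f in node.left_only if C_FILE_RE.match(f)]
    return added, deleted


def first_modified(dcmp):
    """Return the path of the first differing C file, or None if there is none."""
    for node in walk_dcmp(dcmp):
        for f in node.diff_files:
            if C_FILE_RE.match(f):
                return os.path.join(node.right, f)  # early exit
    return None
```

With these helpers, main() could print both lists from added_and_deleted(dcmp) and then print "FLAG = 1" when first_modified(dcmp) is not None. The listing pass compares only directory entries; file contents are read lazily when diff_files is accessed.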

Test environment directory structure

(.virtualenvs)macbook:dircomparison joeyoung$ tree test1 test2
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   ├── package.json
│   ├── thisismissingfromtest2.c
│   └── thisismissingfromtest2.txt
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── onlyintest1.txt
│   ├── prepare.db
│   ├── samefilename1.c
│   ├── samefilename1.txt
│   ├── samefilenamedifftext1.h
│   ├── samefilenamedsametext1.h
│   ├── script.sql
│   ├── thisismissingfromtest2.c
│   └── thisismissingfromtest2.txt
├── trace.test.js
└── unicode.test.js
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── addedfile1.h
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── prepare.db
│   ├── samefilename1.c
│   ├── samefilename1.txt
│   ├── samefilenamedifftext1.h
│   ├── samefilenamedsametext1.h
│   └── script.sql
├── trace.test.js
└── unicode.test.js

ORIGINAL ANSWER BEFORE THE EDIT IS BELOW

My example doesn't do exactly what you describe, but there should be enough between this example and the filecmp.dircmp() documentation to get you started.

dircomparison.py

from filecmp import dircmp

def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("DIFFS FOUND!")
    else:
        print("NO DIFFS FOUND")


def diffs_found(dcmp):
    # report_full_closure() prints the report itself and returns None, so
    # wrapping it in print() also emits the "None" line seen in the output.
    if len(dcmp.left_only) > 0:
        print(dcmp.report_full_closure())
        return True
    elif len(dcmp.right_only) > 0:
        print(dcmp.report_full_closure())
        return True
    else:
        # No one-sided entries here, so check each common subdirectory.
        for sub_dcmp in dcmp.subdirs.values():
            if diffs_found(sub_dcmp):
                return True
    return False

if __name__ == '__main__':
    main()

Example output

(.virtualenvs)macbook:dircomparison joeyoung$ python dircomparison.py 
diff /Users/joeyoung/web/stackoverflow/dircomparison/test1/support /Users/joeyoung/web/stackoverflow/dircomparison/test2/support
Only in /Users/joeyoung/web/stackoverflow/dircomparison/test1/support : ['onlyintest1.txt']
Identical files : ['createdb.js', 'elmo.png', 'helper.js', 'prepare.db', 'script.sql']
None
DIFFS FOUND!

The actual directory structures, so you can see what my test environment looked like:

(.virtualenvs)macbook:dircomparison joeyoung$ tree test1
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── onlyintest1.txt
│   ├── prepare.db
│   └── script.sql
├── trace.test.js
└── unicode.test.js

2 directories, 33 files
(.virtualenvs)macbook:dircomparison joeyoung$ tree test2
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── prepare.db
│   └── script.sql
├── trace.test.js
└── unicode.test.js

2 directories, 32 files
