I am trying to write a function in Python that compares two folders (with exactly the same subdirectory structure and file lists).
The folders contain .c, .h, .txt, .cat, .sys and .pdb files, but the main focus is on C and header files.
diff(folder1, folder2) should do the following:
print newly added driver files (.c and .h only) in folder2
print driver files deleted from folder2 as compared to folder1
(this can be done with a two-level for loop that stores the os.walk results in two lists and then subtracts them)
e.g.
folder1\\foo\\bar\\1.c folder2\\foo\\bar\\1.c --> same
folder1\\foo\\car\\2.c folder2\\foo\\car\\2.c --> same
folder1\\foo\\dar\\13.c folder2\\foo\\dar\\13.c --> different -> return flag=1
folder1\\foo\\far\\211.c folder2\\foo\\far\\211.c --> not compared
I have tried to use the os.walk(path) function to do this and store all the files in two separate lists, but I find it extremely long and complicated for the many files in this location.
Also, a way to ignore Perforce headers, comments and extra spacing in the comparison would improve my script.
Any advice is greatly appreciated.
Here's a rough outline of what I would do:
Create a set() of folder1 file names, folder1_set.
os.walk() is your friend: load up folder1 and start to walk the files, adding each one to folder1_set.
Do the same os.walk() on folder2; you could keep another set() of filenames, folder2_set.
folder1_set - folder2_set would give you any items in folder1 that are not in folder2, and vice versa would give you the opposite.
Edit 1
To ignore comments, you could call .strip() on the line to remove whitespace at either end, then check whether /* is present at the start of the stripped line and */ is present at the end of the line, and if so, exclude it. For a comment that spans several lines, you would keep excluding lines until you reach one with */ at the end of the line.
Use os.path.splitext to easily find the file extensions whilst walking.

You should be able to accomplish this using Python's filecmp library.
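For comparison with the filecmp approach, the set-based outline from the first answer could look roughly like the sketch below. It assumes Python 3 and that only .c and .h files matter; the names driver_files, added_and_deleted and DRIVER_EXTENSIONS are made up for illustration. Paths are stored relative to each root so that folder1\foo\bar\1.c and folder2\foo\bar\1.c count as the same entry.

```python
import os

# Assumption: only C sources and headers are of interest, as in the question.
DRIVER_EXTENSIONS = {".c", ".h"}


def driver_files(root):
    """Return the set of .c/.h file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in DRIVER_EXTENSIONS:
                full_path = os.path.join(dirpath, name)
                found.add(os.path.relpath(full_path, root))
    return found


def added_and_deleted(folder1, folder2):
    """Return (added, deleted) driver files of folder2 compared to folder1."""
    folder1_set = driver_files(folder1)
    folder2_set = driver_files(folder2)
    added = sorted(folder2_set - folder1_set)    # only in folder2
    deleted = sorted(folder1_set - folder2_set)  # only in folder1
    return added, deleted
```

Calling added_and_deleted("folder1", "folder2") returns two sorted lists of relative paths that can be printed directly. This covers only the added/deleted part; it does not compare file contents.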
EDITED ANSWER
Addressing additional comments by @DennisNinj
Thanks. Is there any way to include .c and .h for file comparison? I have more than 20 types of files and over 1000 files in each folder. – Dennis Ninj
@DennisNinj Yes, it's possible, just a bit more tricky. The currently published version of filecmp.dircmp does not support wildcard or regex matching for its "ignore" and "hide" filters. (There's been a patch submitted to support wildcards in dircmp.) So it means you have to do the filtering manually.
Here is an updated example that gets you closer to what you're looking to accomplish. ATTENTION: because the method must stop executing once a differing C or header file is found, you may not get the listing of every added or deleted driver file in the compared directories, since it may not have had a chance to traverse all subdirectories.
ccodedircomparison.py
import re
from filecmp import dircmp


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("FLAG = 1")


def diffs_found(dcmp):
    # Only C sources and headers are of interest.
    c_files_regex = re.compile(r".*\.[ch]$")

    # Files that exist only in the left (first) directory are missing from the right one.
    deleted_drivers = [name for name in dcmp.left_only if c_files_regex.match(name)]
    if deleted_drivers:
        print("Drivers deleted from {dirname}: [{deleted_drivers_list}]".format(dirname=dcmp.right, deleted_drivers_list=', '.join(deleted_drivers)))

    # Files that exist only in the right (second) directory were added to it.
    added_drivers = [name for name in dcmp.right_only if c_files_regex.match(name)]
    if added_drivers:
        print("Drivers added to {dirname}: [{added_drivers_list}]".format(dirname=dcmp.right, added_drivers_list=', '.join(added_drivers)))

    # Stop as soon as a directory contains a C or header file whose content differs.
    differing_c_files = [name for name in dcmp.diff_files if c_files_regex.match(name)]
    if differing_c_files:
        print("C files whose content differs ({dirname}): [{differing_c_files}]".format(dirname=dcmp.right, differing_c_files=', '.join(differing_c_files)))
        return True

    # Recurse into each common subdirectory, continuing until one reports a difference.
    for sub_dcmp in dcmp.subdirs.values():
        if diffs_found(sub_dcmp):
            return True
    return False


if __name__ == '__main__':
    main()
Example output
(.virtualenvs)macbook:dircomparison joeyoung$ python ccodedircomparison.py
Drivers deleted from /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [thisismissingfromtest2.c]
Drivers added to /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [addedfile1.h]
C files whose content differs (/Users/joeyoung/web/stackoverflow/dircomparison/test2/support): [samefilenamedifftext1.h, samefilename1.c]
FLAG = 1
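The differing-file list above comes from dircmp's own comparison, so a file whose only change is a comment or some spacing still counts as different. To ignore those, as the question asks, each candidate pair could be compared after normalization. The sketch below assumes Python 3, and assumes that Perforce keyword headers live inside comments and that comment markers do not occur inside string literals; it is a regex-based approximation, not a C parser. The names normalized_tokens and same_code are hypothetical.

```python
import re

# Matches /* ... */ block comments (possibly spanning lines) and // line comments.
# Assumption: "/*" or "//" inside string literals is rare enough to ignore here.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def normalized_tokens(source_text):
    """Return the whitespace-separated tokens left after removing comments."""
    without_comments = COMMENT_RE.sub(" ", source_text)
    # split() with no arguments drops all runs of spaces, tabs and newlines.
    return without_comments.split()


def same_code(path1, path2):
    """Return True if two files match once comments and spacing are ignored."""
    with open(path1, errors="replace") as first:
        first_tokens = normalized_tokens(first.read())
    with open(path2, errors="replace") as second:
        second_tokens = normalized_tokens(second.read())
    return first_tokens == second_tokens
```

One way to use it would be to call same_code() on each name in dcmp.diff_files (joined onto dcmp.left and dcmp.right) and raise the flag only when it returns False. Note that it splits on whitespace only, so "a+b" and "a + b" are still treated as different.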
Test environment directory structure
(.virtualenvs)macbook:dircomparison joeyoung$ tree test1 test2
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ ├── package.json
│ ├── thisismissingfromtest2.c
│ └── thisismissingfromtest2.txt
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── onlyintest1.txt
│ ├── prepare.db
│ ├── samefilename1.c
│ ├── samefilename1.txt
│ ├── samefilenamedifftext1.h
│ ├── samefilenamedsametext1.h
│ ├── script.sql
│ ├── thisismissingfromtest2.c
│ └── thisismissingfromtest2.txt
├── trace.test.js
└── unicode.test.js
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── addedfile1.h
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── prepare.db
│ ├── samefilename1.c
│ ├── samefilename1.txt
│ ├── samefilenamedifftext1.h
│ ├── samefilenamedsametext1.h
│ └── script.sql
├── trace.test.js
└── unicode.test.js
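If the early stop noted in the caveat above is not actually required, a variant could walk every subdirectory and collect all results before deciding on the flag. This is only a sketch under that assumption, written for Python 3; collect_diffs, is_driver_file and the report dictionary are hypothetical names, while left_only, right_only, diff_files and subdirs are the dircmp attributes already used above.

```python
import os
from filecmp import dircmp

# Assumption: only C sources and headers are of interest.
DRIVER_EXTENSIONS = (".c", ".h")


def is_driver_file(name):
    """Return True for names ending in .c or .h (case-insensitive)."""
    return name.lower().endswith(DRIVER_EXTENSIONS)


def collect_diffs(dcmp, prefix="", report=None):
    """Walk the whole dircmp tree and gather added, deleted and changed .c/.h files."""
    if report is None:
        report = {"added": [], "deleted": [], "changed": []}
    pairs = (
        ("deleted", dcmp.left_only),   # only in the first directory
        ("added", dcmp.right_only),    # only in the second directory
        ("changed", dcmp.diff_files),  # in both, but contents differ
    )
    for key, names in pairs:
        for name in names:
            if is_driver_file(name):
                report[key].append(os.path.join(prefix, name))
    # No early return: every common subdirectory is visited.
    for sub_name, sub_dcmp in dcmp.subdirs.items():
        collect_diffs(sub_dcmp, os.path.join(prefix, sub_name), report)
    return report
```

With report = collect_diffs(dircmp(dir1, dir2)), the flag could be set to 1 whenever report["changed"] is non-empty, and the added and deleted lists are complete for all common subdirectories. Files inside a directory that exists on only one side are not listed, which should not matter if both folders share the same subdirectory structure.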
ORIGINAL ANSWER BEFORE THE EDIT IS BELOW
My example doesn't do exactly what you describe, but there should be enough between this example and the filecmp.dircmp() documentation to get you started.
dircomparison.py
from filecmp import dircmp


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("DIFFS FOUND!")
    else:
        print("NO DIFFS FOUND")


def diffs_found(dcmp):
    # report_full_closure() prints its report itself and returns None,
    # so wrapping it in print() also emits a trailing "None" line.
    if len(dcmp.left_only) > 0:
        print(dcmp.report_full_closure())
        return True
    elif len(dcmp.right_only) > 0:
        print(dcmp.report_full_closure())
        return True
    else:
        # Nothing unique at this level: check each common subdirectory.
        for sub_dcmp in dcmp.subdirs.values():
            if diffs_found(sub_dcmp):
                return True
    return False


if __name__ == '__main__':
    main()
Example output
(.virtualenvs)macbook:dircomparison joeyoung$ python dircomparison.py
diff /Users/joeyoung/web/stackoverflow/dircomparison/test1/support /Users/joeyoung/web/stackoverflow/dircomparison/test2/support
Only in /Users/joeyoung/web/stackoverflow/dircomparison/test1/support : ['onlyintest1.txt']
Identical files : ['createdb.js', 'elmo.png', 'helper.js', 'prepare.db', 'script.sql']
None
DIFFS FOUND!
The actual directory structures
So you can see what my test environment looked like.
(.virtualenvs)macbook:dircomparison joeyoung$ tree test1
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── onlyintest1.txt
│ ├── prepare.db
│ └── script.sql
├── trace.test.js
└── unicode.test.js
2 directories, 33 files
(.virtualenvs)macbook:dircomparison joeyoung$ tree test2
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── prepare.db
│ └── script.sql
├── trace.test.js
└── unicode.test.js
2 directories, 32 files