Diff two folders using Python - having the same set of subfolders and file structures
I am trying to write a function in Python for comparing two folders (with exactly the same subdirectory structure and file lists).
The folders are bound to contain .c, .h, .txt, .cat, .sys and .pdb files, but the main focus is on C and header files.
Calling diff(folder1, folder2) should do the following:
print newly added driver files (.c and .h only) in folder2
print driver files deleted from folder2 as compared to folder1
(this can be done with a two-level for loop that stores the os.walk results in two lists and subsequently subtracts them)
e.g.
folder1\\foo\\bar\\1.c folder2\\foo\\bar\\1.c --> same
folder1\\foo\\car\\2.c folder2\\foo\\car\\2.c --> same
folder1\\foo\\dar\\13.c folder2\\foo\\dar\\13.c --> different -> return flag=1
folder1\\foo\\far\\211.c folder2\\foo\\far\\211.c --> not compared
I have tried to use the os.walk(path) function to do this and store all the files in two separate lists, but I find it extremely long and complicated for multiple files in this location.
Also, if there is a way to ignore Perforce headers, comments and extra spacing in the comparison, it would enhance my script.
Any advice greatly appreciated.
Here's a rough outline of what I would do:
Create a set() of folder1 file names, folder1_set.
os.walk() is your friend: load up folder1 and start to walk the files, adding the file names to folder1_set.
Use os.walk() on folder2 as well; you could keep another set() of file names, folder2_set.
folder1_set - folder2_set would give you any items in folder1 that are not in folder2, and folder2_set - folder1_set would give you the opposite.
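The outline above could be sketched roughly as follows. This is a minimal illustration rather than a finished tool: the helper names relative_files and added_and_deleted are made up for this example, and it assumes that paths should be compared relative to each folder root and that only .c and .h files are of interest.

```python
import os


def relative_files(root, extensions=(".c", ".h")):
    """Return the set of file paths under root (relative to root)
    whose extension is listed in extensions. Hypothetical helper."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in extensions:
                full_path = os.path.join(dirpath, name)
                # Store paths relative to root so both folders are comparable.
                found.add(os.path.relpath(full_path, root))
    return found


def added_and_deleted(folder1, folder2):
    """Return (added, deleted): files only in folder2, files only in folder1."""
    folder1_set = relative_files(folder1)
    folder2_set = relative_files(folder2)
    return sorted(folder2_set - folder1_set), sorted(folder1_set - folder2_set)
```

Storing relative paths instead of bare file names matters here, because the same file name (for example 1.c) may legitimately appear in several subdirectories.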
Edit 1
To handle comments, you could read the file line by line, use .strip() on the line to remove whitespace at either end, then check for the presence of /* at the start of the stripped line.
For single-line comments you would also have to check whether */ is present at the end of the line and, if so, exclude the line. For multi-line comments you could ignore all lines until you encounter */ at the end of a line.
Check out os.path.splitext to easily find the file extensions whilst walking.
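The comment handling described above might look something like the sketch below. The function names significant_lines and same_ignoring_comments are invented for illustration. Two things go beyond what the answer describes and are assumptions: collapsing runs of whitespace inside a line, and skipping lines that start with //. It only recognises comments that begin at the start of a line, so a trailing comment after code is still compared.

```python
def significant_lines(path):
    """Yield normalised, non-empty, non-comment lines of a C source file.

    Hypothetical helper: only comments that start at the beginning of a
    (stripped) line are recognised.
    """
    in_block = False
    with open(path, errors="replace") as handle:
        for raw in handle:
            # Strip both ends and collapse inner whitespace runs (assumption).
            line = " ".join(raw.split())
            if in_block:
                # Inside a multi-line comment: skip until a line ends with */
                if line.endswith("*/"):
                    in_block = False
                continue
            if line.startswith("/*"):
                # Single-line comment if it also ends with */, else a block opens.
                if not line.endswith("*/"):
                    in_block = True
                continue
            if not line or line.startswith("//"):
                continue
            yield line


def same_ignoring_comments(path1, path2):
    """Compare two files after dropping comments and extra spacing."""
    return list(significant_lines(path1)) == list(significant_lines(path2))
```

A Perforce header that sits inside a /* ... */ block at the top of the file would be skipped by the same logic; a header in some other form would need its own rule.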
You should be able to accomplish this using Python's filecmp library.
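One detail that may matter when relying on filecmp: by default its comparisons are "shallow", meaning two files whose os.stat() signatures (type, size, modification time) match are treated as equal without reading them. The sketch below shows how a byte-for-byte comparison of the .c and .h files in a single pair of directories could be requested with filecmp.cmpfiles and shallow=False. The function name differing_sources is hypothetical, and it deliberately handles only one directory level.

```python
import filecmp
import os


def differing_sources(dir1, dir2, extensions=(".c", ".h")):
    """Return names of .c/.h files present in both directories whose
    contents differ. Hypothetical helper, one directory level only."""
    common = [
        name for name in os.listdir(dir1)
        if os.path.splitext(name)[1] in extensions
        and os.path.isfile(os.path.join(dir1, name))
        and os.path.isfile(os.path.join(dir2, name))
    ]
    # shallow=False forces the file contents to be compared.
    _match, mismatch, _errors = filecmp.cmpfiles(dir1, dir2, common, shallow=False)
    return sorted(mismatch)
```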
EDITED ANSWER
Addressing additional comments by @DennisNinj
Thanks. Is there any way to include only .c and .h files in the comparison? I have more than 20 types of files and over 1000 files in each folder. – Dennis Ninj
@DennisNinj Yes, it's possible, just a bit more tricky. The currently published version of filecmp.dircmp does not support wildcard or regex matching for its "ignore" and "hide" filters. (There's been a patch submitted to support wildcards in dircmp.) So it means you have to do the filtering manually.
Here is an updated example that gets you closer to what you're looking to accomplish. ATTENTION: Do note that, due to the requirement to stop method execution once a differing C or header file is found, there's a possibility you won't get the file listings of every "added / deleted driver" available in the compared directories, since it may not have had a chance to traverse all subdirectories.
ccodedircomparison.py
import re
from filecmp import dircmp


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("FLAG = 1")


def diffs_found(dcmp):
    # Only names ending in .c or .h are treated as driver source files.
    c_files_regex = re.compile(r".*\.[ch]$")

    # Entries found only on the left side were deleted from the right side.
    deleted_drivers = []
    for left_only_file in dcmp.left_only:
        if c_files_regex.match(left_only_file):
            deleted_drivers.append(left_only_file)
    if deleted_drivers:
        print("Drivers deleted from {dirname}: [{deleted_drivers_list}]".format(dirname=dcmp.right, deleted_drivers_list=', '.join(deleted_drivers)))

    # Entries found only on the right side were added there.
    added_drivers = []
    for right_only_file in dcmp.right_only:
        if c_files_regex.match(right_only_file):
            added_drivers.append(right_only_file)
    if added_drivers:
        print("Drivers added to {dirname}: [{added_drivers_list}]".format(dirname=dcmp.right, added_drivers_list=', '.join(added_drivers)))

    # Files present on both sides whose content differs.
    differing_c_files = []
    for diff_file in dcmp.diff_files:
        if c_files_regex.match(diff_file):
            differing_c_files.append(diff_file)
    if differing_c_files:
        print("C files whose content differs ({dirname}): [{differing_c_files}]".format(dirname=dcmp.right, differing_c_files=', '.join(differing_c_files)))
        return True

    # Recurse into the common subdirectories and stop at the first one
    # that contains a differing C or header file.
    for sub_dcmp in dcmp.subdirs.values():
        if diffs_found(sub_dcmp):
            return True
    return False


if __name__ == '__main__':
    main()
Example output
(.virtualenvs)macbook:dircomparison joeyoung$ python ccodedircomparison.py
Drivers deleted from /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [thisismissingfromtest2.c]
Drivers added to /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [addedfile1.h]
C files whose content differs (/Users/joeyoung/web/stackoverflow/dircomparison/test2/support): [samefilenamedifftext1.h, samefilename1.c]
FLAG = 1
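If the early exit mentioned in the ATTENTION note is not wanted, one possible variation is to walk every common subdirectory and collect the results before deciding on the flag. The sketch below is an assumption about how that could be done, not part of the script above; the function name collect_diffs and the shape of the returned dictionary are invented for illustration.

```python
import os
from filecmp import dircmp


def collect_diffs(dcmp, extensions=(".c", ".h"), report=None):
    """Recursively gather added, deleted and differing source files
    without stopping at the first difference. Hypothetical helper."""
    if report is None:
        report = {"added": [], "deleted": [], "differing": []}

    def sources(names, base):
        # Keep only entries with a matching extension, prefixed with their directory.
        return [os.path.join(base, name) for name in names
                if os.path.splitext(name)[1] in extensions]

    report["deleted"] += sources(dcmp.left_only, dcmp.left)
    report["added"] += sources(dcmp.right_only, dcmp.right)
    report["differing"] += sources(dcmp.diff_files, dcmp.right)
    for sub_dcmp in dcmp.subdirs.values():
        collect_diffs(sub_dcmp, extensions, report)
    return report
```

With this shape, the flag from the question would simply be 1 when report["differing"] is non-empty, and the added and deleted lists are complete regardless of traversal order.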
Test environment directory structure
(.virtualenvs)macbook:dircomparison joeyoung$ tree test1 test2
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ ├── package.json
│ ├── thisismissingfromtest2.c
│ └── thisismissingfromtest2.txt
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── onlyintest1.txt
│ ├── prepare.db
│ ├── samefilename1.c
│ ├── samefilename1.txt
│ ├── samefilenamedifftext1.h
│ ├── samefilenamedsametext1.h
│ ├── script.sql
│ ├── thisismissingfromtest2.c
│ └── thisismissingfromtest2.txt
├── trace.test.js
└── unicode.test.js
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── addedfile1.h
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── prepare.db
│ ├── samefilename1.c
│ ├── samefilename1.txt
│ ├── samefilenamedifftext1.h
│ ├── samefilenamedsametext1.h
│ └── script.sql
├── trace.test.js
└── unicode.test.js
ORIGINAL ANSWER BEFORE THE EDIT IS BELOW
My example doesn't do exactly what you describe, but there should be enough between this example and the filecmp.dircmp() documentation to get you started.
dircomparison.py
from filecmp import dircmp


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("DIFFS FOUND!")
    else:
        print("NO DIFFS FOUND")


def diffs_found(dcmp):
    # report_full_closure() prints its report itself and returns None,
    # so wrapping it in print() also emits a trailing "None" line.
    if len(dcmp.left_only) > 0:
        print(dcmp.report_full_closure())
        return True
    elif len(dcmp.right_only) > 0:
        print(dcmp.report_full_closure())
        return True
    else:
        # Nothing unique at this level: check the common subdirectories.
        for sub_dcmp in dcmp.subdirs.values():
            if diffs_found(sub_dcmp):
                return True
    return False


if __name__ == '__main__':
    main()
Example output
(.virtualenvs)macbook:dircomparison joeyoung$ python dircomparison.py
diff /Users/joeyoung/web/stackoverflow/dircomparison/test1/support /Users/joeyoung/web/stackoverflow/dircomparison/test2/support
Only in /Users/joeyoung/web/stackoverflow/dircomparison/test1/support : ['onlyintest1.txt']
Identical files : ['createdb.js', 'elmo.png', 'helper.js', 'prepare.db', 'script.sql']
None
DIFFS FOUND!
The actual directory structures, so you can see what my test environment looked like.
(.virtualenvs)macbook:dircomparison joeyoung$ tree test1
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── onlyintest1.txt
│ ├── prepare.db
│ └── script.sql
├── trace.test.js
└── unicode.test.js
2 directories, 33 files
(.virtualenvs)macbook:dircomparison joeyoung$ tree test2
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│ ├── Makefile
│ ├── index.html
│ └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│ ├── createdb.js
│ ├── elmo.png
│ ├── helper.js
│ ├── prepare.db
│ └── script.sql
├── trace.test.js
└── unicode.test.js
2 directories, 32 files