Diff two folders using Python - having the same set of subfolders and file structure

Question

I am trying to write a function in Python that compares two folders with exactly the same subdirectory structure and file lists.

The folders are bound to contain .c, .h, .txt, .cat, .sys and .pdb files, but the main focus is on C and header files.

A call to diff(folder1, folder2) should produce the following:

print the driver files (.c and .h only) newly added in folder2
print the driver files deleted from folder2 as compared to folder1

(This can be done with a two-level for loop that stores the os.walk results in two lists and then subtracts one from the other.)

My challenge: the diff() function should start comparing files in all subdirectories, and as soon as it finds even one modified file (.c or .h) in folder2, it should end the diff routine and return FLAG = 1.

e.g.

folder1\\foo\\bar\\1.c folder2\\foo\\bar\\1.c --> same

folder1\\foo\\car\\2.c folder2\\foo\\car\\2.c --> same

folder1\\foo\\dar\\13.c folder2\\foo\\dar\\13.c --> different -> return flag=1

folder1\\foo\\far\\211.c folder2\\foo\\far\\211.c --> not compared

I have tried to use the os.walk(path) function to do this and store all the files in two separate lists, but I find it extremely long and complicated for the many files in this location.

Also, if there is a way to ignore Perforce headers, comments and extra spacing in the comparison, it would enhance my script.

Any advice is greatly appreciated.

Answer 1

Here's a rough outline of what I would do:

Create a set() of folder1 file names, folder1_set.
os.walk() is your friend: load up folder1 and start to walk the files.
For each file, open it. Modify the path to point to the second folder, open the second file and check for equality.
Add the file name to folder1_set.
Finally, we have to keep track of files that could be missing in folder2. Using os.walk() on folder2, you could keep another set() of file names, folder2_set.
folder1_set - folder2_set gives you any items that are in folder1 but not in folder2, and folder2_set - folder1_set gives you the opposite.
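
The outline above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names diff_folders and list_sources are hypothetical, only .c and .h files are considered, file names are stored as paths relative to each folder, and contents are compared byte for byte with filecmp.cmp(..., shallow=False).

```python
import filecmp
import os


def list_sources(root, extensions=(".c", ".h")):
    """Return the set of relative paths under root that have one of the extensions."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in extensions:
                full_path = os.path.join(dirpath, name)
                found.add(os.path.relpath(full_path, root))
    return found


def diff_folders(folder1, folder2):
    """Hypothetical helper: return (added, deleted, flag) for .c/.h files."""
    folder1_set = list_sources(folder1)
    folder2_set = list_sources(folder2)
    added = sorted(folder2_set - folder1_set)    # only in folder2
    deleted = sorted(folder1_set - folder2_set)  # only in folder1
    flag = 0
    for rel_path in sorted(folder1_set & folder2_set):
        path1 = os.path.join(folder1, rel_path)
        path2 = os.path.join(folder2, rel_path)
        # shallow=False forces a content comparison instead of trusting os.stat().
        if not filecmp.cmp(path1, path2, shallow=False):
            flag = 1
            break  # stop at the first modified file; the rest are not compared
    return added, deleted, flag
```

Building both sets before comparing contents means the added and deleted lists are complete even when the content check stops early.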

Edit 1

To account for comments, you can read the file line by line, calling .strip() on each line to remove whitespace at either end, then check for the presence of /* at the start of the stripped line.
For single-line comments you must also check whether */ is present at the end of the line and, if so, exclude the line.
For multi-line comments you can ignore all lines until you come across */ at the end of a line.
This also has the benefit of clearing out empty lines (a line containing only whitespace strips down to an empty string).
Check out os.path.splitext to easily find the file extensions while walking.
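
One possible way to normalize a C file before comparing is sketched below. Note that it swaps the line-by-line scan described above for two regular expressions applied to the whole file, and that the names normalized_lines and same_ignoring_comments are hypothetical. It assumes that any Perforce header sits inside a comment, and it does not handle comment markers that appear inside string literals.

```python
import re

# Assumption: regexes are good enough here; "/*" or "//" inside a string
# literal would be treated as a comment start, which a real C lexer avoids.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def normalized_lines(text):
    """Return the non-empty lines of C source with comments removed
    and runs of whitespace collapsed to single spaces."""
    text = _BLOCK_COMMENT.sub(" ", text)
    text = _LINE_COMMENT.sub("", text)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def same_ignoring_comments(path1, path2):
    """Hypothetical helper: compare two files after normalization."""
    with open(path1) as file1, open(path2) as file2:
        return normalized_lines(file1.read()) == normalized_lines(file2.read())
```

Two files that differ only in comments, blank lines or spacing then compare as equal, while any change to the remaining code is still detected.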

Answer 2

You should be able to accomplish this using Python's filecmp library.

EDITED ANSWER

Addressing additional comments by @DennisNinj

Thanks. Is there any way to include only .c and .h files in the comparison? I have more than 20 types of files and over 1000 files in each folder. - Dennis Ninj

@DennisNinj Yes, it's possible, just a bit more tricky. The currently published version of filecmp.dircmp does not support wildcard or regex matching for its "ignore" and "hide" filters. (A patch has been submitted to support wildcards in dircmp.) This means you have to do the filtering manually.

Here is an updated example that gets you closer to what you're looking to accomplish. ATTENTION: because execution must stop as soon as a differing C or header file is found, you may not get the listing of every added or deleted driver in the compared directories, since the function may not have had a chance to traverse all subdirectories.

ccodedircomparison.py

import re
from filecmp import dircmp

# Matches C source (.c) and header (.h) file names only.
C_FILES_REGEX = re.compile(r".*\.[ch]$")


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("FLAG = 1")


def diffs_found(dcmp):
    # Report .c/.h files that exist only in the left (old) directory.
    deleted_drivers = [f for f in dcmp.left_only if C_FILES_REGEX.match(f)]
    if deleted_drivers:
        print("Drivers deleted from {dirname}: [{deleted_drivers_list}]".format(dirname=dcmp.right, deleted_drivers_list=', '.join(deleted_drivers)))
    # Report .c/.h files that exist only in the right (new) directory.
    added_drivers = [f for f in dcmp.right_only if C_FILES_REGEX.match(f)]
    if added_drivers:
        print("Drivers added to {dirname}: [{added_drivers_list}]".format(dirname=dcmp.right, added_drivers_list=', '.join(added_drivers)))
    # Stop as soon as a .c/.h file present on both sides differs in content.
    differing_c_files = [f for f in dcmp.diff_files if C_FILES_REGEX.match(f)]
    if differing_c_files:
        print("C files whose content differs ({dirname}): [{differing_c_files}]".format(dirname=dcmp.right, differing_c_files=', '.join(differing_c_files)))
        return True
    # Recurse into every common subdirectory, not just the first one.
    for sub_dcmp in dcmp.subdirs.values():
        if diffs_found(sub_dcmp):
            return True
    return False

if __name__ == '__main__':
    main()

Example output

(.virtualenvs)macbook:dircomparison joeyoung$ python ccodedircomparison.py 
Drivers deleted from /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [thisismissingfromtest2.c]
Drivers added to /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [addedfile1.h]
C files whose content differs (/Users/joeyoung/web/stackoverflow/dircomparison/test2/support): [samefilenamedifftext1.h, samefilename1.c]
FLAG = 1

Test environment directory structure

(.virtualenvs)macbook:dircomparison joeyoung$ tree test1 test2
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   ├── package.json
│   ├── thisismissingfromtest2.c
│   └── thisismissingfromtest2.txt
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── onlyintest1.txt
│   ├── prepare.db
│   ├── samefilename1.c
│   ├── samefilename1.txt
│   ├── samefilenamedifftext1.h
│   ├── samefilenamedsametext1.h
│   ├── script.sql
│   ├── thisismissingfromtest2.c
│   └── thisismissingfromtest2.txt
├── trace.test.js
└── unicode.test.js
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── addedfile1.h
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── prepare.db
│   ├── samefilename1.c
│   ├── samefilename1.txt
│   ├── samefilenamedifftext1.h
│   ├── samefilenamedsametext1.h
│   └── script.sql
├── trace.test.js
└── unicode.test.js
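
The caveat noted before the script (added or deleted drivers may go unreported once a differing file stops the traversal) could be avoided by splitting the work into two passes. The following is a minimal sketch under stated assumptions: the names is_c_file, walk_dircmp and full_report are hypothetical, and the content check relies on dircmp.diff_files, which uses filecmp's default shallow comparison (files whose os.stat() signatures match are treated as equal without reading them).

```python
import os
from filecmp import dircmp

C_EXTENSIONS = (".c", ".h")


def is_c_file(name):
    """True for C source and header file names."""
    return os.path.splitext(name)[1] in C_EXTENSIONS


def walk_dircmp(dcmp):
    """Yield dcmp and every nested dircmp for the common subdirectories."""
    yield dcmp
    for sub_dcmp in dcmp.subdirs.values():
        for nested in walk_dircmp(sub_dcmp):
            yield nested


def full_report(left, right):
    """Hypothetical helper: return (added, deleted, flag) for two trees."""
    nodes = list(walk_dircmp(dircmp(left, right)))
    # Pass 1: directory listings only, so the whole tree is always covered.
    added = sorted(os.path.join(n.right, f)
                   for n in nodes for f in n.right_only if is_c_file(f))
    deleted = sorted(os.path.join(n.left, f)
                     for n in nodes for f in n.left_only if is_c_file(f))
    # Pass 2: content comparison, stopping at the first differing .c/.h file.
    flag = 0
    for node in nodes:
        if any(is_c_file(f) for f in node.diff_files):
            flag = 1
            break
    return added, deleted, flag
```

Because dircmp computes left_only, right_only and diff_files lazily, the first pass only lists directories, and file contents are examined only for the directories visited before the early exit.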

ORIGINAL ANSWER BEFORE THE EDIT IS BELOW

My example doesn't do exactly what you describe, but there should be enough between this example and the filecmp.dircmp() documentation to get you started.

dircomparison.py

from filecmp import dircmp

def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("DIFFS FOUND!")
    else:
        print("NO DIFFS FOUND")


def diffs_found(dcmp):
    # Any file present on only one side counts as a difference.
    if dcmp.left_only or dcmp.right_only:
        # report_full_closure() prints its own report and returns None,
        # so it is called directly rather than passed to print.
        dcmp.report_full_closure()
        return True
    # Otherwise recurse into the common subdirectories.
    for sub_dcmp in dcmp.subdirs.values():
        if diffs_found(sub_dcmp):
            return True
    return False

if __name__ == '__main__':
    main()

Example output

(.virtualenvs)macbook:dircomparison joeyoung$ python dircomparison.py
diff /Users/joeyoung/web/stackoverflow/dircomparison/test1/support /Users/joeyoung/web/stackoverflow/dircomparison/test2/support
Only in /Users/joeyoung/web/stackoverflow/dircomparison/test1/support : ['onlyintest1.txt']
Identical files : ['createdb.js', 'elmo.png', 'helper.js', 'prepare.db', 'script.sql']
DIFFS FOUND!

The actual directory structures, so you can see what my test environment looked like:

(.virtualenvs)macbook:dircomparison joeyoung$ tree test1
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── onlyintest1.txt
│   ├── prepare.db
│   └── script.sql
├── trace.test.js
└── unicode.test.js

2 directories, 33 files
(.virtualenvs)macbook:dircomparison joeyoung$ tree test2
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── prepare.db
│   └── script.sql
├── trace.test.js
└── unicode.test.js

2 directories, 32 files
