
Diff two folders using Python - having same set of subfolders and file structures

I am trying to write a function in Python that compares two folders (with exactly the same subdirectory structure and file lists).

The folders are bound to contain .c, .h, .txt, .cat, .sys and .pdb files, but the main focus is on C and header files.

The output of this diff(folder1, folder2) should be the following:

  1. Print newly added driver files (.c and .h only) in folder2.

  2. Print driver files deleted from folder2 as compared to folder1.

(This can be done with a two-level for loop that stores the os.walk results in two lists and then subtracts one from the other.)

  3. My challenge: the diff() function should start comparing files in all subdirectories, and as soon as it finds even one modified .c or .h file in folder2, it should end the diff routine and return FLAG = 1.

e.g.

folder1\foo\bar\1.c folder2\foo\bar\1.c --> same

folder1\foo\car\2.c folder2\foo\car\2.c --> same

folder1\foo\dar\13.c folder2\foo\dar\13.c --> different -> return flag=1

folder1\foo\far\211.c folder2\foo\far\211.c --> not compared

I have tried to use the os.walk(path) function to do this and store all the files in two separate lists, but I find it extremely long and complicated for the many files in this location.

Also, if there is a way to ignore Perforce headers, comments and extra spacing in the comparison, it would enhance my script.

Any advice is greatly appreciated.

Here's a rough outline of what I would do:

  • Create a set() of folder1 file names, folder1_set.
  • os.walk() is your friend: load up folder1 and start to walk the files.
  • For each file, open it. Modify the path to point to the second folder, open the second file and check for equality.
  • Add the file name to folder1_set.
  • Finally, we have to keep track of files that could be missing from folder2. Using os.walk() on folder2, you can build another set() of file names, folder2_set.
  • folder1_set - folder2_set gives you any items that are in folder1 but not in folder2, and folder2_set - folder1_set gives you the opposite.
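The outline above could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper names list_sources, same_content and diff are made up for this example, only .c and .h files are considered, and files are compared byte for byte by reading them whole, which is fine for source files but not for very large binaries.

```python
import os

EXTENSIONS = (".c", ".h")  # assumption: only C sources and headers matter


def list_sources(root):
    """Return the set of .c/.h file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in EXTENSIONS:
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
    return found


def same_content(path1, path2):
    """Compare two files byte for byte."""
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        return f1.read() == f2.read()


def diff(folder1, folder2):
    """Print added/deleted sources; return 1 at the first modified file, else 0."""
    folder1_set = list_sources(folder1)
    folder2_set = list_sources(folder2)
    print("Added in folder2:", sorted(folder2_set - folder1_set))
    print("Deleted from folder2:", sorted(folder1_set - folder2_set))
    # Only files present on both sides can be "modified"
    for rel in sorted(folder1_set & folder2_set):
        if not same_content(os.path.join(folder1, rel),
                            os.path.join(folder2, rel)):
            return 1  # stop at the first difference
    return 0
```

Because both sets are built before any content comparison starts, the added and deleted listings are complete even when the function returns early on the first modified file.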

Edit 1

  • To deal with comments, you can read the file line by line, using .strip() on each line to remove whitespace at either end, then check for the presence of /* at the start of the stripped line.
  • For single-line comments you must also check whether */ is present at the end of the line and, if so, exclude the line.
  • For multi-line comments you can ignore all lines until you come across */ at the end of a line.
  • This also has the benefit of clearing out empty lines (a line containing only whitespace becomes empty once stripped).
  • Check out os.path.splitext to easily find the file extensions whilst walking.
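A minimal sketch of that line-by-line filtering might look like this. The function names significant_lines and same_ignoring_comments are hypothetical, and the sketch only recognises comments that occupy whole lines, exactly as the bullets describe: it does not handle // comments, comments trailing code on the same line, or a line that starts with a comment and then continues with code (which it would misread as the start of a block comment).

```python
def significant_lines(path):
    """Return stripped, non-empty lines of a C file, skipping whole-line /* */ comments."""
    kept = []
    in_comment = False
    with open(path, "r", errors="replace") as handle:
        for raw in handle:
            line = raw.strip()
            if in_comment:
                # Inside a multi-line comment: skip until a line ending in */
                if line.endswith("*/"):
                    in_comment = False
                continue
            if line.startswith("/*"):
                # Single-line comment if it also ends with */, otherwise a block opens
                if not line.endswith("*/"):
                    in_comment = True
                continue
            if line:  # drop lines that were empty or whitespace only
                kept.append(line)
    return kept


def same_ignoring_comments(path1, path2):
    """True if both files have the same code lines once comments and padding are dropped."""
    return significant_lines(path1) == significant_lines(path2)
```

Since a Perforce header usually sits in a comment block at the top of the file, this kind of filter would skip it as well, assuming the header follows the whole-line comment form described above.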

You should be able to accomplish this using Python's filecmp library.
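As a small illustration of what the library offers, the sketch below reads the three dircmp attributes that map onto the question: left_only, right_only and diff_files. The wrapper name summarize is made up for this example, and it looks only at the top level of the two directories. Note that dircmp compares files shallowly by default, so two files with identical size and modification time are treated as equal without reading their contents.

```python
from filecmp import dircmp


def summarize(dir1, dir2):
    """Return (left_only, right_only, diff_files) for the top level of two directories."""
    dcmp = dircmp(dir1, dir2)
    # left_only:  names found only in dir1 (deleted from dir2)
    # right_only: names found only in dir2 (added in dir2)
    # diff_files: names found in both whose contents differ
    return sorted(dcmp.left_only), sorted(dcmp.right_only), sorted(dcmp.diff_files)
```

Subdirectories common to both sides are available through dcmp.subdirs, a dictionary of further dircmp objects, which is what makes a recursive walk possible.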

EDITED ANSWER

Addressing additional comments by @DennisNinj

Thanks. Is there any way to include only .c and .h files in the comparison? I have more than 20 types of files and over 1000 files in each folder. – Dennis Ninj

@DennisNinj Yes, it's possible, just a bit more tricky. The currently published version of filecmp.dircmp does not support wildcard or regex matching for its "ignore" and "hide" filters. (There's been a patch submitted to support wildcards in dircmp.) So it means you have to do the filtering manually.

Here is an updated example that gets you closer to what you're looking to accomplish. ATTENTION: Do note that, because the method must stop executing once a differing C or header file is found, you may not get the listing of every added or deleted driver in the compared directories, since it may not have had a chance to traverse all subdirectories.

ccodedircomparison.py

import re
from filecmp import dircmp


def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("FLAG = 1")


def diffs_found(dcmp):
    # Only file names ending in .c or .h are of interest
    c_files_regex = re.compile(r".*\.[ch]$")
    # C/header files present only in the left directory were deleted from the right one
    deleted_drivers = [name for name in dcmp.left_only if c_files_regex.match(name)]
    if deleted_drivers:
        print("Drivers deleted from {dirname}: [{deleted_drivers_list}]".format(dirname=dcmp.right, deleted_drivers_list=', '.join(deleted_drivers)))
    # C/header files present only in the right directory were added
    added_drivers = [name for name in dcmp.right_only if c_files_regex.match(name)]
    if added_drivers:
        print("Drivers added to {dirname}: [{added_drivers_list}]".format(dirname=dcmp.right, added_drivers_list=', '.join(added_drivers)))
    # Stop as soon as any C/header file common to both sides differs
    differing_c_files = [name for name in dcmp.diff_files if c_files_regex.match(name)]
    if differing_c_files:
        print("C files whose content differs ({dirname}): [{differing_c_files}]".format(dirname=dcmp.right, differing_c_files=', '.join(differing_c_files)))
        return True
    # Recurse into every common subdirectory, not just the first one
    for sub_dcmp in dcmp.subdirs.values():
        if diffs_found(sub_dcmp):
            return True
    return False

if __name__ == '__main__':
    main()

Example output

(.virtualenvs)macbook:dircomparison joeyoung$ python ccodedircomparison.py 
Drivers deleted from /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [thisismissingfromtest2.c]
Drivers added to /Users/joeyoung/web/stackoverflow/dircomparison/test2/support: [addedfile1.h]
C files whose content differs (/Users/joeyoung/web/stackoverflow/dircomparison/test2/support): [samefilenamedifftext1.h, samefilename1.c]
FLAG = 1
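If the complete added/deleted listing matters more than stopping at the first difference, one possible variant is to walk the whole comparison tree first and decide on the flag afterwards. The sketch below is an assumption-laden illustration, not part of the script above: the function name collect and the "/"-joined relative paths are choices made for this example. Like the script above, it only sees files directly listed in left_only and right_only, so files inside a directory that exists on one side only are not reported.

```python
import re
from filecmp import dircmp

C_FILE = re.compile(r".*\.[ch]$")  # file names ending in .c or .h


def collect(dcmp, prefix=""):
    """Walk the whole comparison tree; return (deleted, added, differing) .c/.h paths."""
    def pick(names):
        return [prefix + name for name in names if C_FILE.match(name)]

    deleted = pick(dcmp.left_only)
    added = pick(dcmp.right_only)
    differing = pick(dcmp.diff_files)
    # Visit every common subdirectory, accumulating results instead of returning early
    for name, sub_dcmp in sorted(dcmp.subdirs.items()):
        sub_deleted, sub_added, sub_differing = collect(sub_dcmp, prefix + name + "/")
        deleted += sub_deleted
        added += sub_added
        differing += sub_differing
    return deleted, added, differing
```

A caller could then print the first two lists and set the flag with something like flag = 1 if differing else 0, at the cost of always traversing every subdirectory.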

Test environment directory structure

(.virtualenvs)macbook:dircomparison joeyoung$ tree test1 test2
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   ├── package.json
│   ├── thisismissingfromtest2.c
│   └── thisismissingfromtest2.txt
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── onlyintest1.txt
│   ├── prepare.db
│   ├── samefilename1.c
│   ├── samefilename1.txt
│   ├── samefilenamedifftext1.h
│   ├── samefilenamedsametext1.h
│   ├── script.sql
│   ├── thisismissingfromtest2.c
│   └── thisismissingfromtest2.txt
├── trace.test.js
└── unicode.test.js
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── addedfile1.h
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── prepare.db
│   ├── samefilename1.c
│   ├── samefilename1.txt
│   ├── samefilenamedifftext1.h
│   ├── samefilenamedsametext1.h
│   └── script.sql
├── trace.test.js
└── unicode.test.js

ORIGINAL ANSWER BEFORE THE EDIT IS BELOW

My example doesn't do exactly what you describe, but there should be enough between this example and the filecmp.dircmp() documentation to get you started.

dircomparison.py

from filecmp import dircmp

def main():
    dcmp = dircmp("/Users/joeyoung/web/stackoverflow/dircomparison/test1", "/Users/joeyoung/web/stackoverflow/dircomparison/test2")
    if diffs_found(dcmp):
        print("DIFFS FOUND!")
    else:
        print("NO DIFFS FOUND")


def diffs_found(dcmp):
    # report_full_closure() prints its report itself and returns None,
    # so wrapping it in print() also emits a trailing "None" line
    if len(dcmp.left_only) > 0:
        print(dcmp.report_full_closure())
        return True
    elif len(dcmp.right_only) > 0:
        print(dcmp.report_full_closure())
        return True
    else:
        # Nothing added or deleted at this level: check the common subdirectories
        for sub_dcmp in dcmp.subdirs.values():
            if diffs_found(sub_dcmp):
                return True
    return False

if __name__ == '__main__':
    main()

Example output

(.virtualenvs)macbook:dircomparison joeyoung$ python dircomparison.py 
diff /Users/joeyoung/web/stackoverflow/dircomparison/test1/support /Users/joeyoung/web/stackoverflow/dircomparison/test2/support
Only in /Users/joeyoung/web/stackoverflow/dircomparison/test1/support : ['onlyintest1.txt']
Identical files : ['createdb.js', 'elmo.png', 'helper.js', 'prepare.db', 'script.sql']
None
DIFFS FOUND!

The actual directory structures, so you can see what my test environment looked like.

(.virtualenvs)macbook:dircomparison joeyoung$ tree test1
test1
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── onlyintest1.txt
│   ├── prepare.db
│   └── script.sql
├── trace.test.js
└── unicode.test.js

2 directories, 33 files
(.virtualenvs)macbook:dircomparison joeyoung$ tree test2
test2
├── affected.test.js
├── blob.test.js
├── cache.test.js
├── constants.test.js
├── database_fail.test.js
├── each.test.js
├── exec.test.js
├── extension.test.js
├── fts-content.test.js
├── issue-108.test.js
├── map.test.js
├── named_columns.test.js
├── named_params.test.js
├── null_error.test.js
├── nw
│   ├── Makefile
│   ├── index.html
│   └── package.json
├── open_close.test.js
├── other_objects.test.js
├── parallel_insert.test.js
├── prepare.test.js
├── profile.test.js
├── rerun.test.js
├── scheduling.test.js
├── serialization.test.js
├── support
│   ├── createdb.js
│   ├── elmo.png
│   ├── helper.js
│   ├── prepare.db
│   └── script.sql
├── trace.test.js
└── unicode.test.js

2 directories, 32 files

