
Compare two multiple-column csv files

[Using Python 3] I want to compare the contents of two csv files and have the script report whether they are the same. In other words, it should tell me if all rows match and, if not, how many rows are mismatched.

I would also like the flexibility to change the code later so that it writes all unmatched rows to another file.

Furthermore, although the two files should technically contain exactly the same data, the rows may not be in the same order (except for the first row, which contains the headers).

The input files look something like this:

field1  field2  field3  field4  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
...     ...     ...     ...     ...

The code I am currently running is below, but to be honest I am not sure whether this is the best (most pythonic) way. I am also not sure what the try: while 1: ... code is doing. This code is the result of my scouring the forum and the Python docs. So far the code takes a very long time to run.

As I am very new to this, I am keen to receive any feedback on the code, and would also kindly ask for an explanation of any recommendations you make.

Code:

import csv
import difflib

'''
Checks the content of two csv files and returns a message.
If there is a mismatch, it will output the number of mismatches.
'''

def compare(f1, f2):

    file1 = open(f1).readlines()
    file2 = open(f2).readlines()

    diff = difflib.ndiff(file1, file2)

    count = 0

    try:
        while 1:
            count += 1
            next(diff)
    except:
        pass

    return 'Checked {} rows and found {} mismatches'.format(len(file1), count)

print (compare('outfile.csv', 'test2.csv'))

Edit: The files can contain duplicates, so storing the rows in a set will not work (because it would remove all duplicates, right?).
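A set does drop duplicates, but a multiset such as collections.Counter keeps a count per row, so it may fit both constraints (duplicates allowed, row order ignored). The following is only a minimal sketch under some assumptions: the function names read_rows and compare_unordered are made up for illustration, the delimiter is assumed to be a comma (the sample above may well be tab-separated), and rows are compared as text, so 1.0 and 1.00 would count as different.

```python
import csv
from collections import Counter


def read_rows(path, delimiter=","):
    """Return (header, Counter of data rows) for a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)  # first row holds the field names
        # Each row becomes a tuple so it can be used as a Counter key.
        rows = Counter(tuple(row) for row in reader)
    return header, rows


def compare_unordered(path1, path2, delimiter=","):
    """Compare two CSV files, ignoring row order but respecting duplicates."""
    header1, rows1 = read_rows(path1, delimiter)
    header2, rows2 = read_rows(path2, delimiter)
    # Counter subtraction keeps only rows with a positive remaining count.
    only_in_1 = rows1 - rows2
    only_in_2 = rows2 - rows1
    return header1 == header2, only_in_1, only_in_2
```

With this, sum(only_in_1.values()) + sum(only_in_2.values()) gives the number of mismatched rows, and only_in_1.elements() yields the unmatched rows themselves, which could later be passed to csv.writer to write them to another file.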

The try-while block simply iterates over diff; you should use a for loop instead:

count = 0
for delta in diff:
    count += 1

or an even more pythonic generator expression:

count = sum(1 for delta in diff)

(The original code increments count before each call to next(), so its result is one higher than the number of items in diff. I wonder whether that is what you intended.)
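To make the off-by-one concrete, here is a small sketch that runs both counting styles on the same input. The helper names count_with_while and count_with_sum and the sample lines are invented for this illustration; the first helper mirrors the question's loop but catches only StopIteration instead of using a bare except.

```python
import difflib

# Hypothetical sample data, standing in for the lines of the two files.
old = ["a\n", "b\n", "c\n"]
new = ["a\n", "x\n", "c\n"]


def count_with_while(diff):
    """Mirror the question's loop: count is bumped before next() is called."""
    count = 0
    try:
        while True:
            count += 1
            next(diff)
    except StopIteration:
        # Raised by next() once the iterator is exhausted.
        pass
    return count


def count_with_sum(diff):
    """Count the items actually yielded by the iterator."""
    return sum(1 for _ in diff)


# ndiff returns a one-shot iterator, so each count needs a fresh one.
while_count = count_with_while(difflib.ndiff(old, new))
sum_count = count_with_sum(difflib.ndiff(old, new))
print(while_count, sum_count)
```

The first number printed is always exactly one larger than the second, because the last increment happens just before next() raises StopIteration.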

To answer your question about while 1:

Please read more about generators and iterators.

difflib.ndiff() is a generator function, which returns an iterator. The loop iterates over it by calling next(). Each time next() returns another item, the loop increments the count; when the iterator is exhausted, next() raises StopIteration, which the except clause swallows. Note that ndiff also yields unchanged lines (prefixed with two spaces), so the count is the total number of delta lines, not only the rows that differ.
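If the goal is to count only the lines that differ, the delta lines can be filtered by their prefix: ndiff marks lines unique to the first sequence with "- " and lines unique to the second with "+ ". A minimal sketch, where count_changed_lines is a name chosen for this example:

```python
import difflib


def count_changed_lines(lines1, lines2):
    """Count ndiff delta lines that mark a removal ('- ') or an addition ('+ ')."""
    diff = difflib.ndiff(lines1, lines2)
    # Unchanged lines start with two spaces and hint lines with '? ',
    # so neither is counted here.
    return sum(1 for delta in diff if delta.startswith(("- ", "+ ")))
```

Because ndiff compares the two sequences in order, rows that are merely reordered also show up as removals and additions, so this approach on its own does not ignore row order.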
