Compare two multiple-column csv files

[Using Python 3] I want to compare the contents of two csv files and have the script report whether they are the same. In other words, it should tell me whether all rows match and, if not, how many rows are mismatched.

I would also like the flexibility to change the code later so that it writes all unmatched rows to another file.

Furthermore, although the two files should contain exactly the same data, the rows may not be in the same order (except for the first row, which contains the headers).

The input files look something like this:

field1  field2  field3  field4  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
...     ...     ...     ...     ...

The code I am currently running is below, but to be honest I am not sure whether this is the best (most Pythonic) way. I am also not sure what the try: while 1: ... block is doing. This code is the result of scouring the forum and the Python docs. So far the code takes a very long time to run.

As I am very new, I am keen to receive any feedback on the code, and would also kindly ask for an explanation of any recommendations you make.

Code:

import csv
import difflib

'''
Checks the content of two csv files and returns a message.
If there is a mismatch, it will output the number of mismatches.
'''

def compare(f1, f2):

    file1 = open(f1).readlines()
    file2 = open(f2).readlines()

    diff = difflib.ndiff(file1, file2)

    count = 0

    try:
        while 1:
            count += 1
            next(diff)
    except:
        pass

    return 'Checked {} rows and found {} mismatches'.format(len(file1), count)

print (compare('outfile.csv', 'test2.csv'))

Edit: The files can contain duplicate rows, so storing the rows in a set will not work (because it would remove all duplicates, right?).
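On the duplicates point: a set would indeed collapse repeated rows, but collections.Counter keeps a count per distinct row, so it can compare files whose rows are reordered. The following is only a minimal sketch under some assumptions: the files are comma-delimited, rows are compared as raw strings (so 1.0 and 1.00 count as different), and the names count_rows and compare_rows are made up for illustration.

```python
import csv
from collections import Counter


def count_rows(path):
    """Return (header, Counter of data rows) for one csv file."""
    with open(path, newline='') as handle:
        reader = csv.reader(handle)  # assumption: comma-delimited input
        header = next(reader, None)  # first row holds the field names
        # Each distinct row maps to the number of times it occurs,
        # so duplicate rows are preserved rather than collapsed.
        return header, Counter(tuple(row) for row in reader)


def compare_rows(f1, f2):
    """Compare two csv files while ignoring the order of the data rows."""
    header1, rows1 = count_rows(f1)
    header2, rows2 = count_rows(f2)
    only_in_1 = rows1 - rows2  # rows (with multiplicity) missing from f2
    only_in_2 = rows2 - rows1  # rows (with multiplicity) missing from f1
    mismatches = sum(only_in_1.values()) + sum(only_in_2.values())
    return header1 == header2, only_in_1, only_in_2, mismatches
```

The two returned Counter objects hold the unmatched rows, so they could later be written to another file with csv.writer.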

The try-while block simply iterates over diff. You should use a for loop instead:

count = 0
for delta in diff:
    count += 1

or, even more Pythonic, a generator expression:

count = sum(1 for delta in diff)

(The original code increments count before each iteration and thus gives a count higher by one. I wonder if that is correct in your case.)
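One caveat with counting every item of the delta: ndiff() also yields the lines that are common to both inputs (prefixed with two spaces), and sometimes '? ' hint lines. Here is a minimal sketch that counts only removed and added lines; the function name count_changed_lines is a made-up illustration, not part of difflib.

```python
import difflib


def count_changed_lines(lines1, lines2):
    """Count the delta lines that mark a removal ('- ') or an addition ('+ ')."""
    diff = difflib.ndiff(lines1, lines2)
    # Lines starting with two spaces are unchanged and '? ' lines are hints,
    # so neither is counted here.
    return sum(1 for delta in diff if delta.startswith(('- ', '+ ')))
```

A row that was modified counts twice this way (one '- ' line and one '+ ' line), and because ndiff() compares the lines in order, rows that were merely reordered also show up as changes.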

To answer your question about while 1:

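Please read more about generators and iterators.

difflib.ndiff() is a generator function: calling it returns an iterator. The loop steps through that iterator by calling next() and increments count on each pass. When the iterator is exhausted, next() raises StopIteration, which the bare except catches, and that ends the loop. Note that ndiff() also yields the lines that are identical in both files (prefixed with two spaces), so count reflects the number of lines in the delta, not the number of rows that differ.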
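To make the iteration explicit, here is a small sketch of what the try/while loop does, written so that it catches only StopIteration and increments the count after a successful next() call. The helper name count_items is hypothetical.

```python
import difflib


def count_items(iterator):
    """Count the items an iterator yields by calling next() by hand."""
    count = 0
    while True:
        try:
            next(iterator)
        except StopIteration:
            # The iterator is exhausted; this is the normal way the loop ends.
            break
        count += 1  # incremented only after next() succeeded
    return count


# Equivalent to: sum(1 for delta in difflib.ndiff(...))
delta_lines = count_items(difflib.ndiff(['a\n'], ['a\n']))
```

A for loop does the same thing behind the scenes, which is why it is the usual way to write this.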
