Compare 2 csv files and output different rows to a 3rd CSV file using Python 2.7

I am trying to compare two CSV files and find the rows that differ, using Python 2.7. Two rows are considered different when not all of their columns match. Both files have the same format and the same columns, as shown below.

oldfile.csv
ID      name     Date          Amount
1       John     6/16/2015     $3000
2       Adam     6/16/2015     $4000

newfile.csv
ID      name     Date          Amount
1       John     6/16/2015     $3000
2       Adam     6/16/2015     $4000
3       Sam      6/17/2015     $5000
4       Dan      6/17/2015     $6000

When I run my script, I want the output to be just the bottom two lines, written to a CSV file. Unfortunately, I simply can't get my code to work properly. What I have written below prints out the contents of oldfile.csv and does not print the differing rows. What I want the code to do is write the last two lines to an output.csv file, i.e.:

output.csv
3       Sam      6/17/2015     $5000
4       Dan      6/17/2015     $6000

Here is my Python 2.7 code, using the csv module.

import csv

f1 = open ("olddata/olddata.csv")
oldFile1 = csv.reader(f1)
oldList1 = []
for row in oldFile1:
    oldList1.append(row)

f2 = open ("newdata/newdata.csv")
newFile2 = csv.reader(f2)
newList2 = []
for row in newFile2:
    newList2.append(row)

f1.close()
f2.close()

output =  [row for row in oldList1 if row not in newList2]

print output

Unfortunately, the code only prints out the contents of oldfile.csv. I have been working on it all day and trying different variations, but I simply cannot get it to work correctly. Again, your help would be greatly appreciated.

You're currently checking for rows that exist in the old file but aren't in the new file. That's not what you want to do.

Instead, you should check for rows that exist in the new file but aren't in the old one:

output =  [row for row in newList2 if row not in oldList1]
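To see why the direction of the comparison matters, here is a minimal sketch using in-memory rows that mirror the sample data from the question. The names old_rows, new_rows, removed and added are illustrative and do not come from the original code.

```python
# Rows mirroring the sample files in the question (hypothetical in-memory data).
old_rows = [
    ["1", "John", "6/16/2015", "$3000"],
    ["2", "Adam", "6/16/2015", "$4000"],
]
new_rows = old_rows + [
    ["3", "Sam", "6/17/2015", "$5000"],
    ["4", "Dan", "6/17/2015", "$6000"],
]

# Rows in the old file that are missing from the new file: none in this data.
removed = [row for row in old_rows if row not in new_rows]

# Rows in the new file that are missing from the old file: the two added rows.
added = [row for row in new_rows if row not in old_rows]

print(removed)
print(added)
```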

Also, your CSV files are actually tab-separated (TSV), so they won't be parsed properly with the default settings. You should tell the csv module to use a tab as the delimiter. Your code can also be simplified.
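As a rough illustration of the delimiter issue, the sketch below parses one tab-separated line twice. It assumes the fields really are separated by single tab characters. The line is passed as a list of strings, which csv.reader accepts in place of a file object.

```python
import csv

# One sample line, assuming the fields are separated by tab characters.
lines = ["1\tJohn\t6/16/2015\t$3000"]

# Default delimiter is a comma, so the whole line ends up in a single field.
default_rows = list(csv.reader(lines))

# With delimiter="\t" the line is split into its four columns.
tab_rows = list(csv.reader(lines, delimiter="\t"))

print(default_rows)
print(tab_rows)
```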

Here's what you could use:

import csv

f1 = open ("olddata/olddata.csv")
oldFile1 = csv.reader(f1, delimiter='\t')
oldList1 = list(oldFile1)

f2 = open ("newdata/newdata.csv")
newFile2 = csv.reader(f2, delimiter='\t')
newList2 = list(newFile2)

f1.close()
f2.close()

output1 =  [row for row in newList2 if row not in oldList1]
output2 =  [row for row in oldList1 if row not in newList2]

print output1 + output2
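The question asks for the result in a file, whereas the snippet above only prints it. One possible way to write the new rows out is sketched here. The function name write_new_rows and its parameters are hypothetical, and the code is written for Python 3; on Python 2.7 the csv documentation recommends opening the files in "rb" and "wb" mode instead of passing newline="".

```python
import csv

def write_new_rows(old_path, new_path, out_path, delimiter="\t"):
    """Write rows found in new_path but not in old_path to out_path."""
    with open(old_path, newline="") as f_old:
        # Tuples are hashable, so the old rows can live in a set for fast lookup.
        old_rows = set(tuple(row) for row in csv.reader(f_old, delimiter=delimiter))
    with open(new_path, newline="") as f_new, open(out_path, "w", newline="") as f_out:
        writer = csv.writer(f_out, delimiter=delimiter)
        for row in csv.reader(f_new, delimiter=delimiter):
            # A header shared by both files is in old_rows, so it is skipped too.
            if tuple(row) not in old_rows:
                writer.writerow(row)
```

Calling write_new_rows("olddata/olddata.csv", "newdata/newdata.csv", "output.csv") would then leave only the added rows in output.csv, assuming both files share the same header line.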

You can use a set if your file looks like the input provided:

import csv

with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    header = next(f1).split()  # skip the header line of the old file
    st = set(f1)
    with open("out.csv", "w") as out:
        wr = csv.writer(out, delimiter="\t")
        # write lines only if they are not in the set of lines from olddata/olddata.csv
        wr.writerows((row.split() for row in f2 if row not in st))

You don't need to create a list of the lines in newdata.csv; you can iterate over the file object and write, or do whatever you want, as you go. Also, with will automatically close your files.

Or without the csv module just store the lines:

with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    header = next(f1)
    st = set(f1)
    with open("out.csv", "w") as out:
        out.writelines((line for line in f2 if line not in st))

Output:

ID      name     Date          Amount
3       Sam      6/17/2015     $5000
4       Dan      6/17/2015     $6000

Or doing it all with the csv module:

import csv
from itertools import imap
with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    r1 = csv.reader(f1, delimiter="\t")
    header = next(r1)
    st = set(imap(tuple, r1))
    with open("out.csv", "w") as out:
        wr = csv.writer(out, delimiter="\t")
        r2 = csv.reader(f2, delimiter="\t")
        # iterate over the csv reader r2 (not the raw file f2) so rows are parsed into fields
        wr.writerows((row for row in imap(tuple, r2) if row not in st))

If you do not care about order and want the lines that appear in either file but not in both, you could use set.symmetric_difference:

import csv
from itertools import imap
with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    r1 = csv.reader(f1, delimiter="\t")
    header = next(r1)
    st = set(imap(tuple, r1))
    r2 = csv.reader(f2, delimiter="\t")
    print(st.symmetric_difference(imap(tuple, r2)))

Output:

   set([('ID', '', 'name', 'Date', 'Amount'), ('3', 'Sam', '6/17/2015', '$5000'), ('4', 'Dan', '6/17/2015', '$6000')])
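The difference between a one-way set difference and set.symmetric_difference can be seen on a small in-memory example. The row for "Eve" is hypothetical and was added only so that the old data contains something the new data lacks; the variable names are illustrative as well.

```python
# Old data, including one hypothetical row ("Eve") that is absent from the new data.
old_rows = {
    ("1", "John", "6/16/2015", "$3000"),
    ("2", "Adam", "6/16/2015", "$4000"),
    ("5", "Eve", "6/15/2015", "$1000"),
}
new_rows = [
    ("1", "John", "6/16/2015", "$3000"),
    ("2", "Adam", "6/16/2015", "$4000"),
    ("3", "Sam", "6/17/2015", "$5000"),
    ("4", "Dan", "6/17/2015", "$6000"),
]

# One-way difference: rows that appear only in the new data.
only_new = set(new_rows) - old_rows

# Symmetric difference: rows that appear in exactly one of the two inputs.
# The method form accepts any iterable, so new_rows can stay a list.
either = old_rows.symmetric_difference(new_rows)

print(only_new)
print(either)
```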

Sorting the resulting data before writing it would still be more efficient than using lists.
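Since sets do not preserve order, the result can be sorted before it is written. Here is a minimal sketch, assuming the first column is an integer ID that can serve as the sort key (that assumption and the variable names are not from the original answer). Membership tests against a set take roughly constant time on average, while each test against a list scans the list, which is where the efficiency gain comes from.

```python
old_rows = {
    ("1", "John", "6/16/2015", "$3000"),
    ("2", "Adam", "6/16/2015", "$4000"),
}
new_rows = [
    ("4", "Dan", "6/17/2015", "$6000"),
    ("1", "John", "6/16/2015", "$3000"),
    ("3", "Sam", "6/17/2015", "$5000"),
    ("2", "Adam", "6/16/2015", "$4000"),
]

# Set difference loses the original order of the rows.
diff = set(new_rows) - old_rows

# Restore a stable order by sorting on the ID column (assumed to be an integer).
ordered = sorted(diff, key=lambda row: int(row[0]))

print(ordered)
```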
