
Compare 2 csv files and output different rows to a 3rd CSV file using Python 2.7

I am trying to compare two CSV files and find the rows that are different using Python 2.7. A row is considered different when its columns are not all the same. Both files have the same columns and use this format:

oldfile.csv
ID      name     Date          Amount
1       John     6/16/2015     $3000
2       Adam     6/16/2015     $4000

newfile.csv
ID      name     Date          Amount
1       John     6/16/2015     $3000
2       Adam     6/16/2015     $4000
3       Sam      6/17/2015     $5000
4       Dan      6/17/2015     $6000

When I run my script I want the output to be just the bottom two lines, written to a CSV file. Unfortunately I simply can't get my code to work properly. What I have written below prints out the contents of oldfile.csv and does not print the different rows. What I want the code to do is write the last two lines to an output.csv file, i.e.:

output.csv
3       Sam      6/17/2015     $5000
4       Dan      6/17/2015     $6000

Here is my Python 2.7 code using the csv module.

import csv

f1 = open ("olddata/olddata.csv")
oldFile1 = csv.reader(f1)
oldList1 = []
for row in oldFile1:
    oldList1.append(row)

f2 = open ("newdata/newdata.csv")
newFile2 = csv.reader(f2)
newList2 = []
for row in newFile2:
    newList2.append(row)

f1.close()
f2.close()

output =  [row for row in oldList1 if row not in newList2]

print output

Unfortunately the code only prints out the content of oldfile.csv. I have been working on it all day and trying different variations, but I simply cannot get it to work correctly. Again, your help would be greatly appreciated.

You're currently checking for rows that exist in the old file but aren't in the new file. That's not what you want to do.

Instead, you should check for rows that exist in the new file but aren't in the old one:

output =  [row for row in newList2 if row not in oldList1]
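To make the two directions concrete, here is a small sketch that uses the sample rows from the question as in-memory lists. The names old_rows, new_rows, removed and added are made up for illustration and are not part of the original code.

```python
# Hypothetical in-memory rows standing in for the parsed CSV files.
old_rows = [
    ["1", "John", "6/16/2015", "$3000"],
    ["2", "Adam", "6/16/2015", "$4000"],
]
new_rows = old_rows + [
    ["3", "Sam", "6/17/2015", "$5000"],
    ["4", "Dan", "6/17/2015", "$6000"],
]

# Rows in the old file that are missing from the new one: nothing here.
removed = [row for row in old_rows if row not in new_rows]

# Rows in the new file that are missing from the old one: the two added rows.
added = [row for row in new_rows if row not in old_rows]
```

With this data, removed is empty and added holds the rows for Sam and Dan, which is the output the question asks for.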

Also, your CSV files are actually TSVs (tab-separated), so they won't be parsed properly with the default settings. You should tell the csv module to use a tab as the delimiter when reading your files. Your code can also be simplified.

Here's what you could use:

import csv

f1 = open ("olddata/olddata.csv")
oldFile1 = csv.reader(f1, delimiter='\t')
oldList1 = list(oldFile1)

f2 = open ("newdata/newdata.csv")
newFile2 = csv.reader(f2, delimiter='\t')
newList2 = list(newFile2)

f1.close()
f2.close()

output1 =  [row for row in newList2 if row not in oldList1]
output2 =  [row for row in oldList1 if row not in newList2]

print output1 + output2
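If you want to check how the delimiter affects parsing, a quick experiment like the one below may help. It assumes the columns really are separated by tabs; the variable names are only for illustration. csv.reader accepts any iterable of strings, so a plain list is enough for the test.

```python
import csv

# One line from the sample data, assuming the columns are tab separated.
lines = ["3\tSam\t6/17/2015\t$5000"]

# The default delimiter is a comma, so the whole line comes back as one field.
default_rows = list(csv.reader(lines))

# With a tab delimiter the line is split into its four columns.
tab_rows = list(csv.reader(lines, delimiter="\t"))
```

Here default_rows contains a single one-element row, while tab_rows contains one row with four fields.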

You can use a set if your file looks like the input provided:

import csv

with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    header = next(f1).split()
    st = set(f1)
    with open("out.csv","w") as out:
        wr = csv.writer(out, delimiter="\t")
        # write lines only if they are not in the set of lines from olddata/olddata.csv
        wr.writerows((row.split() for row in f2 if row not in st))

You don't need to create a list of the lines in newdata.csv; you can iterate over the file object and write, or do whatever you want, as you go. Also, the with statement will automatically close your files.
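The same idea can be wrapped in a function, roughly as sketched below. The name write_new_rows is hypothetical, and stripping the line endings before comparing is an addition of mine so that a final line without a trailing newline still matches. The demo uses io.StringIO in place of real files and assumes Python 3.

```python
import csv
import io


def write_new_rows(old_file, new_file, out_file):
    """Write every line of new_file that does not appear in old_file."""
    next(old_file)  # drop the old header so it is not part of the set
    seen = set(line.rstrip("\r\n") for line in old_file)
    writer = csv.writer(out_file, delimiter="\t")
    # Stream the new file line by line; no list of its rows is built.
    for line in new_file:
        line = line.rstrip("\r\n")
        if line not in seen:
            writer.writerow(line.split("\t"))


# In-memory stand-ins for the two input files (note: no final newline in old).
old = io.StringIO(
    "ID\tname\tDate\tAmount\n"
    "1\tJohn\t6/16/2015\t$3000\n"
    "2\tAdam\t6/16/2015\t$4000"
)
new = io.StringIO(
    "ID\tname\tDate\tAmount\n"
    "1\tJohn\t6/16/2015\t$3000\n"
    "2\tAdam\t6/16/2015\t$4000\n"
    "3\tSam\t6/17/2015\t$5000\n"
    "4\tDan\t6/17/2015\t$6000\n"
)
out = io.StringIO()
write_new_rows(old, new, out)
result = out.getvalue()
```

Because the old header is removed before the set is built, the header line of the new file is treated as unseen and is written first, followed by the two added rows.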

Or, without the csv module, just store the lines:

with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    header = next(f1)
    st = set(f1)
    with open("out.csv", "w") as out:
        out.writelines((line for line in f2 if line not in st))

Output:

ID      name     Date          Amount
3       Sam      6/17/2015     $5000
4       Dan      6/17/2015     $6000

Or doing it all with the csv module:

import csv
from itertools import imap
with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    r1 = csv.reader(f1, delimiter="\t")
    header = next(r1)
    # store the old rows as tuples so they are hashable
    st = set(imap(tuple, r1))
    with open("out.csv", "w") as out:
        wr = csv.writer(out, delimiter="\t")
        r2 = csv.reader(f2, delimiter="\t")
        # iterate over the parsed rows (r2), not the raw file object
        wr.writerows((row for row in imap(tuple, r2) if row not in st))

If you do not care about order and want the lines that appear in either file but not in both, you could use set.symmetric_difference:

import csv
from itertools import imap
with open("olddata/olddata.csv") as f1, open("newdata/newdata.csv") as f2:
    r1 = csv.reader(f1, delimiter="\t")
    header = next(r1)
    st = set(imap(tuple, r1))
    r2 = csv.reader(f2, delimiter="\t")
    print(st.symmetric_difference(imap(tuple, r2)))

Output:

   set([('ID', '', 'name', 'Date', 'Amount'), ('3', 'Sam', '6/17/2015', '$5000'), ('4', 'Dan', '6/17/2015', '$6000')])

If order matters, sorting the data from the set and then writing it would still be more efficient than using lists.
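As a rough sketch of the sorting idea, the symmetric difference can be sorted before it is written out. This assumes the first column is a numeric ID and, unlike the snippet above, skips the header in both inputs. The in-memory lines and the names old_lines, new_lines, changed and ordered are illustrative only.

```python
import csv

# Sample data as tab-separated lines; the new rows are deliberately out of order.
old_lines = [
    "ID\tname\tDate\tAmount",
    "1\tJohn\t6/16/2015\t$3000",
    "2\tAdam\t6/16/2015\t$4000",
]
new_lines = old_lines + [
    "4\tDan\t6/17/2015\t$6000",
    "3\tSam\t6/17/2015\t$5000",
]

old_reader = csv.reader(old_lines, delimiter="\t")
new_reader = csv.reader(new_lines, delimiter="\t")
header = next(old_reader)  # keep the header for writing later
next(new_reader)           # skip the header of the new data as well

# Rows must be tuples to be hashable members of a set.
old_set = set(map(tuple, old_reader))
changed = old_set.symmetric_difference(map(tuple, new_reader))

# Sets are unordered, so sort by the numeric ID before writing.
ordered = sorted(changed, key=lambda row: int(row[0]))
```

The sorted rows can then be passed to csv.writer(...).writerows(ordered) after writing the header.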
