How to compare two CSV files and get the difference?

Question

I have two CSV files:

a1.csv

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/

a2.csv

city,state,link
Aguila,Arizona,http://www.co.apache.az.us

I want to get the difference.

Here is my attempt:

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')

mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)

Expected output:

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf

But I am getting an empty DataFrame instead:

Empty DataFrame
Columns: [city, state, link]
Index: []

I want to check based on the first two rows and, if they are the same, remove them.

Answer 1

You can use pandas to read in the two files, concatenate them, and remove all duplicate rows:

import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
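ab = pd.concat([a, b], axis=0)
# keep=False drops every row that occurs more than once,
# so only rows found in exactly one file remain.
diff = ab.drop_duplicates(keep=False)
print(diff)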

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
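As a minimal sketch of this approach, the snippet below uses small in-memory frames in place of the CSV files. The sample rows, the example.com links, and the helper name `symmetric_difference` are made up for illustration and are not part of the original answer. One thing it shows is that `drop_duplicates` compares all columns by default, so two rows only cancel out when every field matches. The `subset` parameter restricts the comparison to chosen columns.

```python
import pandas as pd


def symmetric_difference(a, b, subset=None):
    """Return rows that occur exactly once across both frames.

    subset: optional list of column names to compare on
    (None means compare all columns).
    """
    combined = pd.concat([a, b], axis=0, ignore_index=True)
    unique_rows = combined.drop_duplicates(subset=subset, keep=False)
    return unique_rows.reset_index(drop=True)


# Hypothetical sample data standing in for a1.csv and a2.csv.
a = pd.DataFrame({
    "city": ["Aguila", "AkChin"],
    "state": ["Arizona", "Arizona"],
    "link": ["http://example.com/a", "http://example.com/b"],
})
b_same_link = pd.DataFrame({
    "city": ["Aguila"],
    "state": ["Arizona"],
    "link": ["http://example.com/a"],
})
b_other_link = pd.DataFrame({
    "city": ["Aguila"],
    "state": ["Arizona"],
    "link": ["http://example.com/other"],
})

# Full-row comparison: the identical Aguila rows cancel out.
diff_full = symmetric_difference(a, b_same_link)

# Full-row comparison with a different link: nothing cancels out.
diff_no_match = symmetric_difference(a, b_other_link)

# Comparison on city and state only: the Aguila rows cancel out again.
diff_subset = symmetric_difference(a, b_other_link, subset=["city", "state"])

print(diff_full)
print(diff_no_match)
print(diff_subset)
```

Be aware that `keep=False` also removes rows that are duplicated within a single file, not only rows shared between the two files.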

Answer 2

First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)

c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
                                              # of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
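To make the effect of the `keep` setting concrete, here is a small sketch with made-up in-memory data instead of the CSV files (the frames and variable names are illustrative assumptions). With `keep='first'` the result is the union of the rows from both frames, with one copy of each; with `keep=False` only the rows that are not repeated survive, which is closer to a difference.

```python
import pandas as pd

# Hypothetical sample data standing in for a1.csv and a2.csv.
a = pd.DataFrame({"city": ["Aguila", "AkChin"], "state": ["Arizona", "Arizona"]})
b = pd.DataFrame({"city": ["Aguila"], "state": ["Arizona"]})
c = pd.concat([a, b], axis=0)

# keep='first': one copy of every distinct row (a union).
union = c.drop_duplicates(keep="first").reset_index(drop=True)

# keep=False: every row that occurs more than once is dropped entirely.
unique_only = c.drop_duplicates(keep=False).reset_index(drop=True)

print(union)
print(unique_only)
```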

2 answers

Answer 1: score 3, posted 2018-02-08 19:56:21
Answer 2: score 1, accepted, posted 2018-02-08 20:51:23