How to compare two CSV files and get the difference?

I have two CSV files:

a1.csv

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/

a2.csv

city,state,link
Aguila,Arizona,http://www.co.apache.az.us

I want to get the rows of a1.csv that are not in a2.csv.

Here is my attempt:

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')

mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)

Expected output:

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf

But instead I am getting an empty result:

Empty DataFrame
Columns: [city, state, link]
Index: []

I want to check based on the first two rows; if they are the same, they should be removed.

You can use pandas to read in the two files, concatenate them, and remove all duplicate rows:

import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a,b], axis=0)
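# keep=False drops every row that occurs more than once, leaving only unique rows
diff = ab.drop_duplicates(keep=False)
print(diff)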

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
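
To see what keep=False does without touching the files, here is a minimal sketch on small in-memory frames. The sample rows and example.com links are made up for illustration and are not the data from the question. Every row that appears in both frames is dropped, so what remains is the symmetric difference. Rows are compared across all columns, so two rows with the same city and state but different links are not treated as duplicates.

```python
import pandas as pd

# Hypothetical sample data (not the files from the question).
a = pd.DataFrame({
    "city": ["Aguila", "AkChin"],
    "state": ["Arizona", "Arizona"],
    "link": ["http://example.com/a", "http://example.com/b"],
})
b = pd.DataFrame({
    "city": ["Aguila"],
    "state": ["Arizona"],
    "link": ["http://example.com/a"],
})

# Stack both frames, then drop every row that occurs more than once.
diff = pd.concat([a, b], axis=0).drop_duplicates(keep=False)
print(diff)
```

Only the AkChin row survives here, because the Aguila row is identical in both frames.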

First, concatenate the DataFrames, then drop the duplicates while keeping the first occurrence of each. Then reset the index to keep it consistent.

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)

c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
                                              # of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
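
Both approaches above compare whole rows. With the sample files in the question, the link in a2.csv differs from every link in a1.csv, so whole-row comparison would not remove anything. If the comparison is meant to use only some columns, one possible alternative is an anti-join via merge with indicator=True. This is a sketch under assumptions: the helper name rows_only_in_left, the choice of city and state as key columns, and the sample data are all hypothetical.

```python
import pandas as pd

def rows_only_in_left(left, right, on):
    """Return rows of `left` whose values in the `on` columns never occur in `right`."""
    # Deduplicate the keys so the left merge cannot multiply rows of `left`.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, on=on, how="left", indicator=True)
    # "_merge" is "left_only" for rows that found no match in `right`.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Hypothetical sample data (not the files from the question).
a = pd.DataFrame({
    "city": ["Aguila", "AkChin", "Aguila"],
    "state": ["Arizona", "Arizona", "Arizona"],
    "link": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
})
b = pd.DataFrame({
    "city": ["Aguila"],
    "state": ["Arizona"],
    "link": ["http://example.com/other"],
})

result = rows_only_in_left(a, b, on=["city", "state"])
print(result)
```

With city and state as keys, both Aguila rows are removed even though their links differ from the one in b, which may or may not be the behavior wanted; pass a different list to on to change what counts as a match.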
