Compare 3 CSV Files with Python Pandas

I need to compare 3 CSV files on 3 columns (the three columns have the same names in all 3 files) and count 1) what is duplicated and 2) what is different. Counts alone are fine.

For example, colB in csv1 needs to be checked against colB in csv2 and colB in csv3 to get a total count of the values that are duplicated (matched in csv2 or csv3) and a total count of the values that are different.

All 3 CSV files have the same column names: colB holds IP addresses, colC holds hash values, and colD holds domain names.

I tried this as a test of matching colB, without success:

print(df[~df.colB.isin(df1.colB)]) #prints out all columns from df

I then tried adding count():

print(df[~df.colB.isin(df1.colB).count()]) #get multiple traceback errors

Try value_counts(); it gives you the counts of the True and False values.

df.colB.isin(df1.colB).value_counts()
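As a minimal sketch of this suggestion, the snippet below uses two small made-up frames in place of the real CSV data (the IP values are placeholders, and `df` and `df1` simply stand for two of the loaded files). It shows the counts from value_counts() alongside the equivalent sum() calls.

```python
import pandas as pd

# Made-up stand-ins for two of the CSV files; real data would come from pd.read_csv.
df = pd.DataFrame({"colB": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]})
df1 = pd.DataFrame({"colB": ["10.0.0.2", "10.0.0.4", "10.0.0.9"]})

# True where a value of df.colB also appears anywhere in df1.colB.
mask = df.colB.isin(df1.colB)

# Counts of True (found in df1) and False (not found in df1).
counts = mask.value_counts()
print(counts)

# The same two numbers as plain integers.
matched = int(mask.sum())
unmatched = int((~mask).sum())
print(matched, unmatched)
```

The traceback in the question most likely comes from operator order: `.count()` is evaluated first and returns a single integer, so `~` is applied to that integer rather than to the boolean mask, and `df[...]` then receives a number instead of a mask.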

I hope this is what you are looking for.

Let's call the dataframes df1, df2, and df3.

Each column in a dataframe is a Series, so you can compare two of them to get a boolean Series:

checkB12 = (df1.colB == df2.colB)

This gives a Pandas Series object holding values such as (True, True, False, ...), one per row. The comparison is made row by row, so both Series must have the same index.

Similarly,

checkB13 = (df1.colB == df3.colB)

Then,

duplicated = checkB12 | checkB13

This gives you a Series of boolean values that is True wherever the df1 value matches the df2 value or the df3 value in the same row. Calling duplicated.sum() then gives the total number of True values, i.e. the number of cases in df1 that are duplicated at least once in df2 or df3.
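The steps above can be sketched end to end as follows. The three frames are made-up placeholders that share the same default row index, which is an assumption of this row-by-row approach. Element-wise OR on Series is written with `|`, because Python's `or` raises a ValueError when given a Series.

```python
import pandas as pd

# Made-up frames sharing the same row index; the values are placeholders.
df1 = pd.DataFrame({"colB": ["1.1.1.1", "2.2.2.2", "3.3.3.3"]})
df2 = pd.DataFrame({"colB": ["1.1.1.1", "9.9.9.9", "8.8.8.8"]})
df3 = pd.DataFrame({"colB": ["7.7.7.7", "2.2.2.2", "8.8.8.8"]})

# Row-by-row comparisons of colB.
checkB12 = df1.colB == df2.colB  # True, False, False
checkB13 = df1.colB == df3.colB  # False, True, False

# True where the df1 row matches df2 or df3 in the same position.
duplicated = checkB12 | checkB13

total_duplicated = int(duplicated.sum())
print(total_duplicated)  # 2
```

If the rows of the three files are not in the same order, a membership test with isin() is probably a better fit than `==`, since it ignores row position.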

I don't really understand what you mean by "what is different" between the dataframes, so I can't be sure what code you need.
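One possible reading, offered only as an assumption, is that "different" means values in df1 that appear in neither df2 nor df3, regardless of row position. Under that reading, a rough sketch for all three columns could look like this (the frames and the `summary` dictionary are hypothetical illustrations, not part of the question):

```python
import pandas as pd

# Made-up stand-ins for the three CSV files.
df1 = pd.DataFrame({
    "colB": ["1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "colC": ["aaa", "bbb", "ccc"],
    "colD": ["a.example", "b.example", "c.example"],
})
df2 = pd.DataFrame({
    "colB": ["2.2.2.2", "5.5.5.5"],
    "colC": ["aaa", "zzz"],
    "colD": ["x.example", "y.example"],
})
df3 = pd.DataFrame({
    "colB": ["3.3.3.3"],
    "colC": ["aaa"],
    "colD": ["c.example"],
})

summary = {}
for col in ["colB", "colC", "colD"]:
    # Pool this column's values from the other two files.
    others = pd.concat([df2[col], df3[col]])
    # True where the df1 value occurs in df2 or df3.
    found = df1[col].isin(others)
    summary[col] = {
        "duplicated": int(found.sum()),
        "different": int((~found).sum()),
    }

print(summary)
```

Here each count is per row of df1, so a value repeated in df1 is counted once for every row it occupies.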
