
Compare 3 csv Files with Python Pandas

I need to compare 3 CSV files on 3 columns (the columns have the same names in all 3 files) and count 1) what is duplicated and 2) what is different. Counts alone are fine.

Ex. colB in csv1 needs to be checked against colB in csv2 and csv3, giving a total count of what is duplicated (matched in csv2/3) and a total count of what is different (compared against csv2/3).

All 3 CSVs have the same column names: colB has IP addresses, colC has hash values, and colD has domain names.

As a test, I tried matching on colB, without success:

print(df[~df.colB.isin(df1.colB)]) # prints every column of df for the rows whose colB is not in df1

I then tried adding .count():

print(df[~df.colB.isin(df1.colB).count()]) # raises multiple traceback errors

Try value_counts() on the result of isin(); it returns how many values are True (matched) and how many are False (not matched).

df.colB.isin(df1.colB).value_counts()

I hope this is what you are looking for.
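The value_counts() suggestion above can be sketched as follows. This is only an illustration: the two small dataframes and their IP values are made up, and real code would load them with pd.read_csv instead.

```python
import pandas as pd

# Hypothetical stand-ins for two of the CSV files (not data from the question).
df = pd.DataFrame({"colB": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]})
df1 = pd.DataFrame({"colB": ["10.0.0.2", "10.0.0.4", "10.0.0.9"]})

# Boolean Series: True where a value of df.colB also occurs somewhere in df1.colB.
mask = df.colB.isin(df1.colB)

# Counts of True and False in one call.
counts = mask.value_counts()

# The same counts as plain integers.
matched = int(mask.sum())       # rows of df whose colB is found in df1
unmatched = int((~mask).sum())  # rows of df whose colB is not found in df1

print(counts)
print(matched, unmatched)
```

Using sum() on the mask can be more convenient than value_counts() when only one of the two numbers is needed, because value_counts() leaves out a key entirely when no row has that value.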

Let's call the dataframes df1, df2, and df3.

Each column in a dataframe is a series, so you can compare them to get a boolean series:

checkB12 = (df1.colB == df2.colB)

This gives a pandas Series of booleans, something like (True, True, False, ...). Note that == compares the two columns row by row, so both must have the same length and index.

Similarly,

checkB13 = (df1.colB == df3.colB)

Then,

duplicated = checkB12 | checkB13

This gives you a Series of boolean values that is True wherever the df1 value matches df2 or df3 in the same row. Calling duplicated.sum() gives the total number of True values, i.e. the number of rows in df1 that are duplicated in at least one of df2 and df3.

I don't really understand what you mean by "what is different" between the dataframes, so I can't be sure what code you need.
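If the files do not line up row for row, a membership test with isin() may fit better than ==, since it checks whether each value occurs anywhere in the other file. The sketch below takes "different" to mean values of df1 that appear in neither df2 nor df3, which is an assumption about the question. The sample data and the helper name count_matches are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the three CSV files.
df1 = pd.DataFrame({
    "colB": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "colC": ["aaa", "bbb", "ccc"],
    "colD": ["a.example", "b.example", "c.example"],
})
df2 = pd.DataFrame({
    "colB": ["10.0.0.2", "10.0.0.7"],
    "colC": ["aaa", "zzz"],
    "colD": ["x.example", "y.example"],
})
df3 = pd.DataFrame({
    "colB": ["10.0.0.3"],
    "colC": ["aaa"],
    "colD": ["c.example"],
})


def count_matches(base, others, column):
    """Count values of base[column] that do, or do not, appear in any other frame."""
    found = pd.Series(False, index=base.index)
    for other in others:
        # Element-wise OR on boolean Series uses |, not the Python keyword `or`.
        found = found | base[column].isin(other[column])
    return {"duplicated": int(found.sum()), "different": int((~found).sum())}


summary = {col: count_matches(df1, [df2, df3], col) for col in ["colB", "colC", "colD"]}
print(summary)
```

The helper counts rows of df1, so a value that is repeated within df1 is counted once per row; dropping duplicates first would be one way to count distinct values instead.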
