Compare 3 csv Files with Python Pandas

Question

I need to compare 3 CSV files on 3 columns (the columns have the same names in all 3 files) and count 1) what is duplicated and 2) what is different (counts alone are fine).

For example, csv1 colB needs to be checked against csv2 colB and csv3 colB to get a total count of duplicates (values matched in csv2/csv3) and a total count of what is different.

All 3 CSVs have the same column names: colB has IP addresses, colC has hash values, and colD has domain names.

As a test, I tried matching on colB, without success:

print(df[~df.colB.isin(df1.colB)]) #prints out all columns from df

Then I tried adding a count:

print(df[~df.colB.isin(df1.colB).count()]) #get multiple traceback errors

Answer 1

Try value_counts(); it gives you the counts of the True and False values.

df.colB.isin(df1.colB).value_counts()

I hope this is what you are looking for.
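
As a minimal sketch of what that call returns, assuming two small made-up frames (the IP values below are illustrative and not from the question), the True count is the number of df.colB values that also appear somewhere in df1.colB, and the False count is the number that do not:

```python
import pandas as pd

# Hypothetical sample data standing in for two of the CSV files.
df = pd.DataFrame({"colB": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]})
df1 = pd.DataFrame({"colB": ["10.0.0.2", "10.0.0.4", "10.0.0.9"]})

# True where a value of df.colB appears anywhere in df1.colB.
mask = df.colB.isin(df1.colB)

counts = mask.value_counts()      # counts of True and False
matched = int(mask.sum())         # number of True values
unmatched = int((~mask).sum())    # number of False values

print(counts)
print(matched, unmatched)
```

Calling sum() on the mask, or on its negation, gives each count as a plain number when only one of them is needed.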

Answer 2

Let's call the dataframes df1, df2, and df3.

Each column in a dataframe is a series, so you can compare them to get a boolean series:

checkB12 = (df1.colB == df2.colB)

This gives a pandas Series of Boolean values, something like (True, True, False, ...), comparing the two columns row by row.

Similarly,

checkB13 = (df1.colB == df3.colB)

Then,

duplicated = checkB12 | checkB13

This gives you a series of Boolean values that is True wherever a row of df1 matches the same row of df2 or df3. Doing duplicated.sum() gives you the total number of True values, i.e. the total number of rows in df1 that are duplicated at least once in df2 or df3.
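
A small sketch of the steps above, using made-up data. It assumes the three frames have the same number of rows and the same index, because == on two Series compares them row by row and raises an error if their labels differ. Series are combined element-wise with the | operator, since the Python keyword `or` raises a ValueError on a Series.

```python
import pandas as pd

# Hypothetical frames with the same length and the same default index.
df1 = pd.DataFrame({"colB": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]})
df2 = pd.DataFrame({"colB": ["10.0.0.1", "10.0.0.9", "10.0.0.8"]})
df3 = pd.DataFrame({"colB": ["10.0.0.7", "10.0.0.2", "10.0.0.8"]})

# Row-by-row comparisons, each producing a Boolean Series.
checkB12 = df1.colB == df2.colB
checkB13 = df1.colB == df3.colB

# Element-wise OR: True where df1 matches df2 or df3 in the same row.
duplicated = checkB12 | checkB13

print(duplicated.sum())
```

If the files have different lengths or the rows are not aligned, a membership test such as isin() is probably the better fit, because it ignores row position.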

I don't really understand what you mean by "what is different" between the dataframes, so I can't be sure what code you need.
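
One possible reading, which is an assumption since the question does not define it, is that "different" means values in df1 that appear in neither df2 nor df3. Under that reading, the counts for all three columns could be sketched as follows. The helper name count_matches and the sample data are hypothetical; in practice the frames would come from pd.read_csv.

```python
import pandas as pd

def count_matches(df1, df2, df3, columns=("colB", "colC", "colD")):
    """Count, per column, the df1 values found / not found in df2 or df3."""
    rows = {}
    for col in columns:
        # Pool this column's values from the other two frames.
        others = pd.concat([df2[col], df3[col]], ignore_index=True)
        mask = df1[col].isin(others)
        rows[col] = {
            "duplicated": int(mask.sum()),     # found in df2 or df3
            "different": int((~mask).sum()),   # found in neither
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up sample data with the column names from the question.
df1 = pd.DataFrame({"colB": ["10.0.0.1", "10.0.0.2"],
                    "colC": ["aa", "bb"],
                    "colD": ["a.com", "b.com"]})
df2 = pd.DataFrame({"colB": ["10.0.0.2"], "colC": ["cc"], "colD": ["a.com"]})
df3 = pd.DataFrame({"colB": ["10.0.0.3"], "colC": ["dd"], "colD": ["b.com"]})

summary = count_matches(df1, df2, df3)
print(summary)
```

This counts rows of df1, so a value repeated within df1 is counted once per occurrence.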

2 answers

solution1
1 ACCEPTED 2021-03-29 16:49:31

solution2
1 2021-03-29 20:56:44
