How to use DataFrame.isin without the constraint of having to match both index and value?

Question

I have two files, one with 6 million entries and the other with around 5 million. I want to compare the values of a particular column across the two dataframes. This is the code I used:

print(df1['Col1'].isin(df2['col3']).value_counts())

This matters to me because I want to see the number of True (same) and False (different) results. About 95% of the entries come back as True, but some 5% come back as False. I exported that data with to_csv and compared the columns using vimdiff, and they are all identical. Why, then, is the code labelling them as False (different)? Is there a better, more foolproof method?

Note: I have checked for whitespace in the columns as well. There is no whitespace.

PS. The pandas isin documentation states that both index and value have to match. Since one file has more entries than the other, the index does not match for those entries. How can I remove that constraint?

Answer 1

First, convert the column you pass to isin() into a list, so that only its values are compared.

Then use the result as a boolean mask on df1, because you need to get the value counts from the same column you filtered.

From your example:

print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
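A minimal sketch of what that line does, using small invented frames in place of the real files (the values and index labels below are assumptions for illustration only). It suggests that Series.isin tests membership by value alone, so passing the Series directly or as a list should give the same mask, whereas DataFrame.isin called with another DataFrame is the variant that also requires index and column labels to match. Note that the filtered form counts each matched value rather than True/False.

```python
import pandas as pd

# Toy stand-ins for the two files; the data is invented for illustration.
df1 = pd.DataFrame({"Col1": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])
df2 = pd.DataFrame({"col3": ["d", "a", "x"]}, index=[0, 1, 2])

# Series.isin with a plain list: membership by value only.
mask = df1["Col1"].isin(df2["col3"].values.tolist())

# Passing the Series itself; the differing indexes are not aligned here.
mask_from_series = df1["Col1"].isin(df2["col3"])

# True/False tally, as in the question.
true_false_counts = mask.value_counts()

# The answer's form: keep the matching rows, then count each matched value.
matched_counts = df1[mask]["Col1"].value_counts()

# DataFrame.isin with a DataFrame argument compares label by label,
# so frames with disjoint indexes produce no matches at all.
frame_result = df1.isin(df2.rename(columns={"col3": "Col1"}))

print(true_false_counts)
print(matched_counts)
print(frame_result)
```

If the True/False tally is what you need, mask.value_counts() keeps it; the filtered form is useful when you want to see which values matched.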

Try running that again.
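If the counts still look off, one possible cause is a dtype mismatch between the two columns. This is an assumption, since the question does not show the dtypes. Values such as 101 and "101" look identical once written out with to_csv, yet they do not compare equal. The sketch below uses invented data to show the check and one way to normalize both sides before comparing.

```python
import pandas as pd

# Invented data: the same IDs stored as integers in one frame
# and as strings in the other.
df1 = pd.DataFrame({"Col1": [101, 102, 103]})
df2 = pd.DataFrame({"col3": ["101", "102", "999"]})

# Inspect the dtypes first; a mismatch here is a warning sign.
print(df1["Col1"].dtype, df2["col3"].dtype)

# Comparing as-is: the integer 101 is not the string "101".
raw = df1["Col1"].isin(df2["col3"].values.tolist())

# Cast both sides to strings and strip stray whitespace, then compare.
left = df1["Col1"].astype(str).str.strip()
right = df2["col3"].astype(str).str.strip()
normalized = left.isin(right.values.tolist())

print(raw.value_counts())
print(normalized.value_counts())
```

Casting to str is only one option; if both columns are meant to be numeric, converting both to the same numeric dtype would serve the same purpose.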

solution1
0 2019-10-16 12:17:02
