
How to use DataFrame.isin without the constraint of having to match both index and value?

I have two files, one with 6 million entries and the other with around 5 million entries. I want to compare the values of a particular column across the two dataframes. This is the code that I have used:

print(df1['Col1'].isin(df2['col3']).value_counts())

This is essential for me because I want to see the number of True (same) and False (different) results. About 95% of the entries come back as True, but some 5% come back as False. I extracted this data using to_csv and compared the columns using vimdiff, and they are all identical, so why does the code label them as False (different)? Is there a better, more foolproof method?

Note: I have checked for whitespace in the columns as well. There is no whitespace.

PS. The pandas isin documentation states that both index and value have to match. Since one file has more entries than the other, the index does not match for these entries. How can I remove that constraint?

First, convert the column you pass to isin() to a list.

Then use the resulting boolean mask to filter df1, because you need the value counts of the same column you filtered on.

From your example:

print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
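To see what the list conversion does and does not change, here is a minimal sketch on made-up toy data (the values below are assumptions, not the real files). As far as I can tell, the index-matching rule in the documentation applies to DataFrame.isin when it is given a Series or DataFrame, while Series.isin compares by value only, so passing the Series or a plain list should give the same mask:

```python
import pandas as pd

# Toy stand-ins for the two files; the values and lengths are made up.
df1 = pd.DataFrame({"Col1": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"col3": ["a", "d", "x"]})

# Series.isin tests membership by value only; index labels are ignored.
by_series = df1["Col1"].isin(df2["col3"])
by_list = df1["Col1"].isin(df2["col3"].values.tolist())

# DataFrame.isin given a Series aligns on the index first, so "d"
# (row 3 in df1, row 1 in df2) is not reported as a match here.
by_frame = df1[["Col1"]].isin(df2["col3"])["Col1"]

print(by_series.tolist())
print(by_list.tolist())
print(by_frame.tolist())
```

If the first two masks agree on the real data as well, the index is probably not what is causing the False results.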

Try running that again.
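If the counts still do not change, one possible cause (an assumption on my side, since I cannot see the files) is a dtype mismatch: a column read as integers and a column read as strings look identical after to_csv and in vimdiff, but isin treats "1" and 1 as different values. A rough sketch of how to check for that, using made-up data:

```python
import pandas as pd

# Hypothetical data: same digits, but strings on one side and integers on the other.
df1 = pd.DataFrame({"Col1": ["1", "2", "3"]})
df2 = pd.DataFrame({"col3": [1, 2, 3]})

# Step 1: compare the dtypes of the two columns.
print(df1["Col1"].dtype, df2["col3"].dtype)

# Step 2: the raw comparison finds no matches because "1" != 1.
raw = df1["Col1"].isin(df2["col3"])

# Step 3: normalize both sides to stripped strings before comparing.
left = df1["Col1"].astype(str).str.strip()
right = df2["col3"].astype(str).str.strip()
normalized = left.isin(right)

print(raw.value_counts())
print(normalized.value_counts())

# Step 4: repr() exposes hidden characters in rows that still fail.
print(left[~normalized].head().map(repr).tolist())
```

On the real data, running step 4 on the 5% of rows that come back False would show whether something invisible (a stray character, a different type) is left in the values.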


 