I have two dataframes, df1 and df2, that contain two columns, col1 and col2. I would like to calculate the number of elements in column col1 of df1 that are equal to col2 of df2. How can I do that?
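I assume you're using pandas.
One way is to use pd.merge to join col1 of df1 against col2 of df2, and take the length of the result. Because the two columns have different names, use left_on and right_on rather than on:
len(pd.merge(df1, df2, left_on="col1", right_on="col2"))
Pandas does an inner merge by default, so only rows with a matching value are kept. Note that values duplicated on both sides multiply the number of matching rows.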
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
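The merge approach can be sketched as follows. This is a minimal example, not a definitive implementation: the data values are illustrative (the same ones used in the isin example further down), and the names merged, n_matches, dup1, dup2 and n_dup_rows are made up for this sketch.

```python
import pandas as pd

# Illustrative data (assumed values, not from the question)
df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"col2": [1, 3, 5, 7]})

# Inner join (the default): keep only rows where df1.col1 equals df2.col2
merged = pd.merge(df1, df2, left_on="col1", right_on="col2")
n_matches = len(merged)

# Caveat: a value duplicated on both sides yields one row per pairing
dup1 = pd.DataFrame({"col1": [1, 1, 2]})
dup2 = pd.DataFrame({"col2": [1, 1, 3]})
n_dup_rows = len(pd.merge(dup1, dup2, left_on="col1", right_on="col2"))
```

Here n_matches counts the values 1, 3 and 5, while n_dup_rows shows that two 1s on each side produce four merged rows, so the merge length is only a reliable count when at least one side has no duplicates.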
You can use Series.isin, i.e. df1.col1.isin(df2.col2).sum():
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})
nb_common_elements = df1.col1.isin(df2.col2).sum()
assert nb_common_elements == 3
Be cautious depending on your use case because:
df1 = pd.DataFrame({'col1': [1, 1, 1, 2, 7]})
df1.col1.isin(df2.col2).sum()
Would return 4 and not 2, because every occurrence of 1 in df1.col1 is counted separately as present in df2.col2. If that's not the expected behaviour, you could drop duplicates from df1.col1 before testing the intersection size:
df1.col1.drop_duplicates().isin(df2.col2).sum()
Which in this example would return 2.
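To make the difference concrete, here is a small self-contained sketch that computes both counts on the data from this answer. The set-based cross-check at the end is not part of the original answer; it is a swapped-in alternative using plain Python sets, and the variable names are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 1, 2, 7]})
df2 = pd.DataFrame({"col2": [1, 3, 5, 7]})

# Counts every row of df1 whose value appears in df2.col2
count_with_duplicates = df1.col1.isin(df2.col2).sum()

# Counts each distinct value of df1.col1 at most once
count_distinct = df1.col1.drop_duplicates().isin(df2.col2).sum()

# Cross-check with Python sets (alternative technique, not from the answer)
count_via_sets = len(set(df1.col1) & set(df2.col2))
```

The distinct count and the set intersection should agree, since both ignore how often a value repeats.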
To better understand why this is happening, you can have a look at what .isin returns:
df1['isin df2.col2'] = df1.col1.isin(df2.col2)
Which gives:
   col1  isin df2.col2
0     1           True
1     1           True
2     1           True
3     2          False
4     7           True
Now .sum() adds up the booleans in the isin df2.col2 column (True counts as 1, False as 0), giving a total of 4.
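As a small sketch of the same idea, the boolean mask can also be used to select the matching rows, which makes it easy to see what .sum() is counting. The names mask, matched_values and n_true are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 1, 2, 7]})
df2 = pd.DataFrame({"col2": [1, 3, 5, 7]})

# Boolean Series with one entry per row of df1
mask = df1.col1.isin(df2.col2)

# Rows of df1 where the mask is True
matched_values = df1.loc[mask, "col1"].tolist()

# Summing booleans treats True as 1 and False as 0
n_true = int(mask.sum())
```

The number of selected rows equals the sum of the mask, so inspecting matched_values is a quick way to check whether duplicates are inflating the count.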