Check if one dataframe exists in another
I have two dataframes, Overall and df2.
Overall:
Time ID_1 ID_2
2020-02-25 09:24:14 140209 81625000
2020-02-25 09:24:14 140216 91625000
2020-02-25 09:24:18 140219 80250000
2020-02-25 09:24:18 140221 90250000
25/02/2020 09:42:02 143982 39075000
df2:
ID_1 ID_2 Time Match?
140209 81625000 25/02/2020 09:24:14 no_match
143983 44075000 25/02/2020 09:42:02 no_match
143982 39075000 25/02/2020 09:42:02 match
143984 39075000 25/02/2020 09:42:02 no_match
I want to check whether each row of df2 exists in Overall and, if so, whether df2.Match? for that same row says match. If it does, return a new column saying yes; if it does not say match, return no.
I have tried:
Overall_1 = pds.merge(Overall, df2, on=….., how='left', indicator= 'Exist')
Overall_1.drop([...], inplace = True, axis =1 )
Overall_1['Exist']= np.where((Overall_1.Exist =='both') & (Overall_1.Match? == match), 'yes', 'no')
But I get an error:
TypeError: Cannot perform 'rand_' with a dtyped [bool] array and scalar of type [float]
So the resulting Overall_1 dataframe should look like:
Time ID_1 ID_2 Exist
2020-02-25 09:24:14 140209 81625000 No
2020-02-25 09:24:14 140216 91625000 NaN
2020-02-25 09:24:18 140219 80250000 NaN
2020-02-25 09:24:18 140221 90250000 NaN
25/02/2020 09:42:02 143982 39075000 Yes
Using merge and np.select:
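import numpy as np
import pandas as pd

# df1 = Overall
# A left merge keeps every row of df1; the indicator column records whether
# the row was also found in df2 ('both') or only in df1 ('left_only').
df3 = pd.merge(df1, df2, on=['ID_1', 'ID_2', 'Time'], how='left', indicator='Exists')
col1 = df3['Match?']
col2 = df3['Exists']
conditions = [col1.eq('match') & col2.eq('both'),
              col1.eq('no_match') & col2.eq('both')]
choices = ['yes', 'no']
# Rows matching neither condition take the default. Mixed with string choices,
# np.nan shows up as the string 'nan' in the output below (newer NumPy
# releases may reject this mix of str and float instead).
df3['Exists'] = np.select(conditions, choices, default=np.nan)
print(df3.drop('Match?', axis=1))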
Time ID_1 ID_2 Exists
0 2020-02-25 09:24:14 140209 81625000 no
1 2020-02-25 09:24:14 140216 91625000 nan
2 2020-02-25 09:24:18 140219 80250000 nan
3 2020-02-25 09:24:18 140221 90250000 nan
4 2020-02-25 09:42:02 143982 39075000 yes
Or, more simply, using a replace dict and .merge:
df3 = pd.merge(df1,df2,on=['ID_1','ID_2','Time'],how='left')\
.replace({'no_match' : 'no',
'match' : 'yes'})\
.rename(columns={'Match?' : 'Exists'})
print(df3)
Time ID_1 ID_2 Exists
0 2020-02-25 09:24:14 140209 81625000 no
1 2020-02-25 09:24:14 140216 91625000 NaN
2 2020-02-25 09:24:18 140219 80250000 NaN
3 2020-02-25 09:24:18 140221 90250000 NaN
4 2020-02-25 09:42:02 143982 39075000 yes
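To try the merge-based approach end to end, here is a minimal, self-contained sketch. It rebuilds the sample frames from the question and assumes Time is parsed to datetime in both frames, since the samples show two different date formats and string keys would not match. It also swaps np.select for Series.map plus Series.where, so rows missing from df2 stay as real NaN rather than the string 'nan'. The names overall, merged, in_both and result are illustrative only.

```python
import pandas as pd

# Sample data copied from the question; Time is parsed so both frames share a dtype.
overall = pd.DataFrame({
    "Time": pd.to_datetime([
        "2020-02-25 09:24:14", "2020-02-25 09:24:14",
        "2020-02-25 09:24:18", "2020-02-25 09:24:18",
        "2020-02-25 09:42:02",
    ]),
    "ID_1": [140209, 140216, 140219, 140221, 143982],
    "ID_2": [81625000, 91625000, 80250000, 90250000, 39075000],
})
df2 = pd.DataFrame({
    "ID_1": [140209, 143983, 143982, 143984],
    "ID_2": [81625000, 44075000, 39075000, 39075000],
    "Time": pd.to_datetime(
        ["25/02/2020 09:24:14", "25/02/2020 09:42:02",
         "25/02/2020 09:42:02", "25/02/2020 09:42:02"],
        format="%d/%m/%Y %H:%M:%S",
    ),
    "Match?": ["no_match", "no_match", "match", "no_match"],
})

# Left merge: keep every row of overall and record where each row was found.
merged = overall.merge(df2, on=["ID_1", "ID_2", "Time"], how="left", indicator="Exists")
in_both = merged["Exists"].eq("both")

# Translate Match? to yes/no, keeping a value only for rows present in both frames.
result = merged.drop(columns=["Match?", "Exists"])
result["Exists"] = merged["Match?"].map({"match": "yes", "no_match": "no"}).where(in_both)
print(result)
```

Because the missing rows are real NaN here, result["Exists"].isna() can be used to select the rows of Overall that have no counterpart in df2.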
You can try: df_diff = pd.concat([Overall, df2]).drop_duplicates(keep=False)
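Note that this one-liner answers a slightly different question: with keep=False it drops every row that appears more than once, so what remains are the rows found in only one of the two frames, not a yes/no column. The sketch below illustrates this on a small subset of the question's data. It assumes the comparison is restricted to shared key columns (here ID_1 and ID_2, via the hypothetical keys list), because with differing columns such as Match? no rows would count as duplicates, and it assumes neither frame contains internal duplicates.

```python
import pandas as pd

# Key columns shared by both frames (an assumption for this illustration).
keys = ["ID_1", "ID_2"]

overall = pd.DataFrame({"ID_1": [140209, 140216, 143982],
                        "ID_2": [81625000, 91625000, 39075000]})
df2 = pd.DataFrame({"ID_1": [140209, 143983, 143982],
                    "ID_2": [81625000, 44075000, 39075000]})

# Stack both frames; keep=False removes every row that occurs in both,
# leaving only the rows unique to one frame.
df_diff = pd.concat([overall[keys], df2[keys]]).drop_duplicates(keep=False)
print(df_diff)
```

The rows shared by both frames disappear from df_diff, so this is useful for finding differences but needs a merge (as in the answer above) to label each row of Overall.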