Compare two dataframes of different sizes and create a new column in pandas
I have a large dataframe, as shown below:
df1:
Date Code ab-ret
0 1997-07-02 11 NaN
1 1997-07-04 11 NaN
2 1997-07-07 11 NaN
3 1997-07-08 11 NaN
4 1997-07-10 11 NaN
... ... ... ...
377395 2017-12-22 5757 -0.046651
377396 2017-12-26 5757 -0.017728
377397 2017-12-27 5757 0.024860
377398 2017-12-28 5757 0.016094
377399 2017-12-29 5757 -0.052789
377400 rows × 3 columns
I have a smaller dataframe, as shown below:
df2:
Date Code
0 2009-03-17 11
1 2010-02-03 11
2 2011-02-14 363
3 2015-01-09 363
4 2010-10-15 365
... ... ...
9516 2015-02-24 449479
9517 2015-09-01 449479
9518 2016-04-01 449479
9519 2013-06-21 452095
9520 2015-05-06 553720
[9521 rows x 2 columns]
I want to compare the 'Date' and 'Code' columns of the two dataframes and check whether a row in df1 has the same 'Date' and the same 'Code' as some row of df2, with both matching at once. Based on that, I want to create a new column in df1 that is True if the condition is satisfied and False if it is not. How can this be done quickly? Avoiding loops is preferred, since they take a lot of time.
P.S. Not every ('Date', 'Code') pair in df2 is guaranteed to appear in df1.
Also, I want all the rows in df1 to remain; I am only looking to add a new column to df1 stating whether the corresponding 'Date' and 'Code' are present in df2. Hence, I'm not looking to merge or do an inner join.
Thus, the desired output is:
Date Code ab-ret Match
0 1997-07-02 11 NaN False
1 1997-07-04 11 NaN False
2 1997-07-07 11 NaN False
3 1997-07-08 11 NaN False
4 1997-07-10 11 NaN False
... ... ... ... ...
377395 2017-12-22 5757 -0.046651 True
377396 2017-12-26 5757 -0.017728 True
377397 2017-12-27 5757 0.024860 True
377398 2017-12-28 5757 0.016094 False
377399 2017-12-29 5757 -0.052789 True
377400 rows × 4 columns
This is a merge operation. Use the parameter indicator=True to get a column named '_merge', which is close to the 'Match' column you want to create. With how='left', every row of df1 is kept. Then convert this column to True/False as in your expected output and drop the '_merge' column.
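# A left merge keeps every row of df1. Duplicate (Date, Code) pairs are dropped
# from df2 first so that no df1 row is repeated in the result.
df1 = (df1.merge(df2[['Date', 'Code']].drop_duplicates(), on=['Date', 'Code'], how='left', indicator=True)
.assign(Match=lambda x: x['_merge'].eq('both'))
.drop(columns='_merge')
)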
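As a rough illustration of the merge approach, here is a minimal sketch on toy data. The frames and values below are made up for the example and are not the asker's data; it also assumes 'Date' and 'Code' have the same dtypes in both frames. The toy df2 deliberately contains a repeated pair to show why de-duplicating before the merge matters.

```python
import pandas as pd

# Toy stand-ins for the frames in the question (illustrative values only).
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-26", "2017-12-27", "2017-12-28"]),
    "Code": [5757, 5757, 5757],
    "ab-ret": [-0.017728, 0.024860, 0.016094],
})
df2 = pd.DataFrame({
    # The second pair appears twice on purpose.
    "Date": pd.to_datetime(["2017-12-26", "2017-12-27", "2017-12-27"]),
    "Code": [5757, 5757, 5757],
})

keys = ["Date", "Code"]
flagged = (
    # Left merge keeps all df1 rows; de-duplicating df2 avoids repeating them.
    df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)
       # '_merge' is 'both' where the pair exists in df2, 'left_only' otherwise.
       .assign(Match=lambda x: x["_merge"].eq("both"))
       .drop(columns="_merge")
)
print(flagged)
```

Without the drop_duplicates() call, the row for 2017-12-27 would appear twice in the result, so the output would no longer line up row for row with df1.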
IIUC, you could also try a tuple comparison: build a MultiIndex from 'Date' and 'Code' with pd.DataFrame.set_index() and test membership with pd.MultiIndex.isin:
df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
Example:
import numpy as np
import pandas as pd

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00'), 2: pd.Timestamp('1997-07-07 00:00:00')},
'Code': {0: 11, 1: 13, 2: 14}, 'ab-ret': {0: np.nan, 1: np.nan, 2: np.nan}}
df1=pd.DataFrame(d)
df1
# Date Code ab-ret
#0 1997-07-02 11 NaN
#1 1997-07-04 13 NaN
#2 1997-07-07 14 NaN
d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00')},
'Code': {0: 11, 1: 11}, 'ab-ret': {0: np.nan, 1: np.nan}}
df2=pd.DataFrame(d)
df2
# Date Code ab-ret
#0 1997-07-02 11 NaN
#1 1997-07-04 11 NaN
df1['Match']=df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
df1
# Date Code ab-ret Match
#0 1997-07-02 11 NaN True
#1 1997-07-04 13 NaN False
#2 1997-07-07 14 NaN False
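The same index-based check can be wrapped in a small helper if it is needed in more than one place. This is only a sketch: the function name add_match_flag and its parameters are hypothetical, and it builds the key index with pd.MultiIndex.from_frame rather than set_index, which should give the same membership test without touching the original index.

```python
import pandas as pd

def add_match_flag(left, right, keys=("Date", "Code"), name="Match"):
    """Return a copy of `left` with a boolean column that is True where the
    row's key tuple also occurs in `right` (hypothetical helper)."""
    keys = list(keys)
    left_keys = pd.MultiIndex.from_frame(left[keys])
    right_keys = pd.MultiIndex.from_frame(right[keys])
    out = left.copy()
    # isin returns a boolean array aligned by position with the rows of `left`.
    out[name] = left_keys.isin(right_keys)
    return out

# Same toy data as in the example above.
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-04", "1997-07-07"]),
    "Code": [11, 13, 14],
    "ab-ret": [float("nan")] * 3,
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-04"]),
    "Code": [11, 11],
})

result = add_match_flag(df1, df2)
print(result)
```

Because the membership test never joins the frames, duplicate pairs in df2 cannot change the number of rows, and df1 itself is left unmodified.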