简体   繁体   中英

Compare two dataframes with different size and create a new column in Pandas

I've a large dataframe as shown below:

df1:
         Date      Code  ab-ret
0       1997-07-02  11     NaN
1       1997-07-04  11     NaN
2       1997-07-07  11     NaN
3       1997-07-08  11     NaN
4       1997-07-10  11     NaN
... ... ... ...
377395  2017-12-22  5757    -0.046651
377396  2017-12-26  5757    -0.017728
377397  2017-12-27  5757    0.024860
377398  2017-12-28  5757    0.016094
377399  2017-12-29  5757    -0.052789
377400 rows × 3 columns

I've a smaller dataframe as shown below:

df2:
              Date         Code
0           2009-03-17       11
1           2010-02-03       11
2           2011-02-14      363
3           2015-01-09      363
4           2010-10-15      365
...                ...      ...
9516        2015-02-24   449479
9517        2015-09-01   449479
9518        2016-04-01   449479
9519        2013-06-21   452095
9520        2015-05-06   553720

[9521 rows x 2 columns]

I want to compare columns 'Date' and 'Code' of each dataframe and whether a row in df1 has the same 'Date' and 'Code' as in a given row of df2 simultaneously. Based on that, I want to create a new column in df1 which states 'True' if the above mentioned condition is satisfied and 'false' if not satisfied. How can it be done fast (not using loops is preferred as it takes a lot of time)?

PS All elements in a row from df2.Date and df2.Code aren't guaranteed to be in a given row of df1.Date and df1.Code. Also, I want all the rows in df1 to remain( only looking to add a new column in df1 stating if the corresponding 'Date' and 'Code' is present in df2 or not). Hence, I'm not looking to merge or do an inner join.

Thus, I want the desired output as:

         Date      Code       ab-ret       Match
0       1997-07-02  11         NaN         False
1       1997-07-04  11         NaN         False
2       1997-07-07  11         NaN         False
3       1997-07-08  11         NaN         False
4       1997-07-10  11         NaN         False
... ... ... ...
377395  2017-12-22  5757    -0.046651      True
377396  2017-12-26  5757    -0.017728      True
377397  2017-12-27  5757    0.024860       True
377398  2017-12-28  5757    0.016094       False
377399  2017-12-29  5757    -0.052789      True
377400 rows × 4 columns

It is a merge operation, use the parameter indicator=True to get a column named '_merge' close to the column 'Match' you want to create. Then just need to convert this column to False/True like in your expected output and drop the _merge column.

df1 = (df1.merge(df2, how='left', indicator=True)
          .assign(Match=lambda x: x['_merge'].eq('both'))
          .drop('_merge', axis=1)
      )

IIUC, you could try also a tuple comparison by pd.DataFrame.set_index() and using pd.DataFrame.isin :

df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())

Example :

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00'), 2: pd.Timestamp('1997-07-07 00:00:00')}, 
   'Code': {0: 11, 1: 13, 2: 14}, 'ab-ret': {0: np.nan, 1: np.nan, 2: np.nan}}
df1=pd.DataFrame(d)
df1
#        Date  Code  ab-ret
#0 1997-07-02    11     NaN
#1 1997-07-04    13     NaN
#2 1997-07-07    14     NaN

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00')}, 
   'Code': {0: 11, 1: 11}, 'ab-ret': {0: np.nan, 1: np.nan}}
df2=pd.DataFrame(d)
df2
#        Date  Code  ab-ret
#0 1997-07-02    11     NaN
#1 1997-07-04    11     NaN

df1['Match']=df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
df1
#        Date  Code  ab-ret  Match
#0 1997-07-02    11     NaN   True
#1 1997-07-04    13     NaN  False
#2 1997-07-07    14     NaN  False

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM