Compare two dataframes of different sizes and create a new column in Pandas

Question

I have a large dataframe, shown below:

df1:
         Date      Code  ab-ret
0       1997-07-02  11     NaN
1       1997-07-04  11     NaN
2       1997-07-07  11     NaN
3       1997-07-08  11     NaN
4       1997-07-10  11     NaN
... ... ... ...
377395  2017-12-22  5757    -0.046651
377396  2017-12-26  5757    -0.017728
377397  2017-12-27  5757    0.024860
377398  2017-12-28  5757    0.016094
377399  2017-12-29  5757    -0.052789
377400 rows × 3 columns

I have a smaller dataframe, shown below:

df2:
              Date         Code
0           2009-03-17       11
1           2010-02-03       11
2           2011-02-14      363
3           2015-01-09      363
4           2010-10-15      365
...                ...      ...
9516        2015-02-24   449479
9517        2015-09-01   449479
9518        2016-04-01   449479
9519        2013-06-21   452095
9520        2015-05-06   553720

[9521 rows x 2 columns]

I want to compare the 'Date' and 'Code' columns of the two dataframes and check whether a row in df1 has the same 'Date' and 'Code' as some row of df2, with both values matching at the same time. Based on that, I want to create a new column in df1 that is True if this condition is satisfied and False if it is not. How can this be done quickly (preferably without loops, since they take a lot of time)?

PS: Not every ('Date', 'Code') pair in df2 is guaranteed to appear in df1. Also, I want all the rows of df1 to remain; I only want to add a new column to df1 stating whether the corresponding 'Date' and 'Code' are present in df2. Hence, I'm not looking to merge or do an inner join.

Thus, the desired output is:

         Date      Code       ab-ret       Match
0       1997-07-02  11         NaN         False
1       1997-07-04  11         NaN         False
2       1997-07-07  11         NaN         False
3       1997-07-08  11         NaN         False
4       1997-07-10  11         NaN         False
... ... ... ...
377395  2017-12-22  5757    -0.046651      True
377396  2017-12-26  5757    -0.017728      True
377397  2017-12-27  5757    0.024860       True
377398  2017-12-28  5757    0.016094       False
377399  2017-12-29  5757    -0.052789      True
377400 rows × 4 columns

Answer 1

This is a merge operation. Use the parameter indicator=True to get a column named '_merge', which is close to the 'Match' column you want to create. Then convert this column to True/False as in your expected output and drop the '_merge' column.

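# how='left' keeps every row of df1; indicator=True adds a '_merge' column
# that is 'both' where the same (Date, Code) pair also exists in df2.
# drop_duplicates() stops repeated pairs in df2 from multiplying rows of df1.
df1 = (df1.merge(df2.drop_duplicates(), how='left', indicator=True)
          .assign(Match=lambda x: x['_merge'].eq('both'))
          .drop('_merge', axis=1)
      )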
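
To check the merge approach on something small, here is a minimal sketch on made-up toy data (the values and the names keys, lookup and out are illustrative assumptions, not part of the question). It passes the key columns explicitly with on= and removes duplicate key pairs from df2 first, so the result keeps exactly one row per row of df1.

```python
import pandas as pd

# Toy data, invented for illustration only.
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-26", "2017-12-27", "2017-12-28"]),
    "Code": [5757, 5757, 5757],
    "ab-ret": [-0.017728, 0.024860, 0.016094],
})
df2 = pd.DataFrame({
    # The first pair is repeated on purpose to show why de-duplication matters.
    "Date": pd.to_datetime(["2017-12-26", "2017-12-26", "2017-12-27"]),
    "Code": [5757, 5757, 5757],
})

keys = ["Date", "Code"]
# Keep only the key columns and drop repeated pairs so the left merge
# cannot produce more rows than df1 has.
lookup = df2[keys].drop_duplicates()

out = (
    df1.merge(lookup, on=keys, how="left", indicator=True)
       .assign(Match=lambda x: x["_merge"].eq("both"))
       .drop(columns="_merge")
)
print(out)
```

Without the drop_duplicates() step, the repeated (2017-12-26, 5757) pair in df2 would make the first row of df1 appear twice in the result.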

Answer 2

IIUC, you could also try a tuple comparison: build a MultiIndex from both columns with pd.DataFrame.set_index() and test membership with the index's isin method (pd.Index.isin):

df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())

Example:

import numpy as np
import pandas as pd

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00'), 2: pd.Timestamp('1997-07-07 00:00:00')},
   'Code': {0: 11, 1: 13, 2: 14}, 'ab-ret': {0: np.nan, 1: np.nan, 2: np.nan}}
df1=pd.DataFrame(d)
df1
#        Date  Code  ab-ret
#0 1997-07-02    11     NaN
#1 1997-07-04    13     NaN
#2 1997-07-07    14     NaN

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00')}, 
   'Code': {0: 11, 1: 11}, 'ab-ret': {0: np.nan, 1: np.nan}}
df2=pd.DataFrame(d)
df2
#        Date  Code  ab-ret
#0 1997-07-02    11     NaN
#1 1997-07-04    11     NaN

df1['Match']=df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
df1
#        Date  Code  ab-ret  Match
#0 1997-07-02    11     NaN   True
#1 1997-07-04    13     NaN  False
#2 1997-07-07    14     NaN  False
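
One thing that might trip this up is a dtype mismatch: the comparison is on exact values, so a 'Date' stored as strings in one frame will not match Timestamps in the other. The sketch below assumes that situation (the question does not show the dtypes, so this is only a guess) and normalises the column first. It also builds the index with pd.MultiIndex.from_frame, a variant that avoids set_index(); the names left and right are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["1997-07-02", "1997-07-04", "1997-07-07"],  # strings (assumed)
    "Code": [11, 13, 14],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-04"]),  # Timestamps
    "Code": [11, 11],
})

# Bring both 'Date' columns to the same dtype before comparing.
df1["Date"] = pd.to_datetime(df1["Date"])

# Each (Date, Code) row becomes one entry of a MultiIndex.
left = pd.MultiIndex.from_frame(df1[["Date", "Code"]])
right = pd.MultiIndex.from_frame(df2[["Date", "Code"]])

# True where the whole (Date, Code) pair of df1 also occurs in df2.
df1["Match"] = left.isin(right)
print(df1)
```

Only the first row matches here, because the second row shares its date with df2 but not its code.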

Answer 1
1  Accepted  2020-07-29 14:45:58

Answer 2
1  2020-07-29 14:59:56
