
Compare two dataframes with different size and create a new column in Pandas

I have a large dataframe, as shown below:

df1:
         Date      Code  ab-ret
0       1997-07-02  11     NaN
1       1997-07-04  11     NaN
2       1997-07-07  11     NaN
3       1997-07-08  11     NaN
4       1997-07-10  11     NaN
... ... ... ...
377395  2017-12-22  5757    -0.046651
377396  2017-12-26  5757    -0.017728
377397  2017-12-27  5757    0.024860
377398  2017-12-28  5757    0.016094
377399  2017-12-29  5757    -0.052789
377400 rows × 3 columns

I have a smaller dataframe, as shown below:

df2:
              Date         Code
0           2009-03-17       11
1           2010-02-03       11
2           2011-02-14      363
3           2015-01-09      363
4           2010-10-15      365
...                ...      ...
9516        2015-02-24   449479
9517        2015-09-01   449479
9518        2016-04-01   449479
9519        2013-06-21   452095
9520        2015-05-06   553720

[9521 rows x 2 columns]

I want to compare the 'Date' and 'Code' columns of the two dataframes and check, for each row of df1, whether some row of df2 has both the same 'Date' and the same 'Code'. Based on that, I want to create a new column in df1 that is True if the condition is satisfied and False otherwise. How can this be done quickly? I would prefer to avoid loops, since they take a lot of time.

P.S. Not every ('Date', 'Code') pair in df2 is guaranteed to appear in df1. Also, I want all the rows of df1 to remain; I only want to add a new column to df1 stating whether the corresponding 'Date' and 'Code' are present in df2. Hence, I'm not looking to merge or do an inner join.

Thus, the desired output is:

         Date      Code       ab-ret       Match
0       1997-07-02  11         NaN         False
1       1997-07-04  11         NaN         False
2       1997-07-07  11         NaN         False
3       1997-07-08  11         NaN         False
4       1997-07-10  11         NaN         False
... ... ... ...
377395  2017-12-22  5757    -0.046651      True
377396  2017-12-26  5757    -0.017728      True
377397  2017-12-27  5757    0.024860       True
377398  2017-12-28  5757    0.016094       False
377399  2017-12-29  5757    -0.052789      True
377400 rows × 4 columns

This is a merge operation. Pass indicator=True to get a column named '_merge', which is close to the 'Match' column you want to create. Then convert this column to True/False, as in your expected output, and drop the '_merge' column. A left merge keeps every row of df1, provided the ('Date', 'Code') pairs in df2 are unique; a pair duplicated in df2 would repeat the matching df1 row.

df1 = (df1.merge(df2, how='left', indicator=True)
          .assign(Match=lambda x: x['_merge'].eq('both'))
          .drop('_merge', axis=1)
      )
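As a small worked sketch of the merge approach above: the toy frames below are made-up stand-ins for the question's data, and the `keys` helper name is only illustrative. It assumes the 'Date' columns of both frames share the same dtype, and it de-duplicates the lookup pairs first so that the left merge cannot add rows to df1.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's frames (values are made up for illustration).
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-04", "1997-07-07"]),
    "Code": [11, 13, 14],
    "ab-ret": [np.nan, 0.01, -0.02],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-02", "1997-07-04"]),
    "Code": [11, 11, 11],  # note the duplicated (Date, Code) pair
})

# De-duplicate the lookup keys so the left merge cannot multiply rows of df1.
keys = df2[["Date", "Code"]].drop_duplicates()

result = (
    df1.merge(keys, on=["Date", "Code"], how="left", indicator=True)
       .assign(Match=lambda x: x["_merge"].eq("both"))  # 'both' means the pair exists in df2
       .drop(columns="_merge")
)

print(result)
```

Only the first row matches here: the second row shares its date with a df2 row but has a different 'Code', so it is flagged False.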

IIUC, you could also try a tuple comparison: build a MultiIndex from the key columns with pd.DataFrame.set_index() and test membership with pd.MultiIndex.isin:

df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())

Example:

import numpy as np
import pandas as pd

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00'), 2: pd.Timestamp('1997-07-07 00:00:00')},
   'Code': {0: 11, 1: 13, 2: 14}, 'ab-ret': {0: np.nan, 1: np.nan, 2: np.nan}}
df1=pd.DataFrame(d)
df1
#        Date  Code  ab-ret
#0 1997-07-02    11     NaN
#1 1997-07-04    13     NaN
#2 1997-07-07    14     NaN

d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00')}, 
   'Code': {0: 11, 1: 11}, 'ab-ret': {0: np.nan, 1: np.nan}}
df2=pd.DataFrame(d)
df2
#        Date  Code  ab-ret
#0 1997-07-02    11     NaN
#1 1997-07-04    11     NaN

df1['Match']=df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
df1
#        Date  Code  ab-ret  Match
#0 1997-07-02    11     NaN   True
#1 1997-07-04    13     NaN  False
#2 1997-07-07    14     NaN  False
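A possible variant of the same idea, sketched under the assumption that the key columns have matching dtypes in both frames: pd.MultiIndex.from_frame builds the key index directly from the columns, without set_index() or to_list(). The names `keys`, `left` and `right` are illustrative only. The last lines cross-check the result against the merge-based approach on the same toy data.

```python
import pandas as pd

# Same toy data as the example above, without the 'ab-ret' column.
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-04", "1997-07-07"]),
    "Code": [11, 13, 14],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-02", "1997-07-04"]),
    "Code": [11, 11],
})

keys = ["Date", "Code"]
left = pd.MultiIndex.from_frame(df1[keys])   # one (Date, Code) tuple per df1 row
right = pd.MultiIndex.from_frame(df2[keys])  # the pairs to look up

# Row-wise membership test; the result is aligned with df1's row order.
df1["Match"] = left.isin(right)

# Cross-check against the merge/indicator approach.
merged = df1[keys].merge(
    df2[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
same = (merged["_merge"].eq("both").to_numpy() == df1["Match"].to_numpy()).all()
print(df1)
print("agrees with merge:", same)
```

Unlike the merge, this variant never changes the number of rows in df1, even if df2 contains repeated pairs.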
