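Pandas join/merge 2 dataframes using date as index
I have 2 large dataframes with dates as the index. As a simplified example, suppose they look like this (the number of rows for a given date in the first dataframe differs from the number in the second):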
DF1:
Date X Y
2000-01-01 x1 y1
2000-01-01 x2 y2
2000-01-02 x3 y3
2000-01-03 x4 y4
2000-01-03 x5 y5
2000-01-03 x6 y6
DF2:
Date X_2 Y_2
2000-01-01 X1 Y1
2000-01-01 X2 Y2
2000-01-01 X3 Y3
2000-01-03 X4 Y4
2000-01-03 X5 Y5
The output should look like this (I want to merge only the data for dates that appear in both dataframes):
Date X Y X_2 Y_2
2000-01-01 x1 y1 X1 Y1
2000-01-01 x2 y2 X2 Y2
2000-01-01 NaN NaN X3 Y3
2000-01-03 x4 y4 X4 Y4
2000-01-03 x5 y5 X5 Y5
2000-01-03 x6 y6 NaN NaN
I have tried different combinations of code, and I keep getting duplicated data like this:
Date X Y X_2 Y_2
2000-01-01 x1 y1 X1 Y1
2000-01-01 x1 y1 X2 Y2
2000-01-01 x1 y1 X3 Y3
2000-01-01 x2 y2 X1 Y1
2000-01-01 x2 y2 X2 Y2
2000-01-01 x2 y2 X3 Y3
For example, I tried result = pd.merge(df1, df2, how='inner', on='Date')
What should I do to get the result I want?
Use cumcount to number the items in each group when grouping by Date:
In [107]: df1['count'] = df1.groupby('Date').cumcount()
In [108]: df1
Out[108]:
Date X Y count
0 2000-01-01 x1 y1 0
1 2000-01-01 x2 y2 1
2 2000-01-02 x3 y3 0
3 2000-01-03 x4 y4 0
4 2000-01-03 x5 y5 1
5 2000-01-03 x6 y6 2
In [109]: df2['count'] = df2.groupby('Date').cumcount()
In [110]: df2
Out[110]:
Date X_2 Y_2 count
0 2000-01-01 X1 Y1 0
1 2000-01-01 X2 Y2 1
2 2000-01-01 X3 Y3 2
3 2000-01-03 X4 Y4 0
4 2000-01-03 X5 Y5 1
With the count column added, you can now merge on Date and count, which gets you close to the desired result:
In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]:
Date X Y count X_2 Y_2
0 2000-01-01 x1 y1 0 X1 Y1
1 2000-01-01 x2 y2 1 X2 Y2
2 2000-01-02 x3 y3 0 NaN NaN
3 2000-01-03 x4 y4 0 X4 Y4
4 2000-01-03 x5 y5 1 X5 Y5
5 2000-01-03 x6 y6 2 NaN NaN
6 2000-01-01 NaN NaN 2 X3 Y3
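To see why the extra count key matters, it may help to compare row counts. Merging on Date alone pairs every df1 row with every df2 row that shares the date, so 2000-01-01 produces 2 x 3 rows and 2000-01-03 produces 3 x 2. The following is a minimal sketch using the sample data from the question; the names on_date and on_both are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-02',
                             '2000-01-03', '2000-01-03', '2000-01-03'],
                    'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                    'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})
df2 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-01',
                             '2000-01-03', '2000-01-03'],
                    'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
                    'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})

# Merging on Date alone: every matching pair of rows is emitted.
on_date = pd.merge(df1, df2, how='inner', on='Date')
print(len(on_date))  # 12

# Number the rows within each date, then merge on both keys.
df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
on_both = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
print(len(on_both))  # 7
```

Because each (Date, count) pair occurs at most once per frame, the second merge cannot multiply rows.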
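The rows you want to remove can be described as those where count equals 0 and X or X_2 is NaN. Assuming the merged DataFrame above has been assigned to result, you can remove those rows with a boolean mask, like this: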
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
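One caveat that may be worth checking against your own data: the count == 0 condition only catches the first row of an unmatched date. If a date that exists in only one frame has several rows, the later rows (count 1, 2, and so on) appear to survive the mask. Below is a small sketch of that situation; the frames left and right and the row x3b are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical data: 2000-01-02 appears twice in left and never in right.
left = pd.DataFrame({'Date': ['2000-01-01', '2000-01-02', '2000-01-02'],
                     'X': ['x1', 'x3', 'x3b']})
right = pd.DataFrame({'Date': ['2000-01-01'], 'X_2': ['X1']})

left['count'] = left.groupby('Date').cumcount()
right['count'] = right.groupby('Date').cumcount()
merged = pd.merge(left, right, on=['Date', 'count'], how='outer')

# Same mask as above: unmatched rows whose count is 0.
mask = (merged['count'] == 0) & pd.isnull(merged).any(axis=1)
filtered = merged.loc[~mask]

# The count == 1 row for 2000-01-02 is still present after filtering.
leftover = filtered.loc[filtered['Date'] == '2000-01-02']
print(leftover)
```

Restricting both frames to their shared dates before merging avoids this case.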
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({'Date': ['2000-01-01',
                             '2000-01-01',
                             '2000-01-02',
                             '2000-01-03',
                             '2000-01-03',
                             '2000-01-03'],
                    'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                    'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})
df2 = pd.DataFrame({'Date': ['2000-01-01',
                             '2000-01-01',
                             '2000-01-01',
                             '2000-01-03',
                             '2000-01-03'],
                    'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
                    'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})

# Number the rows within each date: 0, 1, 2, ...
df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()

# Pair rows that share both the date and the position within that date.
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')

# Drop unmatched rows whose count is 0, then drop the helper column.
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)
yields (the row order and index labels may vary with the pandas version)
Date X Y X_2 Y_2
0 2000-01-01 x1 y1 X1 Y1
1 2000-01-01 x2 y2 X2 Y2
3 2000-01-03 x4 y4 X4 Y4
4 2000-01-03 x5 y5 X5 Y5
5 2000-01-03 x6 y6 NaN NaN
6 2000-01-01 NaN NaN X3 Y3
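Another way to restrict the merge to only those dates that are common to both df1 and df2 is to first find the intersection of df1['Date'] and df2['Date'], and then apply pd.merge to the sub-DataFrames of df1 and df2 that contain only those dates: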
import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')
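Putting the pieces together, here is one possible end-to-end sketch of the intersection approach. The helper name merge_common_dates is hypothetical (it is not a pandas function), and it assumes that Date is a regular column and that neither frame already has a column named count. If the dates are the index, as in the question, calling reset_index() first should turn them into a column, provided the index is named Date.

```python
import numpy as np
import pandas as pd


def merge_common_dates(df1, df2, key='Date'):
    """Pair rows by position within each date shared by both frames.

    Hypothetical helper for illustration; assumes no existing 'count' column.
    """
    # Work on copies so the caller's frames are not modified.
    df1, df2 = df1.copy(), df2.copy()
    # Number the rows within each date: 0, 1, 2, ...
    df1['count'] = df1.groupby(key).cumcount()
    df2['count'] = df2.groupby(key).cumcount()
    # Keep only the dates that occur in both frames.
    dates = np.intersect1d(df1[key], df2[key])
    sub1 = df1.loc[df1[key].isin(dates)]
    sub2 = df2.loc[df2[key].isin(dates)]
    result = pd.merge(sub1, sub2, on=[key, 'count'], how='outer')
    # Sort explicitly so the row order does not depend on merge internals.
    result = result.sort_values([key, 'count']).reset_index(drop=True)
    return result.drop(columns='count')


df1 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-02',
                             '2000-01-03', '2000-01-03', '2000-01-03'],
                    'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                    'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})
df2 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-01',
                             '2000-01-03', '2000-01-03'],
                    'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
                    'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})

result = merge_common_dates(df1, df2)
print(result)
```

Filtering on the shared dates up front means no count-based mask is needed afterwards, and the 2000-01-02 rows never enter the merge.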