繁体   English   中英

Pandas使用日期作为索引加入/合并2个数据帧

[英]Pandas join/merge 2 dataframes using date as index

我有2个大数据帧,日期为索引。 简单地说一个例子,假设它们看起来像这样(第一个数据帧中特定日期的数据数与第二个数据帧中的数据不同):

DF1:

      Date    X    Y
2000-01-01   x1   y1
2000-01-01   x2   y2
2000-01-02   x3   y3
2000-01-03   x4   y4
2000-01-03   x5   y5
2000-01-03   x6   y6

DF2:

      Date  X_2  Y_2
2000-01-01   X1   Y1
2000-01-01   X2   Y2
2000-01-01   X3   Y3
2000-01-03   X4   Y4
2000-01-03   X5   Y5

输出应该如下所示(我想只合并两个数据帧中出现日期的数据):

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01  NaN  NaN   X3   Y3
2000-01-03   x4   y4   X4   Y4
2000-01-03   x5   y5   X5   Y5
2000-01-03   x6   y6  NaN  NaN

我尝试了不同的代码组合,并且我不断得到像这样的重复数据:

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x1   y1   X2   Y2
2000-01-01   x1   y1   X3   Y3
2000-01-01   x2   y2   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01   x2   y2   X3   Y3

我试过例如result = pd.merge(df1,df2, how='inner', on='Date')怎样做才能得到我想要的结果?

Date分组时,使用cumcount对每个组中的项目进行编号:

In [107]: df1['count'] = df1.groupby('Date').cumcount()

In [108]: df1
Out[108]: 
         Date   X   Y  count
0  2000-01-01  x1  y1      0
1  2000-01-01  x2  y2      1
2  2000-01-02  x3  y3      0
3  2000-01-03  x4  y4      0
4  2000-01-03  x5  y5      1
5  2000-01-03  x6  y6      2

In [109]: df2['count'] = df2.groupby('Date').cumcount()

In [110]: df2
Out[110]: 
         Date X_2 Y_2  count
0  2000-01-01  X1  Y1      0
1  2000-01-01  X2  Y2      1
2  2000-01-01  X3  Y3      2
3  2000-01-03  X4  Y4      0
4  2000-01-03  X5  Y5      1

通过添加count列,您现在可以合并Datecount ,使您接近所需的结果:

In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]: 
         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
2  2000-01-02   x3   y3      0  NaN  NaN
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

您要删除的行可以表示为count等于0且X或X_2等于NaN的行。 因此,您可以使用布尔掩码删除这些行,如下所示:

mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]

import pandas as pd

df1 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-02',
  '2000-01-03',
  '2000-01-03',
  '2000-01-03'],
 'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
 'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})

df2 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-01',
  '2000-01-03',
  '2000-01-03'],
 'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
 'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})


df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)

产量

         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

将合并限制为仅df1df2共同的那些日期的另一种方法是首先找到df1['Date']df2['Date'] pd.merge ,然后将pd.merge应用于子数据框架df1df2只包含那些日期:

import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM