
Pandas join/merge 2 dataframes using date as index

I have 2 big dataframes with date as index. To simplify, let's say that they look like this (the number of rows for a particular date in the first dataframe is not the same as in the second):

df1:

      Date    X    Y
2000-01-01   x1   y1
2000-01-01   x2   y2
2000-01-02   x3   y3
2000-01-03   x4   y4
2000-01-03   x5   y5
2000-01-03   x6   y6

df2:

      Date  X_2  Y_2
2000-01-01   X1   Y1
2000-01-01   X2   Y2
2000-01-01   X3   Y3
2000-01-03   X4   Y4
2000-01-03   X5   Y5

The output should look like this (I want to merge only the data for dates which appear in both dataframes):

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01  NaN  NaN   X3   Y3
2000-01-03   x4   y4   X4   Y4
2000-01-03   x5   y5   X5   Y5
2000-01-03   x6   y6  NaN  NaN

I've tried different code combinations and I keep getting duplicated data like this:

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x1   y1   X2   Y2
2000-01-01   x1   y1   X3   Y3
2000-01-01   x2   y2   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01   x2   y2   X3   Y3

I've tried, e.g., result = pd.merge(df1, df2, how='inner', on='Date'). What should I do to get the result I want?

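Merging on Date alone pairs every row of df1 with every row of df2 that shares the same date, which is where the duplicated rows come from. Use cumcount to number the items in each group when grouped by Date: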

In [107]: df1['count'] = df1.groupby('Date').cumcount()

In [108]: df1
Out[108]: 
         Date   X   Y  count
0  2000-01-01  x1  y1      0
1  2000-01-01  x2  y2      1
2  2000-01-02  x3  y3      0
3  2000-01-03  x4  y4      0
4  2000-01-03  x5  y5      1
5  2000-01-03  x6  y6      2

In [109]: df2['count'] = df2.groupby('Date').cumcount()

In [110]: df2
Out[110]: 
         Date X_2 Y_2  count
0  2000-01-01  X1  Y1      0
1  2000-01-01  X2  Y2      1
2  2000-01-01  X3  Y3      2
3  2000-01-03  X4  Y4      0
4  2000-01-03  X5  Y5      1

By adding the count column, you can now merge on both Date and count, which gets you close to the result you want:

In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]: 
         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
2  2000-01-02   x3   y3      0  NaN  NaN
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

The rows that you wish to remove can be characterized as those where count equals 0 and X or X_2 is NaN. Therefore, with the merged DataFrame assigned to result, you could remove those rows with a boolean mask like this:

mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]

Putting it all together:

import pandas as pd

df1 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-02',
  '2000-01-03',
  '2000-01-03',
  '2000-01-03'],
 'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
 'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})

df2 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-01',
  '2000-01-03',
  '2000-01-03'],
 'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
 'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})


df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)

yields

         Date    X    Y  X_2  Y_2
0  2000-01-01   x1   y1   X1   Y1
1  2000-01-01   x2   y2   X2   Y2
3  2000-01-03   x4   y4   X4   Y4
4  2000-01-03   x5   y5   X5   Y5
5  2000-01-03   x6   y6  NaN  NaN
6  2000-01-01  NaN  NaN   X3   Y3
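
The steps above could also be wrapped in a small helper that leaves df1 and df2 unmodified. This is only a sketch: the function name merge_by_occurrence and the temporary column name _n are hypothetical, and it assumes Date is an ordinary column in both frames. It filters out unshared dates before merging and sorts the result so that the rows come out in date order.

```python
import pandas as pd


def merge_by_occurrence(left, right, key="Date"):
    """Pair the n-th row of each key in `left` with the n-th row in `right`.

    Hypothetical helper (not part of pandas); keys that occur in only one
    frame are dropped.
    """
    # Number the rows within each key group without mutating the inputs.
    left = left.assign(_n=left.groupby(key).cumcount())
    right = right.assign(_n=right.groupby(key).cumcount())
    # Keep only keys that occur in both frames.
    left = left[left[key].isin(right[key])]
    right = right[right[key].isin(left[key])]
    merged = pd.merge(left, right, on=[key, "_n"], how="outer")
    # Sort explicitly so the row order does not depend on merge internals.
    return (merged.sort_values([key, "_n"])
                  .drop(columns="_n")
                  .reset_index(drop=True))


df1 = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-01", "2000-01-02",
             "2000-01-03", "2000-01-03", "2000-01-03"],
    "X": ["x1", "x2", "x3", "x4", "x5", "x6"],
    "Y": ["y1", "y2", "y3", "y4", "y5", "y6"],
})
df2 = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-01", "2000-01-01",
             "2000-01-03", "2000-01-03"],
    "X_2": ["X1", "X2", "X3", "X4", "X5"],
    "Y_2": ["Y1", "Y2", "Y3", "Y4", "Y5"],
})

result = merge_by_occurrence(df1, df2)
print(result)
```

Because the helper works on copies created by assign, the count bookkeeping never leaks into the original frames.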

Another way to restrict the merge to only those dates which are common to both df1 and df2 would be to find the intersection of df1['Date'] and df2['Date'] first, and then apply pd.merge to sub-DataFrames of df1 and df2 which contain only those dates:

import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')
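
The question describes Date as the index, whereas the code above treats it as a column. If Date really is the index, the same idea might be sketched as follows. The helper add_counter and the level name n are hypothetical names chosen for illustration. The per-date counter is appended as a second index level, the two frames are joined on the resulting MultiIndex, and the counter is dropped afterwards.

```python
import pandas as pd

# Same sample data as above, but with Date as the index.
df1 = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-01", "2000-01-02",
             "2000-01-03", "2000-01-03", "2000-01-03"],
    "X": ["x1", "x2", "x3", "x4", "x5", "x6"],
    "Y": ["y1", "y2", "y3", "y4", "y5", "y6"],
}).set_index("Date")
df2 = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-01", "2000-01-01",
             "2000-01-03", "2000-01-03"],
    "X_2": ["X1", "X2", "X3", "X4", "X5"],
    "Y_2": ["Y1", "Y2", "Y3", "Y4", "Y5"],
}).set_index("Date")

# Restrict both frames to the dates they share.
common = df1.index.unique().intersection(df2.index.unique())
left = df1[df1.index.isin(common)]
right = df2[df2.index.isin(common)]


def add_counter(frame):
    """Append a per-date row counter as a second index level (hypothetical helper)."""
    n = frame.groupby(level=0).cumcount().rename("n")
    return frame.set_index(n, append=True)


# Join on (Date, n), then drop the helper level again.
result = (add_counter(left)
          .join(add_counter(right), how="outer")
          .sort_index()
          .droplevel("n"))
print(result)
```

Alternatively, calling reset_index() on both frames first would let the column-based code above be reused unchanged.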
