获得两个具有不同时间序列的数据帧之间的差异

Question

I have 2 dataframes (df1 and df2) with following format. 我有2个数据帧（df1和df2），格式如下。 df1 is a simulation results. df1是模拟结果。 Hence, df1 is more densely populated timesteps wise (beginning of each monthly). 因此，df1是更加密集的时间步长（每个月的开始）。 df2 is actual observed data. df2是实际观察到的数据。 Hence less available data (whenever is collected). 因此可用的数据较少（无论何时收集）。 Both df1 and df2 have different time series (timesteps) and are compiled for each location basis. df1和df2都有不同的时间序列（时间步长），并根据每个位置进行编译。

Sample data 样本数据

df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})

df1: DF1：

    Date    Location    Sim
0   2018-02-01  1   3253
1   2018-03-01  1   3078
2   2018-04-01  1   3222
3   2018-05-01  1   3940
4   2018-06-01  1   3665
5   2018-07-01  1   3856
6   2018-02-01  2   3775
7   2018-03-01  2   3658
8   2018-04-01  2   3056
9   2018-05-01  2   3993
10  2018-06-01  2   3240
11  2018-07-01  2   3054
12  2018-02-01  3   3162
13  2018-03-01  3   3193
14  2018-04-01  3   3627
15  2018-05-01  3   3740
16  2018-06-01  3   3042
17  2018-07-01  3   3569

df2: DF2：

    Date    Location    Observed
0   2018-02-10  1   3668
1   2018-03-18  1   3102
2   2018-04-15  1   3128
3   2018-05-11  1   3485
4   2018-06-12  1   3926
5   2018-07-11  1   3344
6   2018-02-22  2   3134
7   2018-03-31  2   3258
8   2018-04-02  2   3833
9   2018-05-06  2   3883
10  2018-06-30  2   3122
11  2018-07-21  2   3417
12  2018-02-03  3   3551
13  2018-03-04  3   3971
14  2018-04-01  3   3294
15  2018-05-03  3   3207
16  2018-06-05  3   3803
17  2018-07-25  3   3250

I am looking for end results as picture/plot above. 我正在寻找上面的图片/情节的最终结果。 For each 'Location', resample the dates in 'Sim' data to daily freq and then interpolate or extrapolate (if necessary) linearly. 对于每个“位置”，将“Sim”数据中的日期重新采样为每日频率，然后线性插值或外推（如果需要）。 Calculate the Delta (Delta=Observed - Sim) only on dates when 'Observed' data is available. 仅在“观察”数据可用的日期计算Delta（Delta = Observed - Sim）。 Again for each 'Location' a plot similar to one attached above. 同样对于每个“位置”，一个类似于上面附带的图。

My thinking is to use df.groupby method to group each 'Location', series.resample to daily for Sim column in df1. 我的想法是使用df.groupby方法将每个'Location'，series.resample分组为df1中的Sim列的每日。 Interpolate linearly df1 daily freq. 每日线性插入df1频率。 Calculate the Delta on dates of Observed. 计算观察日期的Delta。 And then plot them up. 然后策划它们。

Answer 1

For the 1st part of your problem, you could concatenate your 2 dataframes, then interpolate and then filter the result according to the 1st time series. 对于问题的第一部分，您可以连接2个数据帧，然后进行插值，然后根据第1个时间序列过滤结果。

df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})

df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index('Date')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('Date')

Then, groupby, fill missing values & interpolate: 然后，groupby，填充缺失值并插入：

df1_daily = df1.groupby('Location').resample('D').mean()
df1_daily['Location'] = df1_daily.Location.fillna(method='pad')
df1_daily['Sim'] = df1_daily.Sim.interpolate(method='linear')

Prepare merge &... merge: 准备合并＆...合并：

df2_grouped = df2.set_index(['Location',df2.index])
merge = df1_daily.merge(right=df2_grouped, left_index=True, right_index=True, how='left')#.sort_index()

Finally: 最后：

merge['Delta'] = merge.Observed - merge.Sim
merge[['Observed', 'Sim', 'Delta']].groupby('Location').plot.line(marker='o', ms=2)

Answer 2

I would suggest to construct a single dataframe using Series and then interpolate it 我建议使用Series构建一个数据帧，然后对其进行插值

Observed= {0: 3668, 1: 3102, 2: 3128, 3: 3485, 4: 3926, 5: 3344, 6: 3134, 7: 3258, 8: 3833, 9: 3883, 10: 3122, 11: 3417, 12: 3551, 13: 3971, 14: 3294, 15: 3207, 16: 3803, 17: 3250}

y1 = pd.Series(Observed, index=Observed)

df = pd.DataFrame({'Date': {0: '2018-02-01', 1: '2018-03-01', 2: '2018-04-01', 3: '2018-05-01', 4: '2018-06-01', 5: '2018-07-01', 6: '2018-02-01', 7: '2018-03-01', 8: '2018-04-01', 9: '2018-05-01', 10: '2018-06-01', 11: '2018-07-01', 12: '2018-02-01', 13: '2018-03-01', 14: '2018-04-01', 15: '2018-05-01', 16: '2018-06-01', 17: '2018-07-01'}, 'Location': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3}, 
                   'Sim': {0: 3253, 1: 3078, 2: 3222, 3: 3940, 4: 3665, 5: 3856, 6: 3775, 7: 3658, 8: 3056, 9: 3993, 10: 3240, 11: 3054, 12: 3162, 13: 3193, 14: 3627, 15: 3740, 16: 3042, 17: 3569},
                   'Observed':Observed})


df.interpolate('index').reindex(Observed)

获得两个具有不同时间序列的数据帧之间的差异

问题描述

2 个解决方案

解决方案1
0 2019-04-11 09:10:41

解决方案2
0 2019-04-11 09:21:48

获得两个具有不同时间序列的数据帧之间的差异

问题描述

2 个解决方案

解决方案1 0 2019-04-11 09:10:41

解决方案2 0 2019-04-11 09:21:48

解决方案1
0 2019-04-11 09:10:41

解决方案2
0 2019-04-11 09:21:48