
Get the difference between two dataframes with different time series

I have two dataframes (df1 and df2) in the following format. df1 holds simulation results, so it is more densely populated in time (one value at the beginning of each month). df2 holds actual observed data, so fewer data points are available (only whenever data was collected). df1 and df2 have different time series (timesteps), and both are compiled per location.

Sample data:

import pandas as pd

df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})

df1:

          Date  Location   Sim
0   2018-02-01         1  3253
1   2018-03-01         1  3078
2   2018-04-01         1  3222
3   2018-05-01         1  3940
4   2018-06-01         1  3665
5   2018-07-01         1  3856
6   2018-02-01         2  3775
7   2018-03-01         2  3658
8   2018-04-01         2  3056
9   2018-05-01         2  3993
10  2018-06-01         2  3240
11  2018-07-01         2  3054
12  2018-02-01         3  3162
13  2018-03-01         3  3193
14  2018-04-01         3  3627
15  2018-05-01         3  3740
16  2018-06-01         3  3042
17  2018-07-01         3  3569

df2:

          Date  Location  Observed
0   2018-02-10         1      3668
1   2018-03-18         1      3102
2   2018-04-15         1      3128
3   2018-05-11         1      3485
4   2018-06-12         1      3926
5   2018-07-11         1      3344
6   2018-02-22         2      3134
7   2018-03-31         2      3258
8   2018-04-02         2      3833
9   2018-05-06         2      3883
10  2018-06-30         2      3122
11  2018-07-21         2      3417
12  2018-02-03         3      3551
13  2018-03-04         3      3971
14  2018-04-01         3      3294
15  2018-05-03         3      3207
16  2018-06-05         3      3803
17  2018-07-25         3      3250

(Image: example plot of the desired result)

I am looking for an end result like the plot above. For each 'Location', resample the dates of the 'Sim' data to daily frequency and then interpolate (or extrapolate, if necessary) linearly. Calculate Delta (Delta = Observed - Sim) only on dates for which 'Observed' data is available. Then, for each 'Location', produce a plot similar to the one attached above.

My thinking is to use df.groupby to group by 'Location', then Series.resample to bring the Sim column of df1 to daily frequency, interpolate df1 linearly at that daily frequency, calculate Delta on the Observed dates, and then plot the results.
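The plan above can also be sketched without building a daily grid at all, by evaluating the simulated series directly at the observed dates. The sketch below is one possible approach, not the method used in the answers: the helper names `interp_extrap` and `delta_on_observed` are hypothetical, and the data is only Location 1 from the sample. `np.interp` holds the end values flat outside the simulated range, so the linear extrapolation is added by hand from the slope of the first and last segments.

```python
import numpy as np
import pandas as pd

def interp_extrap(x, xp, fp):
    """Linear interpolation inside [xp[0], xp[-1]], linear extrapolation outside.

    Hypothetical helper; needs at least two sorted points in xp.
    """
    x, xp, fp = (np.asarray(a, dtype=float) for a in (x, xp, fp))
    y = np.interp(x, xp, fp)  # flat beyond the end points
    left_slope = (fp[1] - fp[0]) / (xp[1] - xp[0])
    right_slope = (fp[-1] - fp[-2]) / (xp[-1] - xp[-2])
    y = np.where(x < xp[0], fp[0] + left_slope * (x - xp[0]), y)
    y = np.where(x > xp[-1], fp[-1] + right_slope * (x - xp[-1]), y)
    return y

def delta_on_observed(sim, obs):
    """Add the simulated value at each observed date and Delta = Observed - Sim."""
    parts = []
    for loc, o in obs.groupby("Location"):
        s = sim[sim["Location"] == loc].sort_values("Date")
        # Work in whole days so the interpolation is linear in time
        xp = s["Date"].map(pd.Timestamp.toordinal)
        x = o["Date"].map(pd.Timestamp.toordinal)
        o = o.assign(Sim=interp_extrap(x, xp, s["Sim"]))
        parts.append(o.assign(Delta=o["Observed"] - o["Sim"]))
    return pd.concat(parts)

# Location 1 from the sample data
sim = pd.DataFrame({
    "Date": pd.to_datetime(["2018-02-01", "2018-03-01", "2018-04-01",
                            "2018-05-01", "2018-06-01", "2018-07-01"]),
    "Location": 1,
    "Sim": [3253, 3078, 3222, 3940, 3665, 3856],
})
obs = pd.DataFrame({
    "Date": pd.to_datetime(["2018-02-10", "2018-03-18", "2018-04-15",
                            "2018-05-11", "2018-06-12", "2018-07-11"]),
    "Location": 1,
    "Observed": [3668, 3102, 3128, 3485, 3926, 3344],
})

result = delta_on_observed(sim, obs)
print(result)
```

For the first observation (2018-02-10), the simulated value is 3253 + (3078 - 3253) * 9/28 = 3196.75, so Delta is 471.25. The last observation (2018-07-11) lies after the last simulated date and is extrapolated along the June-to-July segment.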

For the first part of your problem, you could concatenate the two dataframes, then interpolate, and then filter the result according to the first time series.

df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})

df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index('Date')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('Date')

Then, group by location, fill missing values, and interpolate:

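# Resample each location's Sim series to daily frequency (NaN between the monthly points)
df1_daily = df1.groupby('Location')['Sim'].resample('D').mean().to_frame()
# Fill the gaps linearly; every location starts and ends on a real value,
# so no interpolated value spans two locations
df1_daily['Sim'] = df1_daily['Sim'].interpolate(method='linear')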
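To see what the resample-then-interpolate step does in isolation, here is a minimal sketch on a toy series. The two values and dates are made up for illustration and are not part of the sample data.

```python
import pandas as pd

# Toy series: two known points ten days apart (illustrative values only)
s = pd.Series(
    [0.0, 10.0],
    index=pd.to_datetime(["2018-02-01", "2018-02-11"]),
    name="Sim",
)

# resample('D') inserts one row per day (NaN where nothing was recorded);
# interpolate(method='linear') then fills those rows assuming equal spacing
daily = s.resample("D").mean().interpolate(method="linear")

print(len(daily))
print(daily.loc[pd.Timestamp("2018-02-06")])
```

The daily series has 11 rows, and the value halfway between the two points (2018-02-06) is 5.0. Because method='linear' treats rows as equally spaced, it is only linear in time when the index is already regular, which is why the resample comes first.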

Prepare the merge, then merge:

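# Index df2 by (Location, Date) so it lines up with df1_daily's MultiIndex
df2_grouped = df2.set_index(['Location', df2.index])
# Left join: keep every daily Sim row; Observed is NaN on days without an observation
merge = df1_daily.merge(right=df2_grouped, left_index=True, right_index=True, how='left')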
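One thing to watch with a left join from the daily simulated frame: in the sample data the last simulated date is 2018-07-01, while each location has an observation later in July (2018-07-11, 2018-07-21, 2018-07-25), and those rows have no matching daily row. The sketch below uses a small made-up frame (names `sim_daily` and `obs` and all values are illustrative assumptions) to show how joining from the observed side makes such rows visible instead of dropping them.

```python
import pandas as pd

# Toy daily simulated frame indexed by (Location, Date); values are illustrative
idx = pd.MultiIndex.from_product(
    [[1], pd.date_range("2018-06-29", "2018-07-01", freq="D")],
    names=["Location", "Date"],
)
sim_daily = pd.DataFrame({"Sim": [3843.0, 3850.0, 3856.0]}, index=idx)

# Toy observations: one inside the simulated range, one after it
obs = pd.DataFrame({
    "Location": [1, 1],
    "Date": pd.to_datetime(["2018-06-30", "2018-07-11"]),
    "Observed": [3900, 3344],
}).set_index(["Location", "Date"])

# Left join keeps every simulated day; the 2018-07-11 observation has no row to land on
left = sim_daily.merge(obs, left_index=True, right_index=True, how="left")

# Right join keeps every observation; Sim is NaN where the simulation does not reach
right = sim_daily.merge(obs, left_index=True, right_index=True, how="right")
missing = right[right["Sim"].isna()]

print(left)
print(missing)
```

Rows listed in `missing` are the ones that would need extrapolation before a Delta can be computed for them.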

Finally:

merge['Delta'] = merge.Observed - merge.Sim
merge[['Observed', 'Sim', 'Delta']].groupby('Location').plot.line(marker='o', ms=2)

(Images: the three resulting plots, one per location)

I would suggest constructing a single dataframe using Series and then interpolating it:

Observed= {0: 3668, 1: 3102, 2: 3128, 3: 3485, 4: 3926, 5: 3344, 6: 3134, 7: 3258, 8: 3833, 9: 3883, 10: 3122, 11: 3417, 12: 3551, 13: 3971, 14: 3294, 15: 3207, 16: 3803, 17: 3250}

y1 = pd.Series(Observed)  # the dict keys 0..17 become the index

df = pd.DataFrame({'Date': {0: '2018-02-01', 1: '2018-03-01', 2: '2018-04-01', 3: '2018-05-01', 4: '2018-06-01', 5: '2018-07-01', 6: '2018-02-01', 7: '2018-03-01', 8: '2018-04-01', 9: '2018-05-01', 10: '2018-06-01', 11: '2018-07-01', 12: '2018-02-01', 13: '2018-03-01', 14: '2018-04-01', 15: '2018-05-01', 16: '2018-06-01', 17: '2018-07-01'}, 'Location': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3}, 
                   'Sim': {0: 3253, 1: 3078, 2: 3222, 3: 3940, 4: 3665, 5: 3856, 6: 3775, 7: 3658, 8: 3056, 9: 3993, 10: 3240, 11: 3054, 12: 3162, 13: 3193, 14: 3627, 15: 3740, 16: 3042, 17: 3569},
                   'Observed':Observed})


df.interpolate('index').reindex(Observed)
