Averaging several time series together with a confidence interval (with test code)

Question

This sounds complicated, but a simple plot makes it easy to understand. I have three curves of the cumulative sum of some values over time, which are the blue lines.

I want to average (or otherwise combine in a statistically correct way) the three curves into one smooth curve and add a confidence interval.

I tried one simple solution: combining all the data into one curve, averaging it with the pandas "rolling" function, and computing the rolling standard deviation. I plotted the result as the purple curve with the confidence interval around it.

The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all. There are also sharp jumps in the confidence interval, which is not a good representation of the three separate curves, since they have no jumps in them.

Is there a better way to represent the three different curves as one smooth curve with a nice confidence interval?

I supply test code, tested on Python 3.5.1 with numpy and pandas (don't change the seed, so that you get the same curves).

There are some constraints: increasing the number of points for the "rolling" function isn't a solution for me, because some of my data is too short for that.

Test code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)


## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted =  pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted =  pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted =  pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])


## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
    df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time =  pd.concat([df1_combined_sorted['time'],
    df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)


## creating confidence intervals 
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()


## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
        ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
matplotlib.use('Agg')
plt.show()

Answer 1

First of all, your sample code could be rewritten to make better use of pandas. For example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    df = (pd.concat([times, vals], axis=1)
            .sort_values(by=['time'])
            .reset_index(drop=True))
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1, df2, df3 = map(get_data, [10000, 13000, 4000])
dfs = (df1, df2, df3)

# join 
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16,9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')

    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()

The reason your curves aren't that smooth may be that your rolling window is not large enough. You can increase the window size to get smoother graphs: for example, render(20) gives a smoother curve, and render(30) a smoother one still.
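The window-size trade-off can be seen on a small synthetic series. The sketch below is illustrative only: the series `s`, its noise level, and the window sizes are made up and are not the question's data. A larger window reduces the step-to-step variation of the rolling mean but leaves more leading NaN values, unless `min_periods` is set, which matters when a series is short.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# hypothetical noisy, increasing series standing in for a cumulative curve
s = pd.Series(np.arange(60, dtype=float) + rng.normal(0, 5, size=60))

for window in (5, 10, 20):
    rolled = s.rolling(window).mean()
    # roughness: mean absolute step between consecutive rolling-mean values
    roughness = rolled.diff().abs().mean()
    # a full window is required by default, so window - 1 values are NaN
    print(window, int(rolled.isna().sum()), round(roughness, 3))

# min_periods lets the window start before it is full, so a short series
# still produces values near its beginning
partial = s.rolling(20, min_periods=3).mean()
n_missing = int(partial.isna().sum())
```

With `min_periods=3` only the first two values are missing, at the cost of noisier estimates where the window is only partly filled.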

A better approach, though, might be to impute (interpolate) each df['cumulative'] series over the entire time range and then compute the mean and confidence interval across these series. With that in mind, we can modify the code as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df =  pd.concat([times, vals], axis = 1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1, df2, df3 = map(get_data, [10000, 13000, 4000])
dfs = (df1, df2, df3)

# rename column for later plotting
for i, df in enumerate(dfs):
    df.rename(columns={'cumulative': f'cumulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs,sort=False).sort_index()

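# interpolate each cumulative column linearly; the default method treats
# rows as equally spaced (method='index' would weight by the time values)
df_all.interpolate(inplace=True)

# mean and spread across the three curves at each time point
mean_val = df_all.iloc[:,1:].mean(axis=1)
std_val = df_all.iloc[:,1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val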

fig, ax = plt.subplots(1,1,figsize=(16,9))
df_all.iloc[:,1:4].plot(ax=ax)

plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()

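and we get a plot of the three interpolated curves together with their mean (purple) and the band of two standard deviations around it.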
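Note that mean ± 2·std across the curves describes the spread of the curves rather than a confidence interval for their mean. As a hedged alternative sketch (not the method used in the answer above), each curve can be resampled onto a shared time grid with `np.interp`, and a t-based interval built from the standard error. The helper name `mean_with_ci`, the grid size, and the 95% level are illustrative assumptions; the default `t_crit=4.303` is the two-sided 95% Student-t quantile for 2 degrees of freedom, so it only applies to exactly three curves.

```python
import numpy as np
import pandas as pd

def mean_with_ci(curves, n_grid=200, t_crit=4.303):
    """curves: list of (time, value) array pairs, each sorted by time.
    t_crit: two-sided 95% Student-t quantile for len(curves) - 1 degrees
    of freedom (the default assumes exactly three curves)."""
    # restrict to the time range covered by every curve, so that
    # np.interp never has to extrapolate
    t_min = max(t[0] for t, _ in curves)
    t_max = min(t[-1] for t, _ in curves)
    grid = np.linspace(t_min, t_max, n_grid)
    # one row per curve, one column per grid point
    stacked = np.vstack([np.interp(grid, t, v) for t, v in curves])
    mean = stacked.mean(axis=0)
    # standard error of the mean across curves (sample std, ddof=1)
    sem = stacked.std(axis=0, ddof=1) / np.sqrt(len(curves))
    return pd.DataFrame({'time': grid, 'mean': mean,
                         'lower': mean - t_crit * sem,
                         'upper': mean + t_crit * sem})

# hypothetical data shaped like the question's: random event times,
# random increments, cumulative sum over time
rng = np.random.RandomState(42)
curves = []
for max_val in (10000, 13000, 4000):
    t = np.sort(rng.uniform(0, 1000, size=50))
    v = np.cumsum(rng.randint(0, max_val, size=50))
    curves.append((t, v))

ci = mean_with_ci(curves)
```

The result can be drawn with `plt.plot(ci['time'], ci['mean'])` and `plt.fill_between(ci['time'], ci['lower'], ci['upper'], alpha=.2)`. Because every curve is evaluated on the same grid, the band changes gradually instead of jumping whenever a point from a different curve enters a rolling window. With only three curves the interval is necessarily wide.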


Answer 1
2 · Accepted · 2019-03-28 16:55:34
