Plotting the confidence interval for a linear regression line of a pandas time-series DataFrame

Question

I have a sample time-series dataframe:

import pandas as pd

df = pd.DataFrame({'year':['1990','1991','1992','1993','1994','1995','1996',
                           '1997','1998','1999','2000'],
                   'count':[96,184,148,154,160,149,124,274,322,301,300]})

I want a linear regression line with a confidence interval band around it. I managed to plot the regression line, but I am finding it difficult to add the confidence interval band to the plot. Here is my code for the linear regression plot:

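import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import ticker
from sklearn.linear_model import LinearRegression

# Assumed preprocessing (not shown in the original snippet): the code below
# relies on a DatetimeIndex and on an ordinal-date column named date_ordinal.
df.index = pd.to_datetime(df['year'], format='%Y')
df['date_ordinal'] = df.index.map(pd.Timestamp.toordinal)
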
X = df.date_ordinal.values.reshape(-1,1)
y = df['count'].values.reshape(-1, 1)

reg = LinearRegression()

reg.fit(X, y)

predictions = reg.predict(X.reshape(-1, 1))

fig, ax = plt.subplots()

plt.scatter(X, y, color ='blue',alpha=0.5)

plt.plot(X, predictions,alpha=0.5, color = 'black',label = r'$N$'+ '= {:.2f}t + {:.2e}\n'.format(reg.coef_[0][0],reg.intercept_[0]))


plt.ylabel('count($N$)');
plt.xlabel(r'Year(t)');
plt.legend()


formatter = ticker.ScalarFormatter(useMathText=True)
formatter.set_scientific(True) 
formatter.set_powerlimits((-1,1)) 
ax.yaxis.set_major_formatter(formatter)

# Label every fifth point with its year instead of the raw ordinal value
plt.xticks(ticks = df.date_ordinal[::5], labels = df.index.year[::5])

plt.grid()

plt.show()
plt.clf()

This gives me a nice linear regression plot for the time series.

Problem & Desired output

However, I also need the confidence interval for the regression line, as in the example image (not shown here).

Help on this issue would be highly appreciated.

Answer 1

The problem you are running into is that LinearRegression from sklearn.linear_model does not provide a simple way to obtain the confidence interval.

If you absolutely want to use sklearn.linear_model.LinearRegression, you will have to dive into the methods for calculating a confidence interval yourself. One popular approach is bootstrapping, as was done in this previous answer.
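To make the bootstrapping idea concrete, here is a minimal sketch, not the method from the linked answer. It assumes a percentile bootstrap over resampled (year, count) pairs; the helper name bootstrap_band and its parameters are my own and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def bootstrap_band(x, y, n_boot=1000, ci=95, seed=0):
    """Percentile-bootstrap confidence band for a fitted regression line.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    boot_preds = np.empty((n_boot, n))
    for i in range(n_boot):
        # Resample the observations with replacement and refit the line
        idx = rng.integers(0, n, size=n)
        model = LinearRegression().fit(x[idx].reshape(-1, 1), y[idx])
        # Evaluate every refitted line on the original x grid
        boot_preds[i] = model.predict(x.reshape(-1, 1))
    # Take the central `ci` percent of the refitted lines at each x
    tail = (100 - ci) / 2
    lower = np.percentile(boot_preds, tail, axis=0)
    upper = np.percentile(boot_preds, 100 - tail, axis=0)
    return lower, upper


years = np.arange(1990, 2001)
counts = np.array([96, 184, 148, 154, 160, 149, 124, 274, 322, 301, 300])
lower, upper = bootstrap_band(years, counts)
```

With an existing matplotlib axes ax, something like ax.fill_between(years, lower, upper, alpha=0.2) would then shade the band behind the fitted line. With only 11 points the bootstrap band is fairly noisy, so treat it as a rough guide.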

However, as I interpret your question, you are looking for a way to do this quickly inside a plot command, similar to the screenshot you attached. If your goal is purely visualization, you can simply use the seaborn package, which is also where your example image comes from.

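import seaborn as sns

# The regression needs a numeric x, so convert the year strings first
df['year'] = df['year'].astype(int)
sns.lmplot(x='year', y='count', data=df, fit_reg=True, ci=95, n_boot=1000)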

Here I have spelled out the three parameters of interest, fit_reg, ci, and n_boot, with their default values; they are largely self-explanatory. Refer to the documentation for a full description.

For an ordinary linear fit, seaborn estimates this band by bootstrapping the regression (hence n_boot); it only relies on the statsmodels package for some of its optional model types. So if you want something in between pure visualization and writing the confidence interval function from scratch yourself, I would refer you to statsmodels instead. Specifically, look at the documentation for calculating a confidence interval of an ordinary least squares (OLS) linear regression.

The following code should give you a starting point for using statsmodels with your example:

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.DataFrame({'year':['1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000'],
                   'count':[96,184,148,154,160,149,124,274,322,301,300]})
df['year'] = df['year'].astype(float)

# Fit count = b + m*year by OLS; add_constant supplies the intercept column
X = sm.add_constant(df['year'].values)
ols_model = sm.OLS(df['count'].values, X)
est = ols_model.fit()
# 95% confidence intervals of the coefficients (intercept, slope), not of the line
out = est.conf_int(alpha=0.05, cols=None)

# Raw data as markers, plus the fitted line
fig, ax = plt.subplots()
df.plot(x='year',y='count',linestyle='None',marker='s', ax=ax)
y_pred = est.predict(X)
x_pred = df.year.values
ax.plot(x_pred,y_pred)

# 95% confidence interval of the mean prediction, drawn as dashed bounds
pred = est.get_prediction(X).summary_frame()
ax.plot(x_pred,pred['mean_ci_lower'],linestyle='--',color='blue')
ax.plot(x_pred,pred['mean_ci_upper'],linestyle='--',color='blue')

# Alternative way to plot the fitted line, from the estimated parameters
def line(x,b=0,m=1):
    return m*x+b

ax.plot(x_pred,line(x_pred,est.params[0],est.params[1]),color='blue')

plt.show()

This produces your desired output, and all the underlying values remain accessible via standard statsmodels functions.
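If you would rather see where the mean_ci_lower and mean_ci_upper columns of summary_frame come from, the band can also be computed by hand. This is a rough sketch assuming the textbook closed-form interval for the mean response of a simple linear regression, fit ± t · s · sqrt(1/n + (x − x̄)²/Sxx) with n − 2 degrees of freedom. The helper name mean_ci_band is my own, and the result should agree with the statsmodels columns.

```python
import numpy as np
from scipy import stats


def mean_ci_band(x, y, alpha=0.05):
    """Fitted line and (1 - alpha) confidence band for the mean response.

    Hypothetical helper for illustration; simple (one-predictor) OLS only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    # Least-squares slope and intercept
    slope = np.sum((x - x_bar) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x_bar
    fit = intercept + slope * x
    # Residual standard error with n - 2 degrees of freedom
    s = np.sqrt(np.sum((y - fit) ** 2) / (n - 2))
    # Standard error of the fitted mean at each x
    se_mean = s * np.sqrt(1.0 / n + (x - x_bar) ** 2 / sxx)
    # Two-sided critical value of Student's t distribution
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return fit, fit - t_crit * se_mean, fit + t_crit * se_mean


years = np.arange(1990, 2001)
counts = np.array([96, 184, 148, 154, 160, 149, 124, 274, 322, 301, 300])
fit, lower, upper = mean_ci_band(years, counts)
```

To get a shaded band like the seaborn one instead of dashed lines, ax.fill_between(years, lower, upper, alpha=0.2) on an existing axes is one option. Note that the band is narrowest at the mean year and widens toward both ends.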

Answer 1: 2, accepted, 2021-05-28 15:38:32
