Plotting confidence interval for linear regression line of a pandas time-series Dataframe
I have a sample time-series dataframe:
df = pd.DataFrame({'year':['1990','1991','1992','1993','1994','1995','1996',
                           '1997','1998','1999','2000'],
                   'count':[96,184,148,154,160,149,124,274,322,301,300]})
I want a linear regression line with a confidence interval band around the regression line. Although I managed to plot a linear regression line, I am finding it difficult to plot the confidence interval band in the plot. Here is the snippet of my code for the linear regression plot:
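# Assumes df has a DatetimeIndex and a date_ordinal column (their construction is not shown in this snippet)
import matplotlib.pyplot as plt
from matplotlib import ticker
from sklearn.linear_model import LinearRegression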
X = df.date_ordinal.values.reshape(-1,1)
y = df['count'].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X, y)
predictions = reg.predict(X.reshape(-1, 1))
fig, ax = plt.subplots()
plt.scatter(X, y, color ='blue',alpha=0.5)
plt.plot(X, predictions,alpha=0.5, color = 'black',label = r'$N$'+ '= {:.2f}t + {:.2e}\n'.format(reg.coef_[0][0],reg.intercept_[0]))
plt.ylabel('count($N$)');
plt.xlabel(r'Year(t)');
plt.legend()
formatter = ticker.ScalarFormatter(useMathText=True)
formatter.set_scientific(True)
formatter.set_powerlimits((-1,1))
ax.yaxis.set_major_formatter(formatter)
plt.xticks(ticks = df.date_ordinal[::5], labels = df.index.year[::5])
plt.grid()
plt.show()
plt.clf()
This gives me a nice linear regression plot for the time series.
Problem & Desired output
However, I also need a confidence interval for the regression line, as in the attached example image.
Help on this issue would be highly appreciated.
The problem you are running into is that the class you import with from sklearn.linear_model import LinearRegression does not provide a simple way to obtain the confidence interval.
If you absolutely want to use sklearn.linear_model.LinearRegression, you will have to dive into the methods of calculating a confidence interval yourself. One popular approach is bootstrapping, as was done in this previous answer.
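As a rough illustration of the bootstrapping idea (this is not the code from the linked answer), the sketch below resamples the (year, count) pairs with replacement, refits LinearRegression on each resample, and takes pointwise percentiles of the fitted lines. The seed, the number of resamples, and the variable names are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data from the question; years are converted to numbers for the fit.
years = np.arange(1990, 2001, dtype=float)
counts = np.array([96, 184, 148, 154, 160, 149, 124, 274, 322, 301, 300], dtype=float)

X = years.reshape(-1, 1)
rng = np.random.default_rng(0)  # fixed seed (an assumption) for reproducibility
n_boot = 1000                   # number of bootstrap resamples (an assumption)

# Refit the model on resampled (x, y) pairs and store each fitted line.
boot_lines = np.empty((n_boot, len(years)))
for i in range(n_boot):
    idx = rng.integers(0, len(years), size=len(years))
    model = LinearRegression().fit(X[idx], counts[idx])
    boot_lines[i] = model.predict(X)

# Line fitted on the full data, plus a pointwise 95% percentile band.
fit_line = LinearRegression().fit(X, counts).predict(X)
lower, upper = np.percentile(boot_lines, [2.5, 97.5], axis=0)
```

The band can then be drawn on an existing matplotlib axis with ax.fill_between(years, lower, upper, alpha=0.2).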
However, the way I interpret your question is that you are looking for a way to do this quickly inside a plot command, similar to the screenshot you attached. If your goal is purely visualization, then you can simply use the seaborn package, which is also where your example image comes from.
import seaborn as sns
df['year'] = df['year'].astype(int)  # the regression needs a numeric x, not strings
sns.lmplot(x='year', y='count', data=df, fit_reg=True, ci=95, n_boot=1000)
Here I have highlighted the three self-explanatory parameters of interest, with their default values: fit_reg, ci, and n_boot. Refer to the documentation for a full description.
Under the hood, seaborn uses the statsmodels package for some of its regression fits. So if you want something in between pure visualization and writing the confidence interval function from scratch yourself, I would instead refer you to statsmodels. Specifically, look at the documentation for calculating a confidence interval of an ordinary least squares (OLS) linear regression.
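For reference, the textbook confidence interval for the mean response of a simple linear regression (one predictor plus an intercept) at a point x_0 is written out below. The notation is my own: n is the number of observations, s is the residual standard error, and t is the Student t quantile. For a model like the one in this question, the mean confidence bounds reported by statsmodels should correspond to this expression.

```latex
\hat{y}(x_0) \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,
\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}},
\qquad
s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\bigl(y_i-\hat{y}(x_i)\bigr)^2
```

The band is narrowest at the mean of the x values and widens toward both ends of the data, which is the familiar hourglass shape of a regression confidence band.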
The following code should give you a starting point for using statsmodels in your example:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.DataFrame({'year':['1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000'],
                   'count':[96,184,148,154,160,149,124,274,322,301,300]})
df['year'] = df['year'].astype(float)

# Fit an OLS model with an intercept term
X = sm.add_constant(df['year'].values)
ols_model = sm.OLS(df['count'].values, X)
est = ols_model.fit()
out = est.conf_int(alpha=0.05, cols=None)  # 95% confidence intervals of the coefficients

# Plot the data points and the fitted line
fig, ax = plt.subplots()
df.plot(x='year',y='count',linestyle='None',marker='s', ax=ax)
y_pred = est.predict(X)
x_pred = df.year.values
ax.plot(x_pred,y_pred)

# Lower and upper bounds of the confidence interval of the fitted mean
pred = est.get_prediction(X).summary_frame()
ax.plot(x_pred,pred['mean_ci_lower'],linestyle='--',color='blue')
ax.plot(x_pred,pred['mean_ci_upper'],linestyle='--',color='blue')

# Alternative way to plot
def line(x,b=0,m=1):
    return m*x+b

ax.plot(x_pred,line(x_pred,est.params[0],est.params[1]),color='blue')
plt.show()
This produces your desired output, while all of the underlying values remain accessible via standard statsmodels functions.