
Get confidence interval from sklearn linear regression in python

I want to get a confidence interval for the result of a linear regression. I'm working with the Boston house price dataset.

I've found this question: "How to calculate the 99% confidence interval for the slope in a linear regression model in python?" However, it doesn't quite answer my question.

Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from math import pi

import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# import the data
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
boston_dataset = load_boston()

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target

X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']

# split the data into training and test sets, 80% : 20%
# assign random_state to any fixed value; this makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

# model evaluation for training set

y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)

# model evaluation for testing set

y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))

# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)

plt.scatter(Y_test, y_test_predict)
plt.show()

How can I get, for instance, the 95% or 99% confidence interval from this? Is there some sort of built-in function or piece of code?

You may have to build it yourself, or use statsmodels for that. According to the sklearn docs, it does not provide confidence intervals.
Or you can follow this guide on Medium.

I am not sure if there is any built-in function for this purpose, but what I do is run a loop n times, compare the accuracy of all the models, save the model with the highest accuracy with pickle, and reuse it later. Here is the code:

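import pickle

import sklearn.model_selection
from sklearn import linear_model

# X and y are your feature matrix and target
best = float("-inf")  # best score seen so far
for _ in range(30):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

    linear = linear_model.LinearRegression()

    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)  # R^2 on the held-out split
    print("Accuracy: " + str(acc))

    # keep only the best-scoring model on disk
    if acc > best:
        best = acc
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
        print("The best Accuracy: ", best)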

You can always change the variable names, as I know the ones you have provided differ from mine. If you want to predict class probabilities, you can use predict_proba (available on classifiers, not on LinearRegression). Refer to this link for the difference between predict and predict_proba: https://www.kaggle.com/questions-and-answers/82657

If you're looking to compute the confidence interval of the regression parameters, one way is to compute it manually from the results of scikit-learn's LinearRegression using numpy methods.
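For reference, the interval being computed can be written out in standard OLS notation. The symbols here are assumptions for illustration, not names from this thread: $X$ is the $n \times p$ design matrix including the constant column, $y$ the targets, and $\hat{\beta}$ the fitted coefficients (intercept first).

```latex
\hat{\sigma}^2 = \frac{\lVert y - X\hat{\beta} \rVert^2}{n - p},
\qquad
\hat{\beta}_j \;\pm\; t_{1-\alpha/2,\; n-p}\,
\sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}}
```

Here $n - p$ is the residual degrees of freedom and $t_{1-\alpha/2,\,n-p}$ is the upper $\alpha/2$ quantile of Student's t-distribution.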

The code below computes the 95% confidence interval (alpha=0.05). Setting alpha=0.01 would compute the 99% confidence interval, and so on.

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.

# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)

# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom
dof = -np.diff(X_aux.shape)[0]
# Student's t-distribution table lookup
t_val = stats.t.isf(alpha/2, dof)
# MSE of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1; multiplied by the MSE it gives the variance of each parameter
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# distance between lower and upper bound of CI
gap = t_val * np.sqrt(mse * var_params)

conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
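The steps above can also be wrapped into a small reusable function. This is only a sketch under stated assumptions: the function name coef_conf_int and the synthetic data are hypothetical, and np.linalg.lstsq is swapped in for LinearRegression so that the block runs with just numpy and scipy (both solve the same least-squares problem).

```python
import numpy as np
from scipy import stats


def coef_conf_int(X, y, alpha=0.05):
    """Return (coefs, lower, upper) for an OLS fit with an intercept.

    X: (n, k) array of features without a constant column.
    y: (n,) array of targets.
    Hypothetical helper for illustration, not part of any library.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # prepend the constant column so the intercept is the first coefficient
    X_aux = np.column_stack([np.ones(len(X)), X])
    n, p = X_aux.shape
    dof = n - p  # residual degrees of freedom
    # least-squares coefficients (assumption: lstsq instead of sklearn)
    coefs, *_ = np.linalg.lstsq(X_aux, y, rcond=None)
    resid = y - X_aux @ coefs
    mse = resid @ resid / dof
    # diagonal of (X'X)^-1 scales the residual variance per coefficient
    var_params = np.diag(np.linalg.inv(X_aux.T @ X_aux))
    gap = stats.t.isf(alpha / 2, dof) * np.sqrt(mse * var_params)
    return coefs, coefs - gap, coefs + gap


# synthetic data (made up for this example): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -3.0]) + rng.normal(scale=0.5, size=200)

coefs, lower, upper = coef_conf_int(X_demo, y_demo, alpha=0.05)
for name, c, lo, hi in zip(["const", "x1", "x2"], coefs, lower, upper):
    print(f"{name}: {c:.3f}  [{lo:.3f}, {hi:.3f}]")
```

Calling it with alpha=0.01 should give wider intervals around the same coefficients.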

Using the Boston housing dataset, the above code produces the dataframe below:

[Image: output dataframe with 'lower' and 'upper' columns for const, LSTAT and RM]


If this is too much manual code, you can always resort to statsmodels and use its conf_int method:

import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)

Since it uses the same formula, it produces the same output as above.

Stats reference


 