Getting confidence intervals from sklearn linear regression in Python

Question

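I want to get confidence intervals for the results of a linear regression. I am working with the Boston house prices dataset.

I found this question: How to calculate the 99% confidence interval for slope in a linear regression model in python? However, it does not fully answer my question.

Here is my code: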

import numpy as np
import matplotlib.pyplot as plt
from math import pi

import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# import the data
# note: load_boston was removed in scikit-learn 1.2, so this needs an older release
boston_dataset = load_boston()

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target

X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']

# split the data into training and test sets (80% : 20%)
# fixing random_state to any value makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

# model evaluation for training set

y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)

# model evaluation for testing set

y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))

# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)

plt.scatter(Y_test, y_test_predict)
plt.show()

For example, how can I get a 95% or 99% confidence interval from this? Is there some built-in function or piece of code for it?

Answer 1

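You may have to build it yourself, or use statsmodels for this. According to the sklearn docs, LinearRegression does not provide confidence intervals.
Alternatively, you can follow this guide: medium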

Answer 2

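I am not sure whether there is a built-in function for this, but what I did was run a loop n times, compare the accuracy of all the models, save the model with the highest accuracy with pickle, and reuse it later. The code is as follows: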

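import pickle

import sklearn.model_selection
from sklearn import linear_model

best = float("-inf")  # best test-set score seen so far
for _ in range(30):
    # draw a new random 90% / 10% split on every iteration
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

    linear = linear_model.LinearRegression()

    linear.fit(x_train, y_train)
    # for a regressor, score() returns R^2 on the test set
    acc = linear.score(x_test, y_test)
    print("Accuracy: " + str(acc))

    # keep only the best-scoring model on disk
    if acc > best:
        best = acc
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
        print("The best Accuracy: ", best)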

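You can change the variable names as needed, since I know the variables you provided are quite different from mine. If you want to predict class probabilities, you can use predict_proba. For the difference between predict and predict_proba, see this link: https://www.kaggle.com/questions-and-answers/82657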
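
The loop above re-splits the data and keeps the best-scoring model. A related resampling idea that yields an interval for each coefficient is the percentile bootstrap: refit the model on rows resampled with replacement and take quantiles of the refitted coefficients. This is a different technique from the loop above, shown only as a minimal sketch. The synthetic data and the helper name bootstrap_coef_ci are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption; replace with your own X, y):
# y = 2.0 + 1.5*x1 - 0.7*x2 + noise
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 + X @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=n)


def bootstrap_coef_ci(X, y, alpha=0.05, n_boot=1000, rng=None):
    """Percentile bootstrap interval for [intercept, coef_1, ..., coef_p].

    Hypothetical helper: X and y are expected to be NumPy arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows = len(y)
    estimates = np.empty((n_boot, X.shape[1] + 1))
    for b in range(n_boot):
        # resample row indices with replacement and refit
        idx = rng.integers(0, n_rows, size=n_rows)
        model = LinearRegression().fit(X[idx], y[idx])
        estimates[b] = np.r_[model.intercept_, model.coef_]
    # the alpha/2 and 1 - alpha/2 quantiles bound the interval
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


lower, upper = bootstrap_coef_ci(X, y, alpha=0.05, n_boot=1000, rng=rng)
for name, lo, hi in zip(["const", "x1", "x2"], lower, upper):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

With a pandas DataFrame, convert to arrays first (for example with to_numpy()) so that the integer row indexing works as written.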

Answer 3

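If you want to compute confidence intervals for the regression parameters, one way is to compute them manually from the results of LinearRegression using numpy methods.

The code below computes the 95% confidence interval (alpha=0.05). Setting alpha=0.01 computes the 99% confidence interval, and so on.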
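
For reference, the interval in question is the usual t-based interval for an ordinary least squares coefficient. The notation here is an assumption introduced for illustration: X is the design matrix including the constant column (X_aux in the code), n and p are its numbers of rows and columns, and the estimated residual variance corresponds to mse.

```latex
\hat{\beta}_j \;\pm\; t_{1-\alpha/2,\; n-p}\,
\sqrt{\hat{\sigma}^2 \,\bigl[(X^\top X)^{-1}\bigr]_{jj}},
\qquad
\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
```

Here n - p is the degrees of freedom (dof), and the t quantile is what stats.t.isf(alpha/2, dof) returns.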

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.

# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)

# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom: number of samples minus number of estimated parameters
dof = -np.diff(X_aux.shape)[0]
# Student's t-distribution table lookup
t_val = stats.t.isf(alpha/2, dof)
# MSE of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1; multiplied by the MSE it gives the variance of each parameter
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# half-width of the CI (distance from the estimate to each bound)
gap = t_val * np.sqrt(mse * var_params)

conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)

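Using the Boston housing dataset, the code above produces the dataframe below:

If the manual code is too much, you can always resort to statsmodels and use its conf_int method: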

import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)

Since it uses the same formula, it produces the same output as above.
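
Because load_boston is no longer shipped with recent scikit-learn releases (it was removed in version 1.2), one way to try the manual computation is on synthetic data. The sketch below wraps the same steps in a function. The name ols_conf_int, the column names, and the generated data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression


def ols_conf_int(X, y, alpha=0.05):
    """t-based confidence intervals for the intercept and slopes of an OLS fit."""
    model = LinearRegression().fit(X, y)
    coefs = np.r_[model.intercept_, model.coef_]
    # design matrix with an explicit constant column
    X_aux = np.column_stack([np.ones(len(X)), np.asarray(X)])
    n, p = X_aux.shape
    dof = n - p  # degrees of freedom
    resid = np.asarray(y) - model.predict(X)
    mse = resid @ resid / dof  # residual variance estimate
    # standard error of each parameter
    se = np.sqrt(mse * np.diag(np.linalg.inv(X_aux.T @ X_aux)))
    gap = stats.t.isf(alpha / 2, dof) * se  # half-width of the interval
    names = ["const"] + [str(c) for c in getattr(X, "columns", range(1, p))]
    return pd.DataFrame(
        {"coef": coefs, "lower": coefs - gap, "upper": coefs + gap}, index=names
    )


# Synthetic stand-in data (an assumption): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 2)), columns=["x1", "x2"])
y = 3.0 + 2.0 * X["x1"] - 1.0 * X["x2"] + rng.normal(scale=1.0, size=n)

ci95 = ols_conf_int(X, y, alpha=0.05)
ci99 = ols_conf_int(X, y, alpha=0.01)
print(ci95)
print(ci99)
```

The 99% intervals come out wider than the 95% ones, since a smaller alpha gives a larger t quantile.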

Stats reference

Answer 1
0 2020-04-18 16:32:41

Answer 2
0 2022-12-03 19:18:49

Answer 3
0 2022-12-04 05:36:01