Get confidence interval from sklearn linear regression in python
I want to get a confidence interval for the results of a linear regression. I am working with the Boston house-price dataset.
I found this question: How to calculate the 99% confidence interval for the slope in a linear regression model in python? However, it does not quite answer my question.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# import the data
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
# split into training and test sets (80% : 20%)
# fixing random_state to any value makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
plt.scatter(Y_test, y_test_predict)
plt.show()
How can I get, for example, the 95% or 99% confidence interval from this? Is there some built-in function or a piece of code that does this?
I am not sure whether there is any built-in function for this, but what I do is run the training in a loop n times, compare the accuracy of all the models, and save the model with the highest accuracy with pickle so it can be reused later. Here is the code:
import pickle
import sklearn.model_selection
from sklearn import linear_model

best = 0
for _ in range(30):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)  # R^2 on the held-out split
    print("Accuracy: " + str(acc))
    if acc > best:
        best = acc
        # keep the best-scoring model on disk for later reuse
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
print("The best Accuracy: ", best)
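The repeated-split loop above can also be summarized into an empirical interval on the test score, rather than just keeping the single best value. A hedged sketch on synthetic data (the `make_regression` stand-in and the percentile bounds are my own additions, not part of the answer above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic regression data standing in for the X, y of the question
X, y = make_regression(n_samples=300, n_features=2, noise=10.0, random_state=0)

# collect the R^2 score over many random train/test splits
scores = []
for seed in range(30):
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=seed)
    scores.append(LinearRegression().fit(x_tr, y_tr).score(x_te, y_te))

# empirical 95% interval of the score across splits
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"R^2 across splits: [{lo:.3f}, {hi:.3f}]")
```

Note this gives an interval on the model's *score*, not on its coefficients; the second answer below addresses coefficient intervals.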
You can always change the given variables, since I know the ones you provided are quite different from mine. If you want to predict class probabilities, you can use predict_proba. For the difference between predict and predict_proba, see this link: https://www.kaggle.com/questions-and-answers/82657
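As an aside, predict_proba exists on classifiers (e.g. LogisticRegression), not on LinearRegression. A minimal sketch of the difference, using a synthetic two-class dataset of my own making:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic two-class data: label is 1 when the single feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# predict returns hard class labels
print(clf.predict([[2.0]]))
# predict_proba returns [P(class 0), P(class 1)]; each row sums to 1
print(clf.predict_proba([[2.0]]))
```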
If you want to compute the confidence interval of the regression parameters, one way is to compute it manually from the fitted LinearRegression results using numpy methods.
The code below computes the 95% confidence interval (alpha=0.05). alpha=0.01 would compute the 99% confidence interval, and so on.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom: observations minus estimated parameters
dof = X_aux.shape[0] - X_aux.shape[1]
# two-sided critical value from Student's t-distribution
t_val = stats.t.isf(alpha/2, dof)
# mean squared error of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1; multiplied by mse it gives the variance of each parameter
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# distance between lower and upper bound of CI
gap = t_val * np.sqrt(mse * var_params)
conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
Using the Boston housing dataset, the code above produces the dataframe below:
If that is too much manual code, you can always resort to statsmodels and use its conf_int method:
import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)
Since it uses the same formula, it produces the same output as above.