Get confidence interval from sklearn linear regression in python
I want to get a confidence interval for the results of a linear regression. I am working with the Boston house-price dataset.
I found this question: How to calculate the 99% confidence interval for the slope in a linear regression model in python? However, it does not quite answer my question.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# import the data
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
# split into training and test sets (80% : 20%)
# fixing random_state to any value makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
plt.scatter(Y_test, y_test_predict)
plt.show()
How can I get, for example, the 95% or 99% confidence interval from this? Is there some built-in function or a piece of code that does this?
I am not sure whether there is any built-in function for this, but what I do is run the training in a loop n times, compare the accuracy of all the models, and save the model with the highest accuracy with pickle so it can be reused later. Here is the code:
import pickle
import sklearn.model_selection
from sklearn import linear_model

best = 0
for _ in range(30):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)  # R^2 on the held-out split
    print("Accuracy: " + str(acc))
    if acc > best:
        best = acc
        # keep the best-scoring model on disk for later reuse
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
print("The best Accuracy: ", best)
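The repeated-split loop above can also be summarized into an empirical interval on the test score, rather than just keeping the single best value. A hedged sketch on synthetic data (the `make_regression` stand-in and the percentile bounds are my own additions, not part of the answer above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic regression data standing in for the X, y of the question
X, y = make_regression(n_samples=300, n_features=2, noise=10.0, random_state=0)

# collect the R^2 score over many random train/test splits
scores = []
for seed in range(30):
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=seed)
    scores.append(LinearRegression().fit(x_tr, y_tr).score(x_te, y_te))

# empirical 95% interval of the score across splits
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"R^2 across splits: [{lo:.3f}, {hi:.3f}]")
```

Note this gives an interval on the model's *score*, not on its coefficients; the second answer below addresses coefficient intervals.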
You can always change the given variables, since I know the ones you provided are quite different from mine. If you want to predict class probabilities, you can use predict_proba. For the difference between predict and predict_proba, see this link: https://www.kaggle.com/questions-and-answers/82657
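As an aside, predict_proba exists on classifiers (e.g. LogisticRegression), not on LinearRegression. A minimal sketch of the difference, using a synthetic two-class dataset of my own making:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic two-class data: label is 1 when the single feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# predict returns hard class labels
print(clf.predict([[2.0]]))
# predict_proba returns [P(class 0), P(class 1)]; each row sums to 1
print(clf.predict_proba([[2.0]]))
```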
If you want to compute the confidence interval of the regression parameters, one way is to compute it manually from the fitted LinearRegression results using numpy methods.
The code below computes the 95% confidence interval (alpha=0.05). alpha=0.01 would compute the 99% confidence interval, and so on.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom: observations minus estimated parameters
dof = X_aux.shape[0] - X_aux.shape[1]
# two-sided critical value from Student's t-distribution
t_val = stats.t.isf(alpha/2, dof)
# mean squared error of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1; multiplied by mse it gives the variance of each parameter
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# distance between lower and upper bound of CI
gap = t_val * np.sqrt(mse * var_params)
conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
Using the Boston housing dataset, the code above produces the dataframe below:
If that is too much manual code, you can always resort to statsmodels and use its conf_int method:
import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)
Since it uses the same formula, it produces the same output as above.