Get confidence interval from sklearn linear regression in python
I want to get a confidence interval of the result of a linear regression. I'm working with the Boston house price dataset.
I've found this question: "How to calculate the 99% confidence interval for the slope in a linear regression model in python?" However, this doesn't quite answer my question.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# import the data (load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
# split the data into training and test sets (80% : 20%)
# fix random_state to any value so that the split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
plt.scatter(Y_test, y_test_predict)
plt.show()
How can I get, for instance, the 95% or 99% confidence interval from this? Is there some sort of built-in function or piece of code?
I am not sure if there is any built-in function for this purpose, but what I do is run a loop n times, compare the accuracy of all the models, save the model with the highest accuracy with pickle, and reuse it later. Here is the code:
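import pickle
import sklearn.model_selection
from sklearn import linear_model

best = float("-inf")  # best score seen so far
for _ in range(30):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    # for a regressor, score() returns R^2 on the test split
    acc = linear.score(x_test, y_test)
    print("Accuracy: " + str(acc))
    if acc > best:
        best = acc
        # overwrite the saved model whenever a better one is found
        with open("confidence_interval.pickle", "wb") as f:
            pickle.dump(linear, f)
print("The best Accuracy: ", best)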
You can always change the variable names, since the variables you provided differ a lot from mine. If you want to predict class probabilities with a classifier, you can use predict_proba. Refer to this link for the difference between predict and predict_proba: https://www.kaggle.com/questions-and-answers/82657
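The loop above selects a model rather than producing an interval. If a resampling-based interval is what is wanted, one common alternative (not what the answer above does) is the percentile bootstrap: refit the regression on rows resampled with replacement and take percentiles of the fitted coefficients. The sketch below is only an illustration under stated assumptions: the function name bootstrap_coef_ci, the synthetic data, and the use of np.linalg.lstsq in place of LinearRegression are all choices made for this example.

```python
import numpy as np

def bootstrap_coef_ci(X, y, alpha=0.05, n_boot=500, seed=0):
    """Percentile-bootstrap CI for [intercept, slopes] (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])  # design matrix with a constant column
    coefs = np.empty((n_boot, A.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs[b] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    lower = np.percentile(coefs, 100 * alpha / 2, axis=0)
    upper = np.percentile(coefs, 100 * (1 - alpha / 2), axis=0)
    return lower, upper

# synthetic data, used only so that the example runs on its own
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

lower, upper = bootstrap_coef_ci(X_demo, y_demo, alpha=0.05)
print(np.column_stack([lower, upper]))
```

With the question's data, X_train and Y_train could be passed in place of the synthetic arrays; alpha=0.01 would give a 99% interval.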
If you're looking to compute the confidence interval of the regression parameters, one way is to compute it manually using the results of LinearRegression from scikit-learn together with numpy methods.
The code below computes the 95% confidence interval (alpha=0.05). Setting alpha=0.01 would compute the 99% confidence interval, and so on.
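In formula form, the interval that the code that follows computes for each coefficient can be written as below. The symbols are assumptions introduced here for illustration: n is the number of training rows, p the number of estimated parameters including the intercept, and X the design matrix with the constant column added.

```latex
\hat{\beta}_j \;\pm\; t_{1-\alpha/2,\; n-p}\,
\sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}},
\qquad
\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
```

Here n - p corresponds to dof, the variance estimate to mse, the diagonal entries of the inverse to var_params, and the t quantile to t_val in the code.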
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
alpha = 0.05 # for 95% confidence interval; use 0.01 for 99%-CI.
# fit a sklearn LinearRegression model
lin_model = LinearRegression().fit(X_train, Y_train)
# the coefficients of the regression model
coefs = np.r_[[lin_model.intercept_], lin_model.coef_]
# build an auxiliary dataframe with the constant term in it
X_aux = X_train.copy()
X_aux.insert(0, 'const', 1)
# degrees of freedom
dof = -np.diff(X_aux.shape)[0]
# Student's t-distribution table lookup
t_val = stats.t.isf(alpha/2, dof)
# MSE of the residuals
mse = np.sum((Y_train - lin_model.predict(X_train)) ** 2) / dof
# diagonal of (X'X)^-1; multiplied by the MSE it gives the variance of each parameter
var_params = np.diag(np.linalg.inv(X_aux.T.dot(X_aux)))
# half-width of the CI (distance from the estimate to each bound)
gap = t_val * np.sqrt(mse * var_params)
conf_int = pd.DataFrame({'lower': coefs - gap, 'upper': coefs + gap}, index=X_aux.columns)
Using the Boston housing dataset, the above code produces the dataframe below:
If this is too much manual code, you can always resort to statsmodels and use its conf_int method:
import statsmodels.api as sm
alpha = 0.05 # 95% confidence interval
lr = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
conf_interval = lr.conf_int(alpha)
Since it uses the same formula, it produces the same output as above.
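For reuse, the manual computation can be wrapped in a small function. The following is a minimal sketch, not a definitive implementation: the name ols_conf_int and the synthetic data are hypothetical, and np.linalg.lstsq stands in for LinearRegression (both give the ordinary least squares solution) so that the example depends only on numpy and scipy.

```python
import numpy as np
from scipy import stats

def ols_conf_int(X, y, alpha=0.05):
    """Return (coefs, lower, upper) for an OLS fit with intercept (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # add the constant term
    n, p = A.shape
    coefs = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ coefs
    dof = n - p                                 # residual degrees of freedom
    mse = resid @ resid / dof                   # residual variance estimate
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    gap = stats.t.isf(alpha / 2, dof) * se      # half-width of the interval
    return coefs, coefs - gap, coefs + gap

# synthetic data, used only so that the example runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

coefs, lo95, hi95 = ols_conf_int(X_demo, y_demo, alpha=0.05)
_, lo99, hi99 = ols_conf_int(X_demo, y_demo, alpha=0.01)
print(np.column_stack([lo95, coefs, hi95]))
```

Passing X_train and Y_train instead of the synthetic arrays should reproduce the intervals from the manual code above; the 99% interval is always wider than the 95% one for the same fit.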