
Confidence interval of probability prediction from logistic regression statsmodels

I'm trying to recreate a plot from An Introduction to Statistical Learning, and I'm having trouble figuring out how to calculate the confidence interval for a probability prediction. Specifically, I'm trying to recreate the right-hand panel of this figure (figure 7.1), which predicts the probability that wage > 250 from a degree-4 polynomial of age, with associated 95% confidence intervals. The wage data is here if anyone cares.

I can predict and plot the predicted probabilities fine with the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

wage = pd.read_csv('../../data/Wage.csv', index_col=0)
wage['wage250'] = 0
wage.loc[wage['wage'] > 250, 'wage250'] = 1

poly = PolynomialFeatures(degree=4)
age = poly.fit_transform(wage['age'].values.reshape(-1, 1))

logit = sm.Logit(wage['wage250'], age).fit()

age_range_poly = poly.fit_transform(np.arange(18, 81).reshape(-1, 1))

y_proba = logit.predict(age_range_poly)

plt.plot(age_range_poly[:, 1], y_proba)

But I'm at a loss as to how the confidence intervals of the predicted probabilities are calculated. I have thought about bootstrapping the data many times to get the distribution of probabilities for each age, but I know there is an easier way that is just beyond my grasp.

I have the estimated coefficient covariance matrix and the standard errors associated with each estimated coefficient. How would I go about calculating the confidence intervals, as shown in the right-hand panel of the figure above, given this information?

Thanks!

You can use the delta method to find the approximate variance of a predicted probability. Namely,

var(proba) = np.dot(np.dot(gradient.T, cov), gradient)

where gradient is the vector of derivatives of the predicted probability with respect to the model coefficients, and cov is the covariance matrix of the coefficients.
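
For the logistic model this gradient has a closed form: with p = 1 / (1 + exp(-x·β)), the derivative of p with respect to β is p(1 − p)·x. A minimal sketch with hypothetical numbers, checking the closed form against a numerical derivative:

import numpy as np

beta = np.array([-1.0, 0.5])   # hypothetical coefficients
x = np.array([1.0, 2.0])       # one observation (intercept + regressor)

def proba(b):
    return 1.0 / (1.0 + np.exp(-x.dot(b)))

p = proba(beta)
analytic = p * (1 - p) * x     # closed-form gradient
eps = 1e-6
numeric = np.array([(proba(beta + eps * e) - proba(beta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric))  # True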

The delta method is proven to work asymptotically for all maximum likelihood estimates. However, if you have a small training sample, asymptotic methods may not work well, and you should consider bootstrapping instead.

Here is a toy example of applying the delta method to logistic regression:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# generate data
np.random.seed(1)
x = np.arange(100)
y = (x * 0.5 + np.random.normal(size=100, scale=10) > 30)
# estimate the model
X = sm.add_constant(x)
model = sm.Logit(y, X).fit()
proba = model.predict(X) # predicted probability

# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (proba * (1 - proba) * X.T).T # matrix of gradients for each observation
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96 # z multiplier for a 95% confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))

plt.plot(x, proba)
plt.plot(x, lower, color='g')
plt.plot(x, upper, color='g')
plt.show()

It draws the following nice picture: [figure: predicted probabilities with a 95% confidence band]
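
As a side note, the per-observation loop over gradients above can be replaced with a single einsum call that computes the diagonal of G·cov·Gᵀ (a sketch equivalent to the loop):

std_errors = np.sqrt(np.einsum('ij,jk,ik->i', gradient, cov, gradient))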

For your example the code would be:

proba = logit.predict(age_range_poly)
cov = logit.cov_params()
gradient = (proba * (1 - proba) * age_range_poly.T).T 
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96 
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))

plt.plot(age_range_poly[:, 1], proba)
plt.plot(age_range_poly[:, 1], lower, color='g')
plt.plot(age_range_poly[:, 1], upper, color='g')
plt.show()

and it would give the following picture: [figure: predicted probability of wage > 250 with delta-method 95% bounds]

Looks pretty much like a boa constrictor with an elephant inside.

You could compare it with the bootstrap estimates:

preds = []
for i in range(1000):
    # resample rows with replacement and refit the model on each bootstrap sample
    boot_idx = np.random.choice(len(age), replace=True, size=len(age))
    model = sm.Logit(wage['wage250'].iloc[boot_idx], age[boot_idx]).fit(disp=0)
    preds.append(model.predict(age_range_poly))
p = np.array(preds)
# the 2.5th and 97.5th percentiles form a 95% bootstrap interval
plt.plot(age_range_poly[:, 1], np.percentile(p, 97.5, axis=0))
plt.plot(age_range_poly[:, 1], np.percentile(p, 2.5, axis=0))
plt.show()

[figure: bootstrap percentile bounds for the predicted probability]

The results of the delta method and the bootstrap look pretty much the same.

The authors of the book, however, go a third way. They use the fact that

proba = np.exp(np.dot(x, params)) / (1 + np.exp(np.dot(x, params)))

and calculate the confidence interval for the linear part, and then transform it with the logistic (inverse logit) function:

xb = np.dot(age_range_poly, logit.params)
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in age_range_poly])
upper_xb = xb + c * std_errors
lower_xb = xb - c * std_errors
upper = np.exp(upper_xb) / (1 + np.exp(upper_xb))
lower = np.exp(lower_xb) / (1 + np.exp(lower_xb))
plt.plot(age_range_poly[:, 1], upper)
plt.plot(age_range_poly[:, 1], lower)
plt.show()
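
Equivalently, and a bit more numerically robust for large |xb|, the transform can use scipy.special.expit, the inverse logit (a small sketch, not from the original answer):

from scipy.special import expit
upper = expit(upper_xb)
lower = expit(lower_xb)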

So they get a diverging interval:

[figure: 95% interval computed on the log-odds scale and mapped back to probabilities]

These methods produce such different results because they assume different things (the predicted probability and the log-odds, respectively) to be normally distributed. Namely, the delta method assumes the predicted probabilities are normal, while in the book the log-odds are. In fact, neither is normal in finite samples; both converge to normality in infinite samples, but their variances converge to zero at the same time. Maximum likelihood estimates are insensitive to reparametrization, but their estimated distributions are not, and that's the problem.
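
To see the practical consequence (hypothetical numbers, purely illustrative): near the boundary of [0, 1] a symmetric interval on the probability scale can spill outside the unit interval, while an interval built on the log-odds scale and mapped back cannot:

import numpy as np

p, se_p = 0.02, 0.015                      # hypothetical probability and its standard error
print(p - 1.96 * se_p, p + 1.96 * se_p)    # (-0.0094, 0.0494): dips below zero, needs clipping

xb = np.log(p / (1 - p))                   # same point on the log-odds scale
se_xb = se_p / (p * (1 - p))               # standard error mapped across scales (delta method)
lo, hi = xb - 1.96 * se_xb, xb + 1.96 * se_xb
print(1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi)))  # ~(0.005, 0.084): asymmetric, inside (0, 1)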

Here is an instructive and efficient method to calculate the standard errors of the fit ('fit_mean_se') and of single observations ('fit_obs_se') on top of a statsmodels Logit().fit() results object ('fit'), identical to the method in the book ISLR and to the last method from the answer by David Dale:

fit_mean = fit.model.exog.dot(fit.params)  # linear predictor X @ beta
# standard error of the fit: sqrt of the diagonal of X @ cov(beta) @ X.T
fit_mean_se = ((fit.model.exog * fit.model.exog.dot(fit.cov_params())).sum(axis=1)) ** 0.5
# standard error for a single observation: add the residual variance to the fit variance
fit_obs_se = (((fit.model.endog - fit_mean).std(ddof=fit.params.shape[0])) ** 2 +
              fit_mean_se ** 2) ** 0.5
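
A hedged usage sketch for turning these into shaded bands on the probability scale; it assumes the design matrix stores the raw regressor in column 1 (true for the polynomial features used above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

order = np.argsort(fit.model.exog[:, 1])   # sort by the raw regressor for plotting
xs = fit.model.exog[order, 1]
xb, se = fit_mean[order], fit_mean_se[order]
plt.plot(xs, expit(xb))
plt.fill_between(xs, expit(xb - 1.96 * se), expit(xb + 1.96 * se), alpha=0.3)
plt.show()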

[figure: a plot similar to the one in the book ISLR, with shaded confidence bands]

The shaded regions represent the 95% confidence intervals for the fit and for single observations.

Ideas for improvement are most welcome.
