How to get the unscaled regression coefficients errors using statsmodels?
I'm trying to compute the errors on the coefficients of a regression with statsmodels, i.e. the standard errors of the parameter estimates. However, I need their "unscaled" version, which so far I have only managed to get with NumPy.
You can see what "unscaled" means in the NumPy documentation: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
cov bool or str, optional
If given and not False, return not just the estimate but also its covariance matrix.
By default, the covariance are scaled by chi2/dof, where dof = M - (deg + 1),
i.e., the weights are presumed to be unreliable except in a relative sense and
everything is scaled such that the reduced chi2 is unity. This scaling is omitted
if cov='unscaled', as is relevant for the case that the weights are w = 1/sigma, with
sigma known to be a reliable estimate of the uncertainty.
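To make that scaling concrete, here is a minimal sketch (with made-up toy data, not the data from this question) showing that the default covariance returned by `np.polyfit` is exactly the `cov='unscaled'` covariance multiplied by chi2/dof:

```python
import numpy as np

# Toy data, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Default: covariance scaled by chi2/dof
_, cov_scaled = np.polyfit(x, y, deg=1, cov=True)
# 'unscaled': the chi2/dof factor is omitted
coeffs, cov_unscaled = np.polyfit(x, y, deg=1, cov='unscaled')

residuals = y - np.polyval(coeffs, x)
chi2 = np.sum(residuals**2)
dof = len(x) - (1 + 1)  # dof = M - (deg + 1)

print(np.allclose(cov_scaled, cov_unscaled * chi2 / dof))  # True
```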
I'm running the rest of the code in this post with this data:
import numpy as np
x = np.array([-0.841, -0.399, 0.599, 0.203, 0.527, 0.129, 0.703, 0.503])
y = np.array([1.01, 1.24, 1.09, 0.95, 1.02, 0.97, 1.01, 0.98])
sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
# The weight conventions differ between statsmodels and NumPy
sm_weights = np.array([1.0/sigma**2 for sigma in sigmas])
np_weights = np.array([1.0/sigma for sigma in sigmas])
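The reason the conventions differ: NumPy's `w` multiplies the residual before squaring, while statsmodels' `weights` multiply the already-squared residual, so the two fits minimize the same objective when `sm_weights == np_weights**2`. A quick check:

```python
import numpy as np

sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
np_weights = 1.0 / sigmas     # numpy: w_i multiplies the residual, then it is squared
sm_weights = 1.0 / sigmas**2  # statsmodels: weight multiplies the squared residual

print(np.allclose(sm_weights, np_weights**2))  # True
```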
Using NumPy:
coefficients, cov = np.polyfit(x, y, deg=2, w=np_weights, cov='unscaled')
# The errors I need to get
print(np.sqrt(np.diag(cov))) # [917.57938013 191.2100413 211.29028248]
And if I compute the regression with statsmodels:
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as smapi
polynomial_features = PolynomialFeatures(degree=2)
polynomial = polynomial_features.fit_transform(x.reshape(-1, 1))
model = smapi.WLS(y, polynomial, weights=sm_weights)
regression = model.fit()
# Get coefficient errors
# Notice the [::-1], statsmodels returns the coefficients in the reverse order NumPy does
print(regression.bse[::-1]) # [0.24532856, 0.05112286, 0.05649161]
So the values I get are different, but related:
np_errors = np.sqrt(np.diag(cov))
sm_errors = regression.bse[::-1]
print(np_errors / sm_errors) # [3740.2061481, 3740.2061481, 3740.2061481]
The NumPy documentation says "the covariance are scaled by chi2/dof, where dof = M - (deg + 1)", so I tried the following:
degree = 2
model_predictions = np.polyval(coefficients, x)
residuals = (model_predictions - y)
chi_squared = np.sum(residuals**2)
degrees_of_freedom = len(x) - (degree + 1)
scale_factor = chi_squared / degrees_of_freedom
sm_cov = regression.cov_params()
unscaled_errors = np.sqrt(np.diag(sm_cov * scale_factor))[::-1] # [0.09848423, 0.02052266, 0.02267789]
unscaled_errors = np.sqrt(np.diag(sm_cov / scale_factor))[::-1] # [0.61112427, 0.12734931, 0.14072311]
I also noticed that the covariance matrix I get from NumPy is much larger than the one from statsmodels:
>>> cov
array([[ 841951.9188366 , -154385.61049538, -188456.18957375],
[-154385.61049538, 36561.27989418, 31208.76422516],
[-188456.18957375, 31208.76422516, 44643.58346933]])
>>> regression.cov_params()
array([[ 0.0031913 , 0.00223093, -0.0134716 ],
[ 0.00223093, 0.00261355, -0.0110361 ],
[-0.0134716 , -0.0110361 , 0.0601861 ]])
As long as I can't make them equal, I won't get the same errors. Any idea what the difference in scale means and how to make the two covariance matrices equal?
Parts of the statsmodels documentation are not well organized. Here is a notebook with an example: https://www.statsmodels.org/devel/examples/notebooks/generated/chi2_fitting.html
Regression models such as OLS and WLS in statsmodels have an option to keep the scale fixed. This is the equivalent of cov="unscaled" in numpy and scipy. The statsmodels option is more general, because it allows fixing the scale at any user-defined value.
Given a model defined as in the question, either OLS or WLS, then using
regression = model.fit(cov_type="fixed scale")
keeps the scale fixed at 1, and the resulting covariance matrix is unscaled.
Using
regression = model.fit(cov_type="fixed scale", cov_kwds={"scale": 2})
fixes the scale at the value two.
(Some links to the related motivating discussion are in https://github.com/statsmodels/statsmodels/pull/2137 )
Warning
The fixed scale cov_type will be used for inferential statistics that are based on the parameter-estimate covariance cov_params. This affects standard errors, t-tests, Wald tests, and confidence and prediction intervals. However, some other results statistics might not be adjusted to use the fixed scale instead of the estimated scale, e.g. resid_pearson.