[英]Variance inflation factor in ridge regression in python
I'm running a ridge regression on somewhat collinear data. 我正在对有些共线的数据进行岭回归。 One of the methods used to identify a stable fit is a ridge trace and thanks to the great example on scikit-learn , I'm able to do that. 用于识别稳定拟合的方法之一是脊线跟踪,并且由于关于scikit-learn的一个很好的例子,我能够做到这一点。 Another method is to calculate variance inflation factors (VIFs) for each variable as k increases. 另一种方法是在k增加时计算每个变量的方差膨胀因子(VIF)。 When the VIFs decrease to <5 it is an indication the fit is satisfactory. 当VIF减小到<5时,表明适合度令人满意。 Statsmodels has code for VIFs, but it is for an OLS regression. Statsmodels具有VIF代码,但它适用于OLS回归。 I've attempted to alter it to handle a ridge regression. 我试图改变它以处理岭回归。
I'm checking my results against Regression Analysis by Example, 5th edition, chapter 10. My code generates the correct results for k = 0.000, but not after that. 我正在通过示例,第5版,第10章对回归分析检查我的结果。我的代码生成k = 0.000的正确结果,但之后没有。 Working SAS code is available, but I'm not a SAS user and I don't know the differences between that implementation and scikit-learn's (and/or statsmodels's). 工作SAS代码可用,但我不是SAS用户,我不知道该实现与scikit-learn(和/或statsmodels)之间的差异。
I've been stuck on this for a few days so any help would be much appreciated. 我已经坚持了几天,所以任何帮助将不胜感激。
#http://www.ats.ucla.edu/stat/sas/examples/chp/chp_ch10.htm
from __future__ import division
import numpy as np
import pandas as pd
example = pd.read_csv('by_example_import.csv')
example.dropna(inplace=True)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(example)
scaler.transform(example)
X = example.drop(['year', 'import'], axis=1)
#c_matrix = X.corr()
y = example['import']
#w, v = np.linalg.eig(c_matrix)
import pylab as pl
from sklearn import linear_model
###############################################################################
# Compute paths
alphas = [0.000, 0.001, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.018,
0.020, 0.022, 0.024, 0.026, 0.028, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080,
0.090, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
clf = linear_model.Ridge(fit_intercept=False)
clf2 = linear_model.Ridge(fit_intercept=False)
coefs = []
vif_list = [[] for x in range(X.shape[1])]
for a in alphas:
clf.set_params(alpha=a)
clf.fit(X, y)
coefs.append(clf.coef_)
for j, data in enumerate(X.columns):
cols = [col for col in X.columns if col not in [data]]
Z = X[cols]
yy = X.iloc[:,j]
clf2.set_params(alpha=a)
clf2.fit(Z, yy)
r_squared_j = clf2.score(Z, yy)
vif = 1. / (1. - r_squared_j)
print r_squared_j
vif_list[j].append(vif)
pd.DataFrame(vif_list, columns = alphas).T
pd.DataFrame(coefs, index=alphas)
###############################################################################
# Display results
ax = pl.gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
ax.plot(alphas, coefs)
pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyle='dashdot')
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
Variance inflation factor for Ridge regression is just three lines. 岭回归的方差膨胀因子只有三条线。 I checked it with the example on the UCLA statistics page. 我使用UCLA统计页面上的示例进行了检查。
A variation of this will make it into the next statsmodels release. 这种变化将使其成为下一个statsmodels版本。 Here is my current function: 这是我目前的功能:
def vif_ridge(corr_x, pen_factors, is_corr=True):
"""variance inflation factor for Ridge regression
assumes penalization is on standardized variables
data should not include a constant
Parameters
----------
corr_x : array_like
correlation matrix if is_corr=True or original data if is_corr is False.
pen_factors : iterable
iterable of Ridge penalization factors
is_corr : bool
Boolean to indicate how corr_x is interpreted, see corr_x
Returns
-------
vif : ndarray
variance inflation factors for parameters in columns and ridge
penalization factors in rows
could be optimized for repeated calculations
"""
corr_x = np.asarray(corr_x)
if not is_corr:
corr = np.corrcoef(corr_x, rowvar=0, bias=True)
else:
corr = corr_x
eye = np.eye(corr.shape[1])
res = []
for k in pen_factors:
minv = np.linalg.inv(corr + k * eye)
vif = minv.dot(corr).dot(minv)
res.append(np.diag(vif))
return np.asarray(res)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.