[英]Variance Inflation Factor in Python
我正在嘗試在 python 中的一個簡單數據集中計算每一列的方差膨脹因子 (VIF):
a b c d
1 2 4 4
1 2 6 3
2 3 7 4
3 2 8 5
4 1 9 4
我已經使用usdm 庫中的 vif 函數在 R 中完成了此操作,結果如下:
a <- c(1, 1, 2, 3, 4)
b <- c(2, 2, 3, 2, 1)
c <- c(4, 6, 7, 8, 9)
d <- c(4, 3, 4, 5, 4)
df <- data.frame(a, b, c, d)
vif_df <- vif(df)
print(vif_df)
Variables VIF
a 22.95
b 3.00
c 12.95
d 3.00
但是,當我使用statsmodel vif 函數在 python 中做同樣的事情時,我的結果是:
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
ck = np.column_stack([a, b, c, d])
vif = [variance_inflation_factor(ck, i) for i in range(ck.shape[1])]
print(vif)
Variables VIF
a 47.136986301369774
b 28.931506849315081
c 80.31506849315096
d 40.438356164383549
即使輸入相同,結果也大不相同。 一般來說,statsmodel VIF 函數的結果似乎是錯誤的,但我不確定這是因為我調用它的方式還是函數本身的問題。
我希望有人可以幫助我弄清楚我是否錯誤地調用了 statsmodel 函數或解釋了結果中的差異。 如果是該函數的問題,那么在 python 中是否有任何 VIF 替代方案?
正如其他人以及該函數的作者 Josef Perktold 在這篇文章中所提到的, variance_inflation_factor
期望在解釋變量矩陣中存在一個常數。 在將其值傳遞給函數之前,可以使用add_constant
的 add_constant 將所需的常量添加到數據幀中。
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
df = pd.DataFrame(
{'a': [1, 1, 2, 3, 4],
'b': [2, 2, 3, 2, 1],
'c': [4, 6, 7, 8, 9],
'd': [4, 3, 4, 5, 4]}
)
X = add_constant(df)
>>> pd.Series([variance_inflation_factor(X.values, i)
for i in range(X.shape[1])],
index=X.columns)
const 136.875
a 22.950
b 3.000
c 12.950
d 3.000
dtype: float64
我相信您還可以使用assign
常量添加到數據框的最右側列:
X = df.assign(const=1)
>>> pd.Series([variance_inflation_factor(X.values, i)
for i in range(X.shape[1])],
index=X.columns)
a 22.950
b 3.000
c 12.950
d 3.000
const 136.875
dtype: float64
源代碼本身相當簡潔:
def variance_inflation_factor(exog, exog_idx):
"""
exog : ndarray, (nobs, k_vars)
design matrix with all explanatory variables, as for example used in
regression
exog_idx : int
index of the exogenous variable in the columns of exog
"""
k_vars = exog.shape[1]
x_i = exog[:, exog_idx]
mask = np.arange(k_vars) != exog_idx
x_noti = exog[:, mask]
r_squared_i = OLS(x_i, x_noti).fit().rsquared
vif = 1. / (1. - r_squared_i)
return vif
修改代碼以將所有 VIF 作為一個系列返回也很簡單:
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
def variance_inflation_factors(exog_df):
'''
Parameters
----------
exog_df : dataframe, (nobs, k_vars)
design matrix with all explanatory variables, as for example used in
regression.
Returns
-------
vif : Series
variance inflation factors
'''
exog_df = add_constant(exog_df)
vifs = pd.Series(
[1 / (1. - OLS(exog_df[col].values,
exog_df.loc[:, exog_df.columns != col].values).fit().rsquared)
for col in exog_df],
index=exog_df.columns,
name='VIF'
)
return vifs
>>> variance_inflation_factors(df)
const 136.875
a 22.950
b 3.000
c 12.950
Name: VIF, dtype: float64
根據@T_T 的解決方案,您還可以簡單地執行以下操作:
vifs = pd.Series(np.linalg.inv(df.corr().to_numpy()).diagonal(),
index=df.columns,
name='VIF')
我相信這是由於 Python 的 OLS 不同造成的。 用於python方差膨脹因子計算的OLS,默認不加截距。 但是,您肯定希望在那里進行攔截。
您想要做的是在矩陣 ck 中再添加一列,用 ck 填充以表示常量。 這將是方程的截距項。 完成此操作后,您的值應正確匹配。
編輯:用一個替換零
對於這個線程的未來來者(像我一樣):
import numpy as np
import scipy as sp
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
ck = np.column_stack([a, b, c, d])
cc = sp.corrcoef(ck, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()
這段代碼給出
array([22.95, 3. , 12.95, 3. ])
[編輯]
為了回應評論,我嘗試盡可能多地使用DataFrame
(需要numpy
來反轉矩陣)。
import pandas as pd
import numpy as np
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
df = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
df_cor = df.corr()
pd.DataFrame(np.linalg.inv(df.corr().values), index = df_cor.index, columns=df_cor.columns)
代碼給出
a b c d
a 22.950000 6.453681 -16.301917 -6.453681
b 6.453681 3.000000 -4.080441 -2.000000
c -16.301917 -4.080441 12.950000 4.080441
d -6.453681 -2.000000 4.080441 3.000000
對角線元素給出 VIF。
如果您不想處理variance_inflation_factor
和add_constant
。 請考慮以下兩個函數。
1.在statasmodels中使用公式:
import pandas as pd
import statsmodels.formula.api as smf
def get_vif(exogs, data):
'''Return VIF (variance inflation factor) DataFrame
Args:
exogs (list): list of exogenous/independent variables
data (DataFrame): the df storing all variables
Returns:
VIF and Tolerance DataFrame for each exogenous variable
Notes:
Assume we have a list of exogenous variable [X1, X2, X3, X4].
To calculate the VIF and Tolerance for each variable, we regress
each of them against other exogenous variables. For instance, the
regression model for X3 is defined as:
X3 ~ X1 + X2 + X4
And then we extract the R-squared from the model to calculate:
VIF = 1 / (1 - R-squared)
Tolerance = 1 - R-squared
The cutoff to detect multicollinearity:
VIF > 10 or Tolerance < 0.1
'''
# initialize dictionaries
vif_dict, tolerance_dict = {}, {}
# create formula for each exogenous variable
for exog in exogs:
not_exog = [i for i in exogs if i != exog]
formula = f"{exog} ~ {' + '.join(not_exog)}"
# extract r-squared from the fit
r_squared = smf.ols(formula, data=data).fit().rsquared
# calculate VIF
vif = 1/(1 - r_squared)
vif_dict[exog] = vif
# calculate tolerance
tolerance = 1 - r_squared
tolerance_dict[exog] = tolerance
# return VIF DataFrame
df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
return df_vif
2.用LinearRegression
的sklearn:
# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression
def sklearn_vif(exogs, data):
# initialize dictionaries
vif_dict, tolerance_dict = {}, {}
# form input data for each exogenous variable
for exog in exogs:
not_exog = [i for i in exogs if i != exog]
X, y = data[not_exog], data[exog]
# extract r-squared from the fit
r_squared = LinearRegression().fit(X, y).score(X, y)
# calculate VIF
vif = 1/(1 - r_squared)
vif_dict[exog] = vif
# calculate tolerance
tolerance = 1 - r_squared
tolerance_dict[exog] = tolerance
# return VIF DataFrame
df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
return df_vif
例子:
import seaborn as sns
df = sns.load_dataset('car_crashes')
exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted']
[In] %%timeit -n 100
get_vif(exogs=exogs, data=df)
[Out]
VIF Tolerance
alcohol 3.436072 0.291030
no_previous 3.113984 0.321132
not_distracted 2.668456 0.374749
speeding 1.884340 0.530690
69.6 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
[In] %%timeit -n 100
sklearn_vif(exogs=exogs, data=df)
[Out]
VIF Tolerance
alcohol 3.436072 0.291030
no_previous 3.113984 0.321132
not_distracted 2.668456 0.374749
speeding 1.884340 0.530690
15.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
波士頓數據示例:
VIF是通過輔助回歸計算的,因此不依賴於實際擬合。
見下文:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# Break into left and right hand side; y and X
y, X = dmatrices(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston, return_type="dataframe")
# For each Xi, calculate VIF
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Fit X to y
result = sm.OLS(y, X).fit()
我根據我在 Stack 和 CrossValidated 上看到的其他一些帖子編寫了這個函數。 它顯示超過閾值的特征,並返回一個刪除了特征的新數據幀。
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
def calculate_vif_(df, thresh=5):
'''
Calculates VIF each feature in a pandas dataframe
A constant must be added to variance_inflation_factor or the results will be incorrect
:param df: the pandas dataframe containing only the predictor features, not the response variable
:param thresh: the max VIF value before the feature is removed from the dataframe
:return: dataframe with features removed
'''
const = add_constant(df)
cols = const.columns
variables = np.arange(const.shape[1])
vif_df = pd.Series([variance_inflation_factor(const.values, i)
for i in range(const.shape[1])],
index=const.columns).to_frame()
vif_df = vif_df.sort_values(by=0, ascending=False).rename(columns={0: 'VIF'})
vif_df = vif_df.drop('const')
vif_df = vif_df[vif_df['VIF'] > thresh]
print 'Features above VIF threshold:\n'
print vif_df[vif_df['VIF'] > thresh]
col_to_drop = list(vif_df.index)
for i in col_to_drop:
print 'Dropping: {}'.format(i)
df = df.drop(columns=i)
return df
雖然已經晚了,但我正在從給定的答案中添加一些修改。 如果我們使用@Chef1075 解決方案,為了在去除多重共線性后獲得最佳集合,那么我們將丟失相關的變量。 我們只需要刪除其中之一。 為此,我使用@steve 回答提供了以下解決方案:
import pandas as pd
from sklearn.linear_model import LinearRegression
def sklearn_vif(exogs, data):
'''
This function calculates variance inflation function in sklearn way.
It is a comparatively faster process.
'''
# initialize dictionaries
vif_dict, tolerance_dict = {}, {}
# form input data for each exogenous variable
for exog in exogs:
not_exog = [i for i in exogs if i != exog]
X, y = data[not_exog], data[exog]
# extract r-squared from the fit
r_squared = LinearRegression().fit(X, y).score(X, y)
# calculate VIF
vif = 1/(1 - r_squared)
vif_dict[exog] = vif
# calculate tolerance
tolerance = 1 - r_squared
tolerance_dict[exog] = tolerance
# return VIF DataFrame
df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
return df_vif
df = pd.DataFrame(
{'a': [1, 1, 2, 3, 4,1],
'b': [2, 2, 3, 2, 1,3],
'c': [4, 6, 7, 8, 9,5],
'd': [4, 3, 4, 5, 4,6],
'e': [8,8,14,15,17,20]}
)
df_vif= sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF',ascending=False)
while (df_vif.VIF>5).any() ==True:
red_df_vif= df_vif.drop(df_vif.index[0])
df= df[red_df_vif.index]
df_vif=sklearn_vif(exogs=df.columns,data=df).sort_values(by='VIF',ascending=False)
print(df)
d c b
0 4 4 2
1 3 6 2
2 4 7 3
3 5 8 2
4 4 9 1
5 6 5 3
這里使用數據框 python 的代碼:
import numpy as np
import scipy as sp
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
import pandas as pd
data = pd.DataFrame()
data["a"] = a
data["b"] = b
data["c"] = c
data["d"] = d
cc = np.corrcoef(data, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()
array([22.95, 3. , 12.95, 3. ])
另一個解決方案。 以下代碼給出了與 R car 包完全相同的 VIF 結果。
def calc_reg_return_vif(X, y):
"""
Utility function to calculate the VIF. This section calculates the linear
regression inverse R squared.
Parameters
----------
X : DataFrame
Input data.
y : Series
Target.
Returns
-------
vif : float
Calculated VIF value.
"""
X = X.values
y = y.values
if X.shape[1] == 1:
print("Note, there is only one predictor here")
X = X.reshape(-1, 1)
reg = LinearRegression().fit(X, y)
vif = 1 / (1 - reg.score(X, y))
return vif
def calc_vif_from_scratch(df):
"""
Calculating VIF using function from scratch
Parameters
----------
df : DataFrame
without target variable.
Returns
-------
vif : DataFrame
giving the feature - VIF value pair.
"""
vif = pd.DataFrame()
vif_list = []
for feature in list(df.columns):
y = df[feature]
X = df.drop(feature, axis="columns")
vif_list.append(calc_reg_return_vif(X, y))
vif["feature"] = df.columns
vif["VIF"] = vif_list
return vif
我已經在泰坦尼克號數據集上對其進行了測試。 您可以在此處獲取完整示例:https ://github.com/tulicsgabriel/Variance-Inflation-Factor-VIF-
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.