LinAlgError: Singular matrix from Statsmodels logistic regression

I'm running a logistic regression on the Lalonde dataset to estimate propensity scores. I used the logit function from statsmodels.formula.api and wrapped the covariates with C() to make them categorical. Treating age and educ as continuous variables results in successful convergence, but making them categorical raises the following error:

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.617306
         Iterations: 35
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-29-bae905b632a4> in <module>
----> 1 psmodel = fsms.logit('treatment ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)', tdf).fit()
      2 tdf['ps'] = psmodel.predict()
      3 tdf.head()

~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1832         bnryfit = super(Logit, self).fit(start_params=start_params,
   1833                 method=method, maxiter=maxiter, full_output=full_output,
-> 1834                 disp=disp, callback=callback, **kwargs)
   1835 
   1836         discretefit = LogitResults(self, bnryfit)

~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    218         mlefit = super(DiscreteModel, self).fit(start_params=start_params,
    219                 method=method, maxiter=maxiter, full_output=full_output,
--> 220                 disp=disp, callback=callback, **kwargs)
    221 
    222         return mlefit # up to subclasses to wrap results

~/venv/lib/python3.7/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    471             Hinv = cov_params_func(self, xopt, retvals)
    472         elif method == 'newton' and full_output:
--> 473             Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    474         elif not skip_hessian:
    475             H = -1 * self.hessian(xopt)

~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in inv(a)
    549     signature = 'D->D' if isComplexType(t) else 'd->d'
    550     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    552     return wrap(ainv.astype(result_t, copy=False))
    553 

~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     95 
     96 def _raise_linalgerror_singular(err, flag):
---> 97     raise LinAlgError("Singular matrix")
     98 
     99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

To reproduce, load the Lalonde dataset (you can write it to CSV from R's data(lalonde)) and run the following code:

import numpy as np
import pandas as pd
from statsmodels.formula import api as fsms

filename = 'lalonde.csv'
df = pd.read_csv(filename)
tdf = df.drop(['re74', 're75', 'u74', 'u75'], axis=1)
formula = 'treat ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)'
psmodel = fsms.logit(formula, tdf).fit()

I'm not sure why this failed to converge and ended up with a singular Hessian during training.

Interestingly, some examples I found online about causal inference with the Lalonde dataset don't make the variables categorical, which makes no sense to me. One example is Microsoft DoWhy, which uses LogisticRegression from sklearn out of the box. It does not seem to encode the variables as categorical.

There are other similar examples that run logistic regression on the Lalonde dataset without making the variables categorical. The variables are numeric in the data, but the values should not be treated as continuous. At least, I feel they should be put into bins, if not one category per value. But that's a different question, more appropriate for Cross Validated. Could someone help me understand why I got this error and what the right way to get rid of it is?
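For the binning idea mentioned above, here is a minimal sketch of what I have in mind, using pandas.cut. The ages, bin edges, and labels are made up for illustration and are not taken from the Lalonde data.

```python
import pandas as pd

# Toy ages standing in for the `age` column; edges and labels are arbitrary choices.
ages = pd.Series([17, 22, 25, 31, 38, 46, 55], name="age")
edges = [0, 20, 30, 40, 120]
labels = ["<20", "20-29", "30-39", "40+"]

# right=False makes each bin closed on the left: [0, 20), [20, 30), ...
age_bin = pd.cut(ages, bins=edges, labels=labels, right=False)

# Number of observations per bin, in bin order.
counts = age_bin.value_counts().sort_index()
print(counts)
```

A column built this way could presumably replace C(age) in the formula, giving a handful of levels instead of one level per distinct age.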

I got this error while running the following logistic model.

import statsmodels.formula.api

results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2 + const', data=df).fit()
print(results.summary())

After examining each variable, I found that one was, in fact, a constant.

df.const.value_counts()

1    100000
Name: Targeted, dtype: int64

Oops!

Upon removing it,

results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2', data=df).fit()
print(results.summary())

the logistic model ran as expected.
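A constant column is one case of a more general problem: any exact linear dependence among the columns of the design matrix makes the Hessian singular. The sketch below is one way to check for that with NumPy and pandas. It uses a synthetic frame rather than the real data, and the helper name rank_report is made up for illustration. It also assumes that nodegr in the Lalonde data is derived from educ (no degree when educ is below 12), which would make C(nodegr) redundant once C(educ) is in the formula.

```python
import numpy as np
import pandas as pd

def rank_report(frame, categorical_cols):
    """Return (rank, n_columns) of an intercept plus dummy-coded design matrix."""
    cats = frame[categorical_cols].astype("category")
    dummies = pd.get_dummies(cats, drop_first=True)  # one reference level dropped per variable
    X = np.column_stack([np.ones(len(frame)), dummies.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(X)), X.shape[1]

# Synthetic stand-in for the real data (assumption: nodegr = 1 when educ < 12).
rng = np.random.default_rng(0)
educ = np.tile(np.arange(8, 15), 30)  # 210 rows, education levels 8..14
toy = pd.DataFrame({
    "educ": educ,
    "nodegr": (educ < 12).astype(int),
    "married": rng.integers(0, 2, size=educ.size),
    "const": 1,
})

# Columns with a single distinct value duplicate the intercept.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
print("constant columns:", constant_cols)

rank, ncols = rank_report(toy, ["educ", "nodegr", "married"])
print("rank", rank, "of", ncols, "columns")  # rank < ncols means exact collinearity
```

If the rank comes out smaller than the number of columns, dropping the redundant term (the constant in my case, or possibly nodegr in the question's formula) should remove the exact dependence. Sparse levels, such as an age value that occurs in only one treatment group, may still cause convergence trouble on their own.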
