LinAlgError: Singular matrix from Statsmodels logistic regression

Question

I'm running a logistic regression on the Lalonde dataset to estimate propensity scores. I used the logit function from statsmodels.formula.api and wrapped the covariates in C() to make them categorical. Treating age and educ as continuous variables converges successfully, but making them categorical raises the following error:

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.617306
         Iterations: 35
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-29-bae905b632a4> in <module>
----> 1 psmodel = fsms.logit('treatment ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)', tdf).fit()
      2 tdf['ps'] = psmodel.predict()
      3 tdf.head()

~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1832         bnryfit = super(Logit, self).fit(start_params=start_params,
   1833                 method=method, maxiter=maxiter, full_output=full_output,
-> 1834                 disp=disp, callback=callback, **kwargs)
   1835 
   1836         discretefit = LogitResults(self, bnryfit)

~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    218         mlefit = super(DiscreteModel, self).fit(start_params=start_params,
    219                 method=method, maxiter=maxiter, full_output=full_output,
--> 220                 disp=disp, callback=callback, **kwargs)
    221 
    222         return mlefit # up to subclasses to wrap results

~/venv/lib/python3.7/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    471             Hinv = cov_params_func(self, xopt, retvals)
    472         elif method == 'newton' and full_output:
--> 473             Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    474         elif not skip_hessian:
    475             H = -1 * self.hessian(xopt)

~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in inv(a)
    549     signature = 'D->D' if isComplexType(t) else 'd->d'
    550     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    552     return wrap(ainv.astype(result_t, copy=False))
    553 

~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     95 
     96 def _raise_linalgerror_singular(err, flag):
---> 97     raise LinAlgError("Singular matrix")
     98 
     99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

To reproduce, load the Lalonde dataset (you can export it to CSV from R after data(lalonde)) and run the following code:

import pandas as pd
from statsmodels.formula import api as fsms

filename = 'lalonde.csv'
df = pd.read_csv(filename)
# Drop the earnings columns, which are not used as covariates here
tdf = df.drop(['re74', 're75', 'u74', 'u75'], axis=1)
# C(...) tells the formula parser to dummy-code each variable, one level per distinct value
formula = 'treat ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)'
psmodel = fsms.logit(formula, tdf).fit()

I'm not sure why the fit failed to converge and ended up with a singular Hessian.

Interestingly, some examples I found online about causal inference with the Lalonde dataset don't make the variables categorical, which makes no sense to me. One example is Microsoft's DoWhy, which uses LogisticRegression from sklearn out of the box and does not seem to encode the variables as categorical.

There are other similar examples that run logistic regression on the Lalonde dataset without making the variables categorical. The variables are numeric in the data, but I don't think the values should be treated as continuous. At the very least, I feel they should be binned, if not given one category per value. That is a separate question, though, and more appropriate for CrossValidated. Could someone help me understand why I got this error and what the right way to get rid of it is?

Answer 1

I got this error while running the following logistic model.

import statsmodels.formula.api

results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2 + const', data=df).fit()
print(results.summary())

After examining each variable, I found that one was, in fact, a constant.

df.const.value_counts()

1    100000
Name: Targeted, dtype: int64

Oops!

Upon removing it,

results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2', data=df).fit()
print(results.summary())

the logistic model ran as expected.
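The same check can be generalised: a constant column is only one way to end up with linearly dependent columns, and dummy-coded categoricals can produce the same problem. Below is a minimal sketch, not the statsmodels internals, that looks for constant columns and compares the rank of a design matrix with its number of columns. The toy data and the helper names find_constant_columns and rank_report are made up for illustration. The toy nodegr column is built as educ < 12 on purpose; whether nodegr in the Lalonde data is derived from educ in the same way is an assumption you would need to verify on your own copy of the file.

```python
import numpy as np
import pandas as pd


def find_constant_columns(df):
    """Return names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


def rank_report(design):
    """Return (numerical rank, number of columns) of a design matrix."""
    matrix = np.asarray(design, dtype=float)
    rank = int(np.linalg.matrix_rank(matrix))
    return rank, matrix.shape[1]


# Hypothetical toy data, NOT the Lalonde file: `nodegr` is fully determined
# by `educ`, and `const` never varies.
toy = pd.DataFrame({
    "educ": [8, 9, 10, 11, 12, 13, 8, 9, 10, 11, 12, 13],
    "married": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "const": [1] * 12,
})
toy["nodegr"] = (toy["educ"] < 12).astype(int)

constant_columns = find_constant_columns(toy)

# Hand-built design matrix: intercept, one dummy per education level
# (first level dropped), `married`, and `nodegr`. This roughly mimics
# what a formula with C(educ) + married + nodegr would construct.
dummies = pd.get_dummies(toy["educ"], prefix="educ", drop_first=True).astype(float)
intercept = pd.Series(1.0, index=toy.index, name="Intercept")
design = pd.concat([intercept, dummies, toy[["married", "nodegr"]]], axis=1)

rank, n_columns = rank_report(design)
print("constant columns:", constant_columns)
print("rank:", rank, "columns:", n_columns)
```

If the rank is smaller than the number of columns, at least one column is a linear combination of the others, and inverting the Hessian can fail with "Singular matrix". In that case, dropping the redundant term or merging sparse categories into bins is usually the first thing to try.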
