I'm running a logistic regression on the Lalonde dataset to estimate propensity scores. I used the logit function from statsmodels.formula.api and wrapped the covariates in C() to make them categorical. Treating age and educ as continuous variables converges successfully, but making them categorical raises the following error:
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.617306
Iterations: 35
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-29-bae905b632a4> in <module>
----> 1 psmodel = fsms.logit('treatment ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)', tdf).fit()
2 tdf['ps'] = psmodel.predict()
3 tdf.head()
~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
1832 bnryfit = super(Logit, self).fit(start_params=start_params,
1833 method=method, maxiter=maxiter, full_output=full_output,
-> 1834 disp=disp, callback=callback, **kwargs)
1835
1836 discretefit = LogitResults(self, bnryfit)
~/venv/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
218 mlefit = super(DiscreteModel, self).fit(start_params=start_params,
219 method=method, maxiter=maxiter, full_output=full_output,
--> 220 disp=disp, callback=callback, **kwargs)
221
222 return mlefit # up to subclasses to wrap results
~/venv/lib/python3.7/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
471 Hinv = cov_params_func(self, xopt, retvals)
472 elif method == 'newton' and full_output:
--> 473 Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
474 elif not skip_hessian:
475 H = -1 * self.hessian(xopt)
~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in inv(a)
549 signature = 'D->D' if isComplexType(t) else 'd->d'
550 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
552 return wrap(ainv.astype(result_t, copy=False))
553
~/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
95
96 def _raise_linalgerror_singular(err, flag):
---> 97 raise LinAlgError("Singular matrix")
98
99 def _raise_linalgerror_nonposdef(err, flag):
LinAlgError: Singular matrix
To reproduce, load the Lalonde dataset (you can write it to CSV from R after loading data(lalonde)) and run the following code:
import numpy as np
import pandas as pd
from statsmodels.formula import api as fsms

# Load the data exported from R.
filename = 'lalonde.csv'
df = pd.read_csv(filename)
# Drop the columns that are not used in the model.
tdf = df.drop(['re74', 're75', 'u74', 'u75'], axis=1)
# C(...) tells the formula parser to treat a covariate as categorical.
formula = 'treat ~ 1 + C(age) + C(educ) + C(black) + C(hisp) + C(married) + C(nodegr)'
psmodel = fsms.logit(formula, tdf).fit()
I'm not sure why the fit failed to converge and ended with a singular Hessian.
Interestingly, some examples I found online about causal inference with the Lalonde dataset don't make the variables categorical, which makes no sense to me. One example is Microsoft's DoWhy, which uses LogisticRegression from sklearn out of the box and does not appear to encode the variables as categorical.
There are other similar examples that run logistic regression on the Lalonde dataset without making the variables categorical. These variables are numeric in the data, but their values should not be treated as continuous. At least I feel they should be put into bins, if not given one category per value. But that's a different question, more appropriate for Cross Validated. Could someone help me understand why I got this error and what the right way to get rid of it is?
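One possible cause worth checking (an assumption, not verified against the Lalonde data): when a covariate such as age gets one dummy per value, some levels may contain only treated or only control units. The coefficient for such a level has no finite maximum-likelihood estimate, which can show up as non-convergence and a singular Hessian. The sketch below uses made-up toy data and a hypothetical helper, levels_with_one_class, to list those levels.

```python
import pandas as pd

def levels_with_one_class(frame, factor, outcome):
    """List the levels of `factor` where `outcome` takes only one value."""
    # Rows are factor levels, columns are outcome values, cells are counts.
    counts = pd.crosstab(frame[factor], frame[outcome])
    # A zero count means that level never occurs with that outcome value.
    one_class = (counts == 0).any(axis=1)
    return sorted(counts.index[one_class].tolist())

# Made-up toy data, not the Lalonde sample.
toy = pd.DataFrame({
    "age":   [20, 20, 21, 21, 22, 22, 23, 55],
    "treat": [0,  1,  0,  1,  1,  0,  1,  1],
})

print(levels_with_one_class(toy, "age", "treat"))
```

If the same check on the real data returns many levels, binning age and educ into coarser groups is one way to reduce the problem.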
I got this error while running the following logistic model.
results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2 + const', data=df).fit()
print(results.summary())
After examining each variable, I found that one was, in fact, a constant.
df.const.value_counts()
1 100000
Name: Targeted, dtype: int64
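Oops! The formula interface already adds an intercept, so this constant column is perfectly collinear with it, which makes the Hessian singular.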
Upon removing it,
results = statsmodels.formula.api.logit('binary_outcome ~ x1 + x2', data=df).fit()
print(results.summary())
the logistic model ran as expected.
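A quick way to catch this before fitting is to look for constant columns and to check whether the design matrix has full column rank. The following is a minimal sketch on synthetic data; the helper names find_constant_columns and is_rank_deficient are hypothetical, and the intercept column is added by hand to mimic what the formula interface does.

```python
import numpy as np
import pandas as pd

def find_constant_columns(frame):
    """Return names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]

def is_rank_deficient(design):
    """True when the design matrix has linearly dependent columns."""
    design = np.asarray(design, dtype=float)
    return np.linalg.matrix_rank(design) < design.shape[1]

# Synthetic stand-in for the data frame above (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "const": 1,
})

constant_cols = find_constant_columns(demo)

# Emulate the intercept column that the formula interface adds by default.
intercept = np.ones((len(demo), 1))
with_const = np.hstack([intercept, demo[["x1", "x2", "const"]].to_numpy()])
without_const = np.hstack([intercept, demo[["x1", "x2"]].to_numpy()])

print(constant_cols)                     # columns with one distinct value
print(is_rank_deficient(with_const))     # intercept plus a constant column
print(is_rank_deficient(without_const))  # intercept plus x1 and x2 only
```

The rank check also flags other exact linear dependencies, such as two dummy variables that always sum to one, which a constant-column scan alone would miss.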