In logistic regression, how do I set my 'reference level' for my dummy variables in python

Question

I'm building a logistic regression model in Python using statsmodels. Since many of my columns are categorical, I one-hot encoded them using "get_dummies". My new dataframe now has many more columns of 1s and 0s (i.e., gender1, gender2, status1, status2, status3, etc.).

With this new dataframe, how do I set a 'reference level' for my logistic regression? And how do I know what the reference level is by default?

Answer 1

I am not 100% sure what your question is about, but scikit-learn has the concept of a dummy regressor.

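If you have a feature matrix X and a target y (for example, taken from a dataframe df), it works like this:

from sklearn.dummy import DummyRegressor
# strategy='mean' always predicts the mean of y; DummyRegressor takes no random_state
clf = DummyRegressor(strategy='mean')
clf = clf.fit(X, y)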

There is also DummyClassifier, imported with "from sklearn.dummy import DummyClassifier". Check the docs; the idea is always to predict a baseline, such as the mean or the most frequent category.

Answer 2

A bit late to the party but... to set the reference level you can try the formula API: statsmodels.formula.api

The formula API uses Patsy to turn the formula string and your dataframe into a design matrix that statsmodels can use. You might also find that Patsy can handle most of the data shaping you need, so you may not need get_dummies at all.
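
To see which level Patsy treats as the reference, you can build the design matrix directly and look at its column names. This is only a small sketch: the "status" column and its levels are made up for illustration, and it assumes the column holds plain strings, in which case Patsy orders the levels by sorting them and drops the first one.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data; the column name "status" and its levels are assumptions.
df = pd.DataFrame({"status": ["single", "married", "divorced", "married"]})

# Default treatment coding: the first level in sorted order ("divorced")
# becomes the reference and gets no column of its own.
default_dm = dmatrix("C(status)", data=df, return_type="dataframe")
default_cols = list(default_dm.columns)

# Explicit reference level: "married" is left out instead.
custom_dm = dmatrix(
    "C(status, Treatment(reference='married'))",
    data=df,
    return_type="dataframe",
)
custom_cols = list(custom_dm.columns)

print(default_cols)
print(custom_cols)
```

Whichever level has no "[T.level]" column is the reference; every coefficient for that variable is then read relative to it.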

To set the reference level:

import statsmodels.formula.api as smf

# 'var' is the categorical column; 'reference_value' is the level to use as the baseline
log_reg = smf.logit("y ~ C(var, Treatment(reference='reference_value'))", data=df)
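
For a fuller picture, here is a rough end-to-end sketch on synthetic data. The column names (gender, status, y), the levels, and the outcome probabilities are all invented for the example; the point is only to show that the chosen reference level does not appear among the fitted coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; every name and value here is an assumption for illustration.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=n),
    "status": rng.choice(["single", "married", "divorced"], size=n),
})
# Binary outcome that is more likely for the "married" level.
p = np.where(df["status"] == "married", 0.7, 0.4)
df["y"] = rng.binomial(1, p)

# gender uses the default reference (first sorted level, "female");
# status is explicitly compared against "single".
model = smf.logit(
    "y ~ C(gender) + C(status, Treatment(reference='single'))",
    data=df,
)
result = model.fit(disp=False)

param_names = list(result.params.index)
print(result.params)
```

The coefficient labels end in "[T.level]", so the level missing from that list for each variable is its reference. Note that this works on the original categorical columns, not on the output of get_dummies.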

See: Handling Categorical Data
