
In logistic regression, how do I set the 'reference level' for my dummy variables in Python?

I'm fitting a logistic regression model in Python using statsmodels. Since a lot of my columns are categorical, I one-hot encoded them using "get_dummies". My new dataframe now has many more columns containing 1's and 0's (i.e. gender1, gender2, status1, status2, status3, etc.).

With this new dataframe, how do I set a 'reference level' for my logistic regression? And how do I know what the reference level is set to by default?

I am not 100% sure what your question is about, but scikit-learn has the concept of a dummy regressor.

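If you have a feature matrix X and a target y (for example, columns taken from a dataframe df), it works like this:

from sklearn.dummy import DummyRegressor

# Baseline model that always predicts the mean of y.
# DummyRegressor takes no random_state argument.
reg = DummyRegressor(strategy='mean')
reg = reg.fit(X, y)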

There is also the DummyClassifier, imported with from sklearn.dummy import DummyClassifier. Check the docs; the idea is always to predict a baseline such as the mean (for regression) or the most frequent category (for classification).
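As a rough illustration of the baseline idea, here is a minimal sketch of DummyClassifier with the "most_frequent" strategy. The toy arrays X and y are made up for the example and are not from the question.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the dummy model ignores the feature values entirely.
X = np.zeros((6, 2))
y = np.array([0, 1, 1, 1, 0, 1])

# Always predict the class that occurs most often in y.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

predictions = baseline.predict(X)  # the majority class for every row
accuracy = baseline.score(X, y)    # share of rows that belong to the majority class
```

Note that this gives a baseline to compare a model against; it does not set a reference level for categorical predictors.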

A bit late to the party, but to set the reference level you can try the formula API: statsmodels.formula.api

The formula api uses Patsy to turn the formula string into a dataframe (in stats terminology a design matrix) that statsmodels can use. You might also find Patsy can handle most of the data shaping you need.
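To see which level Patsy treats as the reference, you can build the design matrix directly and look at its columns. This is a small sketch with a made-up status column; with the default treatment coding of a string column, the level that sorts first has no column of its own and acts as the reference.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical example data.
df = pd.DataFrame({"status": ["single", "married", "divorced", "married"]})

# Default treatment coding: the first level in sorted order ("divorced")
# is the reference, so it gets no dummy column.
default_design = dmatrix("C(status)", data=df, return_type="dataframe")
default_columns = list(default_design.columns)

# Explicit reference level: "married" gets no dummy column instead.
custom_design = dmatrix(
    "C(status, Treatment(reference='married'))",
    data=df,
    return_type="dataframe",
)
custom_columns = list(custom_design.columns)

print(default_columns)
print(custom_columns)
```

The level missing from the printed column names is the reference level; the remaining coefficients are interpreted relative to it.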

To set the reference level:

import statsmodels.formula.api as smf

# 'var' is the categorical column; 'reference_value' is the level to use as the baseline
log_reg = smf.logit("y ~ C(var, Treatment(reference='reference_value'))", data=df).fit()

See: Handling Categorical Data
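If you would rather keep the get_dummies approach from the question, one way to think about it: a full one-hot encoding has no reference level until you drop one column per categorical variable, and the dropped level becomes the reference. The sketch below uses a made-up status column to show both the default drop and a hand-picked one.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"status": ["single", "married", "divorced", "married"]})

# Full one-hot encoding: one column per level, so no level is the reference yet.
full = pd.get_dummies(df["status"], prefix="status", dtype=int)

# drop_first=True removes the first level in sorted order ("divorced"),
# which makes that level the reference.
default_ref = pd.get_dummies(df["status"], prefix="status", drop_first=True, dtype=int)

# To choose the reference yourself, drop that level's column instead.
married_ref = full.drop(columns="status_married")

print(list(default_ref.columns))
print(list(married_ref.columns))
```

When fitting on such columns with the non-formula interface (for example sm.Logit), remember to add an intercept yourself, e.g. with sm.add_constant, since the formula API adds one for you but the array interface does not.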
