I am using scikit-learn's `LogisticRegression` on a dataset of household characteristics and trying to understand how to prepare the independent variables.

I have created binary dummy variables in place of categorical variables. E.g. the variable `DWELLING_TYPE`, which had 3 possible values (`DetachedHouse`, `SemiDetached` and `Apartment`), has been replaced with 3 binary variables `DWELLING_TYPE_DetachedHouse`, `DWELLING_TYPE_SemiDetached` and `DWELLING_TYPE_Apartment`, each taking the value `1` or `0`.

Clearly these 3 variables are collinear: if one of them is `1`, the other 2 must be `0`. My understanding is that collinearity should be minimised for logistic regression, so should I be omitting one of these variables from the input matrix?
Yes, it is good practice. When you convert your categorical variables into dummies, you can drop one of the dummy columns per category; the dropped category becomes the baseline, and the remaining columns carry the same information without the redundancy. In Python you can do this with `pd.get_dummies`:

pd.get_dummies(df, columns=categorical_columns, drop_first=True)

Setting the `drop_first` parameter to `True` will do this for you.
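As a concrete illustration, here is a minimal sketch using a made-up DataFrame modelled on the question's `DWELLING_TYPE` variable (the data and the `NUM_OCCUPANTS` column are invented for the example). `drop_first=True` drops the first category in sorted order, here `Apartment`, which then becomes the implicit baseline encoded by both dummies being `0`:

```python
import pandas as pd

# Hypothetical data mirroring the question's DWELLING_TYPE variable.
df = pd.DataFrame({
    "DWELLING_TYPE": ["DetachedHouse", "SemiDetached", "Apartment", "DetachedHouse"],
    "NUM_OCCUPANTS": [4, 3, 1, 2],
})

# drop_first=True drops the alphabetically first category ("Apartment").
# A row where both remaining dummy columns are 0 is an Apartment.
dummies = pd.get_dummies(df, columns=["DWELLING_TYPE"], drop_first=True)
print(dummies.columns.tolist())
# → ['NUM_OCCUPANTS', 'DWELLING_TYPE_DetachedHouse', 'DWELLING_TYPE_SemiDetached']
```

The resulting matrix has only 2 dummy columns for the 3-level category, so the columns no longer sum to a constant and the collinearity with the intercept is removed.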