How to deal with co-linearity of dummy variables for linear regression?

I am using scikit-learn LogisticRegression on a dataset of household characteristics and trying to understand how to prepare the independent variables.

I have created binary dummy variables in place of the categorical variables. For example, the variable DWELLING_TYPE, which had 3 possible values (DetachedHouse, SemiDetached and Apartment), has been replaced with 3 binary variables, DWELLING_TYPE_DetachedHouse, DWELLING_TYPE_SemiDetached and DWELLING_TYPE_Apartment, each of which takes the value 1 or 0.

Clearly these 3 variables are co-dependent (co-linear?) because if one of these variables is 1, the other 2 must be 0. My understanding is that co-linearity should be minimised for logistic regression, so should I be omitting one of these variables from the input matrix?

Yes, it is good practice. When you convert a categorical variable into dummies, you can drop one of them: the dropped category becomes the baseline, and its value is implied whenever all the remaining dummies are 0. This removes the redundancy from your input features.
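To see the redundancy concretely, here is a minimal sketch using NumPy. The toy data and variable names are made up for illustration. It checks the rank of a design matrix that contains an intercept column plus the dummies, with and without one dummy dropped.

```python
import numpy as np

# Hypothetical toy data: 0 = DetachedHouse, 1 = SemiDetached, 2 = Apartment
dwelling = np.array([0, 1, 2, 0, 2, 1])

# One-hot encode: each row has exactly one 1, so the three columns sum to 1
dummies = np.eye(3)[dwelling]
intercept = np.ones((len(dwelling), 1))

# Intercept + all 3 dummies: 4 columns, but one is a linear combination of the others
full = np.hstack([intercept, dummies])
# Intercept + 2 dummies (first category dropped): 3 independent columns
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 4 columns, rank 3
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

The full matrix has 4 columns but only rank 3, because the dummies always add up to the intercept column. Dropping one dummy leaves 3 columns of rank 3, so no information is lost.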

In Python you can do this with pd.get_dummies:

pd.get_dummies(df, columns=categorical_columns, drop_first=True)

Setting the drop_first parameter to True drops the first level of each encoded column, so a variable with k categories yields k-1 dummies.
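As a self-contained sketch of the call above, assuming a small made-up DataFrame (the ROOMS column and the row values are hypothetical, only DWELLING_TYPE comes from the question):

```python
import pandas as pd

# Hypothetical sample data; only DWELLING_TYPE is taken from the question
df = pd.DataFrame({
    "DWELLING_TYPE": ["DetachedHouse", "SemiDetached", "Apartment", "DetachedHouse"],
    "ROOMS": [5, 4, 2, 6],
})
categorical_columns = ["DWELLING_TYPE"]

# drop_first=True removes the first level of each encoded column;
# levels are sorted, so "Apartment" becomes the implicit baseline here
encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
print(encoded.columns.tolist())
# ['ROOMS', 'DWELLING_TYPE_DetachedHouse', 'DWELLING_TYPE_SemiDetached']
```

A row where both remaining dummies are 0 is an Apartment, so the dropped column can always be recovered from the other two.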
