How to deal with co-linearity of dummy variables for linear regression?

I am using scikit-learn LogisticRegression on a dataset of household characteristics and trying to understand how to prepare the independent variables.

I have created binary dummy variables in place of categorical variables, e.g. the variable DWELLING_TYPE, which had 3 possible values (DetachedHouse, SemiDetached and Apartment), has been replaced with 3 binary variables (DWELLING_TYPE_DetachedHouse, DWELLING_TYPE_SemiDetached and DWELLING_TYPE_Apartment) that each have the value 1 or 0.

Clearly these 3 variables are co-dependent (co-linear?) because if one of them is 1, the other 2 must be 0. My understanding is that co-linearity should be minimised for Logistic Regression, so should I be omitting one of these variables from the input matrix?

Yes, it is good practice. When you convert a categorical variable into dummies, you can drop one of them. The dropped category becomes the baseline, and the remaining columns still identify every category, so no information is lost. This removes the redundancy from your input features.
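To see the redundancy concretely, here is a minimal sketch on hypothetical toy data (the rows are made up; only the DWELLING_TYPE names come from the question). It assumes the model also has an intercept column of ones, which is what makes the full set of dummies exactly collinear.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the category names are taken from the question.
df = pd.DataFrame({
    "DWELLING_TYPE": ["DetachedHouse", "SemiDetached", "Apartment",
                      "Apartment", "DetachedHouse"],
})

# All three dummies, then the same matrix with one dummy dropped by hand.
full = pd.get_dummies(df, columns=["DWELLING_TYPE"], dtype=int)
reduced = full.drop(columns=["DWELLING_TYPE_Apartment"])

def with_intercept(frame):
    """Prepend a column of ones, as a model with an intercept would."""
    X = frame.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# 4 columns (intercept + 3 dummies) but the three dummies sum to the
# intercept column, so the matrix is rank-deficient.
rank_full = np.linalg.matrix_rank(with_intercept(full))

# 3 columns (intercept + 2 dummies), all linearly independent.
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))

print(with_intercept(full).shape[1], rank_full)
print(with_intercept(reduced).shape[1], rank_reduced)
```

With all three dummies the design matrix has more columns than its rank; after dropping one, the column count and the rank agree.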

In Python you can do this with pd.get_dummies:

pd.get_dummies(df, columns=categorical_columns, drop_first=True)

Setting the drop_first parameter to True drops the first dummy column of each categorical variable, leaving k-1 dummies for a variable with k categories.
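As a small self-contained illustration of that call, the sketch below uses a hypothetical DataFrame: the ROOMS column and the values are invented for the example, and categorical_columns is assumed to be a plain list of column names.

```python
import pandas as pd

# Hypothetical household data; ROOMS is an invented numeric column.
df = pd.DataFrame({
    "DWELLING_TYPE": ["DetachedHouse", "SemiDetached", "Apartment"],
    "ROOMS": [5, 4, 2],
})

# Assumed to be the list of columns that hold categorical values.
categorical_columns = ["DWELLING_TYPE"]

# drop_first=True keeps k-1 dummies per categorical column.
encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

print(list(encoded.columns))
```

Here the categories are ordered alphabetically, so DWELLING_TYPE_Apartment is the column that gets dropped and Apartment acts as the baseline category; the coefficients of the two remaining dummies are then read relative to it.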
