
Linear regression with dummy/categorical variables

I have a set of data. I have used pandas to convert them into dummy and categorical variables respectively. So, now I want to know how to run a multiple linear regression (I am using statsmodels) in Python. Are there some considerations, or do I have to indicate somehow in my code that the variables are dummy/categorical? Or is the transformation of the variables enough, so that I just have to run the regression as model = sm.OLS(y, X).fit()?

My code is the following:

import pandas as pd
import statsmodels.api as sm

datos = pd.read_csv("datos_2.csv")
df = pd.DataFrame(datos)  # read_csv already returns a DataFrame, so this copy is optional
print(df)

I get this:

Age  Gender    Wage         Job         Classification 
32    Male  450000       Professor           High
28    Male  500000  Administrative           High
40  Female   20000       Professor            Low
47    Male   70000       Assistant         Medium
50  Female  345000       Professor         Medium
27  Female  156000       Assistant            Low
56    Male  432000  Administrative            Low
43  Female  100000  Administrative            Low

Then I encode 1 = Male, 0 = Female and 1: Professor, 2: Administrative, 3: Assistant, this way:

df['Sex_male'] = df.Gender.map({'Female': 0, 'Male': 1})
df['Job_index'] = df.Job.map({'Professor': 1, 'Administrative': 2, 'Assistant': 3})
print(df)

Getting this:

 Age  Gender    Wage             Job Classification  Sex_male  Job_index
 32    Male  450000       Professor           High         1          1
 28    Male  500000  Administrative           High         1          2
 40  Female   20000       Professor            Low         0          1
 47    Male   70000       Assistant         Medium         1          3
 50  Female  345000       Professor         Medium         0          1
 27  Female  156000       Assistant            Low         0          3
 56    Male  432000  Administrative            Low         1          2
 43  Female  100000  Administrative            Low         0          2

Now, if I run a multiple linear regression, for example:

y = df['Wage']
X = df[['Sex_male', 'Job_index', 'Age']]
X = sm.add_constant(X)
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)
print(results1)

The result is displayed normally, but would it be fine? Or do I have to indicate somehow that the variables are dummy or categorical? Please help, I am new to Python and I want to learn. Greetings from South America - Chile.

In linear regression with categorical variables you should be careful of the Dummy Variable Trap. The Dummy Variable Trap is a scenario in which the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can produce singularity of the model, meaning your model just won't work. Read about it here.
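
To see the trap concretely, here is a minimal sketch (using a made-up gender column, not the question's data) showing that a full set of dummies plus an intercept is perfectly collinear:

import numpy as np
import pandas as pd

gender = pd.Series(['Male', 'Male', 'Female', 'Male'])
dummies = pd.get_dummies(gender)               # columns: Female, Male
design = np.column_stack([np.ones(4), dummies])  # intercept plus both dummy columns
print(np.linalg.matrix_rank(design))           # prints 2, not 3: Female + Male always equals 1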

The idea is to use dummy variable encoding with drop_first=True; this will omit one column from each category after converting the categorical variables into dummy/indicator variables. You WILL NOT lose any relevant information by doing that, simply because every point in your dataset can be fully explained by the rest of the features.

Here is the complete code showing how you can do it for your jobs dataset.

So you have your X features:

Age, Gender, Job, Classification 

And one numerical feature that you are trying to predict:

Wage

First you need to split your initial dataset into input variables and the prediction target; assuming it's a pandas DataFrame, it would look like this:

Input variables (your dataset is a bit different, but the whole code remains the same; you will put every column from the dataset in X, except the one that will go to Y. pd.get_dummies works without problems that way - it will just convert the categorical variables and won't touch the numerical ones):

X = jobs[['Age','Gender','Job','Classification']]

Prediction:

Y = jobs['Wage']

Convert the categorical variables into dummy/indicator variables and drop the first level in each category:

X = pd.get_dummies(data=X, drop_first=True)

So now if you check the shape of X (X.shape) with drop_first=True, you will see that it has three columns fewer - one for each of your categorical variables.
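
For example, with the eight-row dataset from the question loaded into the jobs DataFrame used above, a quick check might look like this (the shape comments assume exactly those columns):

X_full = pd.get_dummies(data=jobs[['Age', 'Gender', 'Job', 'Classification']])
X_drop = pd.get_dummies(data=jobs[['Age', 'Gender', 'Job', 'Classification']], drop_first=True)
print(X_full.shape)  # (8, 9): Age plus 2 Gender + 3 Job + 3 Classification dummies
print(X_drop.shape)  # (8, 6): one dummy column dropped per categorical variable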

You can now continue to use them in your linear model. For a scikit-learn implementation it could look like this:

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)
regr = linear_model.LinearRegression()  # Do not use fit_intercept=False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
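
Once fitted, you can line up the learned coefficients with the dummy-encoded column names and score the held-out split; a short sketch continuing the code above (r2_score is my addition, not part of the original answer):

from sklearn.metrics import r2_score

for name, coef in zip(X.columns, regr.coef_):  # X kept its column names from get_dummies
    print(f'{name}: {coef:.2f}')
print('Test R^2:', r2_score(Y_test, predicted))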

You'll need to indicate that either Job or Job_index is a categorical variable; otherwise, Job_index will be treated as a continuous variable (which just happens to take the values 1, 2, and 3), which isn't right.

You can use a few different kinds of notation in statsmodels; here's the formula approach, which uses C() to indicate a categorical variable:

from statsmodels.formula.api import ols

fit = ols('Wage ~ C(Sex_male) + C(Job) + Age', data=df).fit() 

fit.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Wage   R-squared:                       0.592
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     1.089
Date:                Wed, 06 Jun 2018   Prob (F-statistic):              0.492
Time:                        22:35:43   Log-Likelihood:                -104.59
No. Observations:                   8   AIC:                             219.2
Df Residuals:                       3   BIC:                             219.6
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept             3.67e+05   3.22e+05      1.141      0.337   -6.57e+05    1.39e+06
C(Sex_male)[T.1]     2.083e+05   1.39e+05      1.498      0.231   -2.34e+05    6.51e+05
C(Job)[T.Assistant] -2.167e+05   1.77e+05     -1.223      0.309    -7.8e+05    3.47e+05
C(Job)[T.Professor] -9273.0556   1.61e+05     -0.058      0.958   -5.21e+05    5.03e+05
Age                 -3823.7419   6850.345     -0.558      0.616   -2.56e+04     1.8e+04
==============================================================================
Omnibus:                        0.479   Durbin-Watson:                   1.620
Prob(Omnibus):                  0.787   Jarque-Bera (JB):                0.464
Skew:                          -0.108   Prob(JB):                        0.793
Kurtosis:                       1.839   Cond. No.                         215.
==============================================================================

Note: Job and Job_index won't use the same categorical level as a baseline, so you'll see slightly different results for the dummy coefficients at each level, even though the overall model fit remains the same.
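
If you want to control which level serves as the baseline, patsy's Treatment coding lets you set the reference level explicitly inside C(); a sketch reusing the df from the question:

from statsmodels.formula.api import ols

# Pin 'Professor' as the baseline instead of the default
# (the alphabetically first level, 'Administrative')
fit2 = ols("Wage ~ C(Sex_male) + C(Job, Treatment(reference='Professor')) + Age",
           data=df).fit()
print(fit2.summary())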
