How to remove one of the dummy variables in regression

Suppose there is a categorical variable, education, with the values std_10, std_12, graduate, PG and Dr. The data set is named df, the dependent variable is Income, and there is one other independent variable, Age, which is continuous. I can create dummy variables for an OLS regression in Python using C(). However, I am unable to remove individual dummy variables that are insignificant (e.g. graduate and PG) while retaining the rest.

from statsmodels.formula.api import ols
fit = ols('Income ~ C(education) +  Age', data=df).fit() 
fit.summary()

I tried the following code but am getting an error.

fit = ols('Income ~ C(education[~[[graduate,PG]]) +  Age', data=df).fit() 

I want to exclude graduate and PG from the dummy variables and retain the rest of the variables in my model. Please help.

I'm going to ignore your comment regarding:

I don't want to convert it into numeric data. It becomes difficult to explain to the client later on.

Assuming that your main priority is insight and not how you gain that insight, here's how I would do it:


The challenge:

Your main problem seems to be that your categorical data is gathered in a single column and not encoded as dummy variables. So the gist of your challenge lies in recoding your data from one categorical column into a collection of dummy variables. pd.get_dummies() will do that for you in one line of code. Afterwards you can easily add or remove any variable you'd like in your final model.

Some data:

Since you haven't provided any sample data, here's a snippet that will produce a dataframe with some random data for Income and Age, as well as some randomly placed education levels:

Snippet 1:

import pandas as pd
import numpy as np
import statsmodels.api as sm

# Sample data
np.random.seed(123)
rows = 50
dfx = pd.DataFrame(np.random.randint(90,110,size=(rows, 1)), columns=['Income'])
dfy = pd.DataFrame(np.random.randint(25,68,size=(rows, 1)), columns=['Age'])
df = pd.concat([dfx,dfy], axis = 1)

# Categorical column
dummyVars = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(dummyVars, len(df))
print(df.tail(5))

Output 1:

Index  Income  Age   education
45     103     60    std_12
46     108     60        PG
47      94     26    std_12
48     105     41    std_10
49     101     30    std_12

Now you can use pd.get_dummies() to split the education column into multiple columns, one per level, each containing zeros and ones that indicate whether or not that level applies to a given row.

Snippet 2:

# Split dummy variables (dtype=int gives 0/1 columns; pandas 2.0 and later default to booleans)
df = pd.concat([df, pd.get_dummies(df['education'].astype('category'), prefix='d', dtype=int)], axis=1)
print(df.tail(5))

Output 2:

Index   Income  Age education  d_Dr  d_Graduate  d_PG  d_std_10  d_std_12
45      103   60    std_12     0           0     0         0         1
46      108   60        PG     0           0     1         0         0
47       94   26    std_12     0           0     0         0         1
48      105   41    std_10     0           0     0         1         0
49      101   30    std_12     0           0     0         0         1

And now you can easily see which dummy variables are significant and choose whether or not to keep them in your analysis:

Snippet 3:

# Explanatory variables, subset 1
regression1 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG', 'd_std_10', 'd_std_12']]).fit()
regression1.summary()

Output 3:

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Age           -0.0123      0.075     -0.165      0.870      -0.163       0.138
d_Dr          98.8509      3.759     26.300      0.000      91.276     106.426
d_Graduate    98.5567      4.684     21.042      0.000      89.117     107.996
d_PG          97.0613      4.109     23.622      0.000      88.780     105.342
d_std_10     100.2472      3.554     28.209      0.000      93.085     107.409
d_std_12      98.3209      3.804     25.845      0.000      90.654     105.988

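Note that sm.OLS() does not add an intercept on its own, so with all five dummies in the model each dummy coefficient is simply the estimated Income level for that group, and its p-value tests that level against zero. That is why every dummy shows P>|t| = 0.000 even though the education levels were assigned at random; to no surprise, the coefficients themselves barely differ from one another. If you decide that some dummies are not worth keeping, you can remove them and rerun your analysis like this:

Snippet 4: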

# Explanatory variables, subset 2
regression2 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG']]).fit()
regression2.summary()

Output 4:

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Age            1.9771      0.123     16.011      0.000       1.729       2.226
d_Dr          11.0105      9.601      1.147      0.257      -8.316      30.337
d_Graduate     8.5356     15.304      0.558      0.580     -22.270      39.341
d_PG           6.2942     11.543      0.545      0.588     -16.940      29.529

I hope this is something you can use. Don't hesitate to let me know if not.
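
If you would rather stay with the formula interface from the question, the dummy columns created by pd.get_dummies() can be referenced by name in the formula. The following is a rough sketch under the same assumptions as the sample data above (same seed, same column names with the d_ prefix); the formula interface adds an intercept automatically, so leaving out d_Graduate and d_PG makes them the pooled baseline.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Same style of random sample data as in Snippet 1
np.random.seed(123)
rows = 50
df = pd.DataFrame({
    'Income': np.random.randint(90, 110, size=rows),
    'Age': np.random.randint(25, 68, size=rows),
})
levels = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(levels, len(df))

# Append one 0/1 column per education level: d_Dr, d_Graduate, d_PG, d_std_10, d_std_12
df = pd.concat([df, pd.get_dummies(df['education'], prefix='d', dtype=int)], axis=1)

# List only the dummies to retain; the omitted ones (Graduate, PG) form the baseline
fit = ols('Income ~ Age + d_Dr + d_std_10 + d_std_12', data=df).fit()
print(fit.summary())
```

The education column itself stays untouched as text, so the original labels remain available when presenting the results.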


Here's the whole thing for easy copy and paste:

#%%
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Sample data
np.random.seed(123)
rows = 50
dfx = pd.DataFrame(np.random.randint(90,110,size=(rows, 1)), columns=['Income'])
dfy = pd.DataFrame(np.random.randint(25,68,size=(rows, 1)), columns=['Age'])
df = pd.concat([dfx,dfy], axis = 1)

# Categorical column
dummyVars = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(dummyVars, len(df))
print(df.tail(5))
#%%

# Split dummy variables (dtype=int gives 0/1 columns; pandas 2.0 and later default to booleans)
df = pd.concat([df, pd.get_dummies(df['education'].astype('category'), prefix='d', dtype=int)], axis=1)
print(df.tail(5))

# Explanatory variables, subset 1
regression1 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG', 'd_std_10', 'd_std_12']]).fit()
regression1.summary()

# Explanatory variables, subset 2
regression2 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG']]).fit()
regression2.summary()
