How to remove one of the dummy variables in regression
Suppose there is a categorical variable, education, with the values std_10, std_12, graduate, PG and Dr. The data set is named df, the dependent variable is Income, and there is another independent, continuous variable, Age. I can create dummy variables in Python for an OLS regression using C(). However, I am unable to remove some of the dummy variables that are insignificant (e.g. graduate and PG) while retaining the rest.
from statsmodels.formula.api import ols
fit = ols('Income ~ C(education) + Age', data=df).fit()
fit.summary()
I tried the following code, but I get an error.
fit = ols('Income ~ C(education[~[[graduate,PG]]) + Age', data=df).fit()
I want to exclude graduate and PG from the dummy variables and retain the rest of the variables in my model. Please help.
I'm going to ignore your comment regarding:
I don't want to convert it into numeric data. It becomes difficult to explain to the client later on.
Assuming that your main priority is insight and not how you gain that insight, here's how I would do it:
The challenge:
Your main problem seems to be that your categorical data is gathered in a single column and not encoded as dummy variables. So the gist of your challenge lies in recoding your data from a column of categorical values to a collection of dummy variables. pd.get_dummies() will do that for you in one line of code. Afterwards you can easily add or remove any variable you'd like in your final model.
Some data:
Since you haven't provided any sample data, here's a snippet that will produce a dataframe with some random data for Income and Age, as well as some randomly placed education levels:
Snippet 1:
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Sample data
np.random.seed(123)
rows = 50
dfx = pd.DataFrame(np.random.randint(90,110,size=(rows, 1)), columns=['Income'])
dfy = pd.DataFrame(np.random.randint(25,68,size=(rows, 1)), columns=['Age'])
df = pd.concat([dfx,dfy], axis = 1)
# Categorical column
dummyVars = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(dummyVars, len(df))
print(df.tail(5))
Output 1:
Index Income Age education
45 103 60 std_12
46 108 60 PG
47 94 26 std_12
48 105 41 std_10
49 101 30 std_12
Now you can use pd.get_dummies() to split the education column into multiple columns, one per level, each containing zeros and ones that indicate whether or not that level occurs for a given index.
Snippet 2:
# Split dummy variables
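# dtype=int keeps the dummies as 0/1; recent pandas versions return booleans by default
df = pd.concat([df, pd.get_dummies(df['education'].astype('category'), prefix = 'd', dtype = int)], axis = 1)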
print(df.tail(5))
Output 2:
Index Income Age education d_Dr d_Graduate d_PG d_std_10 d_std_12
45 103 60 std_12 0 0 0 0 1
46 108 60 PG 0 0 1 0 0
47 94 26 std_12 0 0 0 0 1
48 105 41 std_10 0 0 0 1 0
49 101 30 std_12 0 0 0 0 1
And now you can easily see which dummy variables are significant and choose whether or not to keep them in your analysis:
Snippet 3:
# Explanatory variables, subset 1
regression1 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG', 'd_std_10', 'd_std_12']]).fit()
regression1.summary()
Output 3:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Age -0.0123 0.075 -0.165 0.870 -0.163 0.138
d_Dr 98.8509 3.759 26.300 0.000 91.276 106.426
d_Graduate 98.5567 4.684 21.042 0.000 89.117 107.996
d_PG 97.0613 4.109 23.622 0.000 88.780 105.342
d_std_10 100.2472 3.554 28.209 0.000 93.085 107.409
d_std_12 98.3209 3.804 25.845 0.000 90.654 105.988
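Unsurprisingly, since Income is random here, education tells us nothing real about it. The near-zero p-values in Output 3 do not contradict this: the model has no intercept, so each dummy coefficient is roughly the income level of that group (around 100), and the test only says that this level differs from zero. In a real analysis, you could choose to remove the least significant variables and rerun your analysis like this:
Snippet 4: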
# Explanatory variables, subset 2
regression2 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG']]).fit()
regression2.summary()
Output 4:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Age 1.9771 0.123 16.011 0.000 1.729 2.226
d_Dr 11.0105 9.601 1.147 0.257 -8.316 30.337
d_Graduate 8.5356 15.304 0.558 0.580 -22.270 39.341
d_PG 6.2942 11.543 0.545 0.588 -16.940 29.529
I hope this is something you can use. Don't hesitate to let me know if not.
Here's the whole thing for an easy copy and paste:
#%%
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Sample data
np.random.seed(123)
rows = 50
dfx = pd.DataFrame(np.random.randint(90,110,size=(rows, 1)), columns=['Income'])
dfy = pd.DataFrame(np.random.randint(25,68,size=(rows, 1)), columns=['Age'])
df = pd.concat([dfx,dfy], axis = 1)
# Categorical column
dummyVars = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(dummyVars, len(df))
print(df.tail(5))
#%%
# Split dummy variables
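# dtype=int keeps the dummies as 0/1; recent pandas versions return booleans by default
df = pd.concat([df, pd.get_dummies(df['education'].astype('category'), prefix = 'd', dtype = int)], axis = 1)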
print(df.tail(5))
# Explanatory variables, subset 1
regression1 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG', 'd_std_10', 'd_std_12']]).fit()
regression1.summary()
# Explanatory variables, subset 2
regression2 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG']]).fit()
regression2.summary()
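If you would rather stay with the formula interface and C() from the question, one possible alternative (not part of the approach above) is to merge the levels you want to exclude into a single level and make that level the reference category. The sketch below assumes patsy's Treatment(reference=...) coding; the column name education_grouped and the level name 'other' are hypothetical, and the data is again random.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Illustrative random data; every education level appears 10 times
rng = np.random.default_rng(123)
levels = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
rows = 50
df = pd.DataFrame({
    'Income': rng.integers(90, 110, size=rows),
    'Age': rng.integers(25, 68, size=rows),
    'education': rng.permutation(np.repeat(levels, rows // len(levels))),
})

# Merge the levels to exclude into one hypothetical 'other' level
df['education_grouped'] = df['education'].replace(
    {'Graduate': 'other', 'PG': 'other'}
)

# 'other' is the reference level, so it gets no dummy of its own
fit_grouped = ols(
    "Income ~ C(education_grouped, Treatment(reference='other')) + Age",
    data=df,
).fit()

print(fit_grouped.params)
```

The fitted model then has an intercept, Age, and one dummy each for Dr, std_10 and std_12, while the education column itself stays categorical.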