
How to remove one of the dummy variables in regression

Suppose there is a categorical variable, education, with the values std_10, std_12, graduate, PG and Dr. The data set is named df, the dependent variable is Income, and there is one other independent variable, the continuous Age. I can create dummy variables for an OLS regression in Python using C(). However, I am unable to remove some of the dummy variables (e.g. graduate and PG, which are insignificant) while retaining the rest.

from statsmodels.formula.api import ols
fit = ols('Income ~ C(education) +  Age', data=df).fit() 
fit.summary()

I tried using the following code but am getting an error.

fit = ols('Income ~ C(education[~[[graduate,PG]]) +  Age', data=df).fit() 

I want to exclude graduate and PG from the dummy variables and retain rest of the variables in my model. Please help.

I'm going to ignore your comment regarding:

"I don't want to convert it into numeric data. It becomes difficult to explain to the client later on."

Assuming that your main priority is insight and not how you gain that insight, here's how I would do it:


The challenge:

Your main problem seems to be that your categorical data is gathered in a single column and not encoded as dummy variables. So the gist of your challenge lies in recoding your data from one column of categories into a collection of dummy variables. pd.get_dummies() will do that for you in one line of code. Afterwards you can add or remove any variable in your final model simply by listing the columns you want.

Some data:

Since you haven't provided any sample data, here's a snippet that will produce a dataframe with some random data for Income and Age, as well as some randomly placed education levels:

Snippet 1:

import pandas as pd
import numpy as np
import statsmodels.api as sm

# Sample data
np.random.seed(123)
rows = 50
dfx = pd.DataFrame(np.random.randint(90,110,size=(rows, 1)), columns=['Income'])
dfy = pd.DataFrame(np.random.randint(25,68,size=(rows, 1)), columns=['Age'])
df = pd.concat([dfx,dfy], axis = 1)

# Categorical column
dummyVars = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(dummyVars, len(df))
print(df.tail(5))

Output 1:

Index  Income  Age   education
45     103     60    std_12
46     108     60        PG
47      94     26    std_12
48     105     41    std_10
49     101     30    std_12

Now you can use pd.get_dummies() to split the education column into multiple columns with each level as an individual column containing zeros and ones indicating whether or not the dummy variable occurs for a given index.

Snippet 2:

# Split dummy variables (dtype=int gives 0/1 columns; newer pandas versions default to bool)
df = pd.concat([df, pd.get_dummies(df['education'].astype('category'), prefix = 'd', dtype = int)], axis = 1)
print(df.tail(5))

Output 2:

Index   Income  Age education  d_Dr  d_Graduate  d_PG  d_std_10  d_std_12
45      103   60    std_12     0           0     0         0         1
46      108   60        PG     0           0     1         0         0
47       94   26    std_12     0           0     0         0         1
48      105   41    std_10     0           0     0         1         0
49      101   30    std_12     0           0     0         0         1
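If you already know which levels you want to leave out, one possible way is to drop their dummy columns by name before building the model matrix. This is only a minimal sketch: the demo frame and the unwanted list are made up for illustration, and the column names assume the prefix 'd' used above.

```python
import numpy as np
import pandas as pd

# Small hypothetical data set with the same column names as above
levels = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
demo = pd.DataFrame({
    'Income': np.arange(90, 100),
    'Age': np.arange(30, 40),
    'education': np.tile(levels, 2),
})

# One 0/1 column per education level
dummies = pd.get_dummies(demo['education'], prefix='d', dtype=int)

# Drop the levels that should not appear as separate regressors
unwanted = ['d_Graduate', 'd_PG']
X = pd.concat([demo[['Age']], dummies.drop(columns=unwanted)], axis=1)
print(X.columns.tolist())
```

The resulting X can be passed to sm.OLS in the same way as the column lists in the snippets that follow.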

And now you can fit a model on all the dummies, see which ones are significant, and choose whether or not to keep them in your analysis:

Snippet 3:

# Explanatory variables, subset 1
regression1 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG', 'd_std_10', 'd_std_12']]).fit()
regression1.summary()

Output 3:

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Age           -0.0123      0.075     -0.165      0.870      -0.163       0.138
d_Dr          98.8509      3.759     26.300      0.000      91.276     106.426
d_Graduate    98.5567      4.684     21.042      0.000      89.117     107.996
d_PG          97.0613      4.109     23.622      0.000      88.780     105.342
d_std_10     100.2472      3.554     28.209      0.000      93.085     107.409
d_std_12      98.3209      3.804     25.845      0.000      90.654     105.988

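To no surprise on a (small) random sample, Age is insignificant. The dummies all show p-values of 0.000, but only because this model has no separate intercept: each dummy coefficient is the baseline income for its education level, so the t-test merely confirms that the level is different from zero, not that the levels differ from one another. You could still choose to remove some of the dummies and rerun your analysis like this: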

Snippet 4:

# Explanatory variables, subset 2
regression2 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG']]).fit()
regression2.summary()

Output 4:

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Age            1.9771      0.123     16.011      0.000       1.729       2.226
d_Dr          11.0105      9.601      1.147      0.257      -8.316      30.337
d_Graduate     8.5356     15.304      0.558      0.580     -22.270      39.341
d_PG           6.2942     11.543      0.545      0.588     -16.940      29.529

I hope this is something you can use. Don't hesitate to let me know if not.
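If you would rather stay with the formula interface from the question, a possible alternative is to merge the unwanted levels into a single baseline group and tell C() to use that group as the reference. This is a sketch under assumptions: the label 'other', the column name edu_grp and the demo data are all made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical sample data; every education level appears 10 times
rng = np.random.default_rng(0)
levels = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
demo = pd.DataFrame({
    'Income': rng.integers(90, 110, size=50),
    'Age': rng.integers(25, 68, size=50),
    'education': np.tile(levels, 10),
})

# Merge the unwanted levels into one group (the label 'other' is arbitrary)
merged = ['Graduate', 'PG']
demo['edu_grp'] = demo['education'].where(~demo['education'].isin(merged), 'other')

# Use the merged group as the reference level, so it gets no dummy of its own
formula = "Income ~ C(edu_grp, Treatment(reference='other')) + Age"
fit = ols(formula, data=demo).fit()
print(fit.params)
```

The fitted model then contains an intercept, Age, and one dummy each for Dr, std_10 and std_12, while Graduate and PG are absorbed into the baseline.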


Here's the whole thing for an easy copy&paste:

#%%
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Sample data
np.random.seed(123)
rows = 50
dfx = pd.DataFrame(np.random.randint(90,110,size=(rows, 1)), columns=['Income'])
dfy = pd.DataFrame(np.random.randint(25,68,size=(rows, 1)), columns=['Age'])
df = pd.concat([dfx,dfy], axis = 1)

# Categorical column
dummyVars = ['std_10', 'std_12', 'Graduate', 'PG', 'Dr']
df['education'] = np.random.choice(dummyVars, len(df))
print(df.tail(5))
#%%

# Split dummy variables (dtype=int gives 0/1 columns; newer pandas versions default to bool)
df = pd.concat([df, pd.get_dummies(df['education'].astype('category'), prefix = 'd', dtype = int)], axis = 1)
print(df.tail(5))

# Explanatory variables, subset 1
regression1 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG', 'd_std_10', 'd_std_12']]).fit()
regression1.summary()

# Explanatory variables, subset 2
regression2 = sm.OLS(df['Income'], df[['Age', 'd_Dr', 'd_Graduate', 'd_PG']]).fit()
regression2.summary()

