
Variable Importance Dummy Variables R

How can I determine variable importance (with the vip package in R) for categorical predictors when they have been one-hot encoded? It seems impossible to do this in R when the model is built on the dummy variables rather than on the original categorical predictor.

I will demonstrate what I mean with the Ames Housing dataset. I am going to use two categorical predictors: Street (two levels) and Sale.Type (ten levels). I converted them from characters to factors.

library(AmesHousing)
df <- data.frame(ames_raw)

# convert characters to factors 
df <- df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
            data = df,
            method = "lm")

vip(mod_lm)


The variable importance plot ranks each dummy level separately rather than the original predictors. I can see that StreetPave is important, but I cannot see whether Street as a whole is important.

From the caret documentation, we see that variable importance in linear models corresponds to the absolute value of the t-statistic for each covariate. So, we can manually compute it, as I do in the code below.

lm() automatically converts categorical variables into dummy variables. So, to get the importance of each original covariate, we have to sum over its dummies. I did not find a way to automate this, so if you want to apply my solution to a different set of variables, you need to be careful in choosing the items of t.stats to be summed.

Finally, we can use the results for plotting. I just used the base R barplot() function, but you can customize it as you want (maybe also using the ggplot2 package for better visualization).

P.S. When you provide a reproducible example, remember to load all the needed packages.

P.P.S. Summing over dummies may be sensitive to the base level of the dummy we are using (i.e., the level we omit from the regression). I do not know if that could be an issue.
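The base-level concern can be checked on a small example. The sketch below is not part of the original answer: it assumes a plain lm() fit (not caret's train()) on the warpbreaks data that ships with R, where wool (two levels) and tension (three levels) stand in for Street and Sale.Type. It refits the same model with a different omitted level and compares the summed |t| values, then computes one F-test per original predictor with drop1() as a possible alternative that does not depend on the base level.

```r
# Toy data shipped with R; wool (2 levels) and tension (3 levels) stand in
# for Street and Sale.Type. Plain lm() is used here, not caret::train().
fit_a <- lm(breaks ~ wool + tension, data = warpbreaks)

# Same model, but with "H" instead of "L" as the omitted tension level.
wb_releveled <- transform(warpbreaks, tension = relevel(tension, ref = "H"))
fit_b <- lm(breaks ~ wool + tension, data = wb_releveled)

# Summed |t| over the tension dummies for a given fit.
sum_abs_t <- function(fit) {
  t_abs <- abs(coef(summary(fit))[, "t value"])
  sum(t_abs[grepl("^tension", names(t_abs))])
}
summed_a <- sum_abs_t(fit_a)
summed_b <- sum_abs_t(fit_b)

# One F-test per original predictor: drop each term in turn and compare fits.
f_a <- drop1(fit_a, test = "F")
f_b <- drop1(fit_b, test = "F")

print(c(base_L = summed_a, base_H = summed_b))
print(f_a)
```

On this data the two summed |t| values differ, while the per-term F statistics are the same under both codings, so a term-level F statistic may be a more stable way to rank whole predictors.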

library(AmesHousing)
library(caret)
library(dplyr)

df = data.frame(ames_raw)

# convert characters to factors
df = df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
                data = df,
                method = "lm")

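# Manually compute variable importance from the t-statistics of the model.
# caret defines importance for linear models as |t|, so take absolute values
# before summing; otherwise dummies with opposite signs cancel each other out.
t.stats = abs(coef(summary(mod_lm))[, "t value"])

# Coefficient 1 is the intercept, 2 is StreetPave, the rest are Sale.Type dummies.
imp.sale = sum(t.stats[-(1:2)])
imp.street = t.stats[2]

# Plotting.
barplot(c(imp.sale, imp.street), names.arg = c("Sale", "Street"))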
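The manual index selection above could probably be automated. The following is a minimal sketch, not the original author's method: it assumes a plain lm() fit rather than a caret train object, and uses the warpbreaks data that ships with R as a stand-in for the Ames data. It relies on the "assign" attribute of model.matrix(), which records which formula term each dummy column came from. With caret's formula interface the final model is fitted on already-expanded dummies, so the mapping would presumably have to be built from the original formula instead; that case is not covered here.

```r
# Toy data shipped with R; wool (2 levels) and tension (3 levels) stand in
# for Street and Sale.Type. Plain lm() is used here, not caret::train().
fit <- lm(breaks ~ wool + tension, data = warpbreaks)

# |t| for every estimated coefficient, intercept included.
t_abs <- abs(coef(summary(fit))[, "t value"])

# model.matrix() stores, for each column, the index of the term it belongs to
# in the "assign" attribute (0 means the intercept).
mm <- model.matrix(fit)
term_id <- attr(mm, "assign")
names(term_id) <- colnames(mm)
term_labels <- attr(terms(fit), "term.labels")

# Keep only coefficients that were actually estimated, minus the intercept,
# then sum |t| within each original predictor.
keep <- setdiff(names(t_abs), "(Intercept)")
imp_by_term <- tapply(t_abs[keep], term_labels[term_id[keep]], sum)

print(imp_by_term)
barplot(imp_by_term)
```

Because the grouping comes from the model itself, adding or removing predictors in the formula should not require changing any hard-coded indices.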
