Variable Importance Dummy Variables R

How can I determine variable importance (vip package in R) for categorical predictors when they have been one-hot encoded? It seems impossible for R to do this when the model is built on the dummy variables rather than the original categorical predictor.

I will demonstrate what I mean with the Ames Housing dataset. I am going to use two categorical predictors: Street (two levels) and Sale.Type (ten levels). I converted them from characters to factors.

library(AmesHousing)
library(caret)   # train()
library(dplyr)   # %>% and mutate_if()
library(vip)     # vip()

df <- data.frame(ames_raw)

# convert characters to factors 
df <- df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
            data = df,
            method = "lm")

vip(mod_lm)

The variable importance plot ranks each level separately, rather than the original predictor. I can see that StreetPave is important, but I cannot see whether Street as a whole is important.

From the caret documentation, we see that variable importance in linear models corresponds to the absolute value of the t-statistic for each model parameter, so each dummy gets its own score. We can therefore compute it manually, as I do in the code below.

lm() automatically converts categorical variables into dummies. So, to get the importance of each original covariate, we have to sum over its dummies. I did not find a way to automate this, so if you want to apply my solution to a different set of variables, you need to be careful in choosing which items of t.stats to sum.
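For what it's worth, one possible way to automate the grouping is to use the "assign" attribute of the model matrix, which records which term each dummy column came from. The sketch below is only an illustration under some assumptions: it uses a plain lm() fit rather than a caret train object, the built-in iris data rather than Ames, and grouped_importance is a hypothetical helper name, not a function from caret or vip.

```r
# Hypothetical helper (not part of caret or vip): sum |t| over the dummies
# that belong to the same original term of a plain lm() fit.
grouped_importance <- function(fit) {
  # Absolute t-statistics, one per estimated coefficient
  t_abs <- abs(coef(summary(fit))[, "t value"])

  # "assign" maps each model-matrix column to the index of its term
  mm <- model.matrix(fit)
  assign_idx <- attr(mm, "assign")
  names(assign_idx) <- colnames(mm)
  term_labels <- attr(terms(fit), "term.labels")

  # Drop the intercept, then look up the term of every remaining coefficient
  keep <- names(t_abs)[names(t_abs) != "(Intercept)"]
  groups <- term_labels[assign_idx[keep]]

  tapply(t_abs[keep], groups, sum)
}

# Illustrative data only: Species is a three-level factor (two dummies)
fit <- lm(Sepal.Length ~ Species + Petal.Width, data = iris)
grouped_importance(fit)
```

Matching by coefficient name rather than by position means the order of terms in the formula does not matter, and coefficients that summary() drops are simply skipped.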

Finally, we can use the results for plotting. I just used the base barplot() function, but you can customize it as you want (maybe also using the ggplot2 package for better visualization).

P.S. When you provide a reproducible example, remember to load all the needed packages.

P.P.S. Summing over dummies may be sensitive to the base level of the factor (i.e., the level we omit from the regression). I do not know if that could be an issue.
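If the base-level sensitivity is a concern, one alternative worth considering is a per-term F-test, which scores each factor as a whole and does not depend on which level is omitted. This swaps in a different technique from the summed t-statistics used in this answer. The sketch below uses drop1() from base R on a plain lm() fit with the built-in iris data, both chosen only for illustration.

```r
# Per-term F-statistics: each row tests dropping one whole term from the model
fit <- lm(Sepal.Length ~ Species + Petal.Width, data = iris)
term_tests <- drop1(fit, test = "F")

# The first row is "<none>" (the full model), so skip it
f_by_term <- term_tests[-1, "F value"]
names(f_by_term) <- rownames(term_tests)[-1]
f_by_term

# Refit with a different reference level for the factor
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "virginica")
fit2 <- lm(Sepal.Length ~ Species + Petal.Width, data = iris2)
f2 <- drop1(fit2, test = "F")[-1, "F value"]

# The two sets of F-statistics are expected to agree, because the test
# compares nested models and not individual dummy coefficients
all.equal(unname(f_by_term), f2)
```

Note that F-statistics and summed |t| values are on different scales, so the two rankings are not directly comparable.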

library(AmesHousing)
library(caret)
library(dplyr)

df = data.frame(ames_raw)

# convert characters to factors
df = df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
                data = df,
                method = "lm")

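# Manually computing variable importance from the absolute t-statistics of the model.
# Coefficient order: (Intercept), StreetPave, then the Sale.Type dummies.
t.stats = abs(coef(summary(mod_lm))[, "t value"])
imp.sale = sum(t.stats[-(1:2)])  # sum |t| over all Sale.Type dummies
imp.street = t.stats[2]          # Street has a single dummy (StreetPave)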

# Plotting.
barplot(c(imp.sale, imp.street), names.arg = c("Sale", "Street"))
