How to include interactions of factors with h2o.interaction and h2o.glm in R h2o package
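I would like to run a logistic regression with h2o.glm that includes some interactions between factors. However, simply calling h2o.interaction and then h2o.glm puts too many dummy variables into the regression. Here is a reproducible example.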
# model.matrix function in R returns a matrix
# with the intercept, 1 dummy for Age, 1 dummy for Sex, and 1 dummy for Age:Sex
colnames(model.matrix(Survived ~ Age + Sex + Age:Sex, data = Titanic))
[1] "(Intercept)" "AgeAdult" "SexFemale" "AgeAdult:SexFemale"
# create an H2OFrame with the interaction of Age and Sex as a factor
library(h2o)
h2o.init()
Titanic.hex <- as.h2o(Titanic)
interact.hex <- h2o.cbind(
  Titanic.hex[, c("Survived", "Age", "Sex")],
  h2o.interaction(Titanic.hex,
                  factors = list(c("Age", "Sex")),
                  pairwise = TRUE,
                  max_factors = 99,
                  min_occurrence = 1)
)
# Age_Sex interaction column has 4 levels
h2o.levels(interact.hex$Age_Sex)
[1] "Child_Male" "Child_Female" "Adult_Male" "Adult_Female"
# Because Age_Sex interaction column has 4 levels
# we end up with 3 dummies to represent Age:Sex
interact.h2o.glm <- h2o.glm(x = 2:ncol(interact.hex),
                            y = "Survived",
                            training_frame = interact.hex,
                            family = 'binomial',
                            lambda = 0)
h2o.varimp(interact.h2o.glm)$names
[1] "Age_Sex.Child_Female" "Age_Sex.Adult_Male" "Age_Sex.Adult_Female" "Sex.Male"
[5] "Age.Child" ""
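What is a good way to build interactions between factors in h2o so that h2o.glm behaves like model.matrix? In the example above, I would like to see only 1 dummy variable for the interaction between Age and Sex, not 3.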
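Background: what you are seeing is one-hot encoding. Linear models can only deal with numbers, not categories. (The same goes for deep learning.) So a boolean variable is created for each category, i.e. for each factor level. For example, gender_male will be 1 if the sex is male and 0 otherwise, and gender_female will be 1 if the sex is female and 0 otherwise. When you add an interaction, you get a boolean for every possible combination of categories.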
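H2O's deep learning algorithm takes use_all_factor_levels as an argument, and it defaults to true. If you set it to false, one level of each factor is left out and represented implicitly, as the case where all the other indicators are 0. For a two-level factor this means you get just one column, e.g. 0 for male and 1 for female. That gives the reduction in columns you were expecting.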
Unfortunately, h2o.glm() does not currently have that option, and as far as I can tell neither does h2o.interaction().
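To make the two encodings described above concrete, here is a minimal sketch in base R (no H2O cluster needed). It uses model.matrix on the built-in Titanic table only as an illustration of full one-hot coding versus reference-level coding; it is not a reproduction of what H2O does internally.

```r
# Base R illustration of one-hot coding versus reference-level coding.
df <- as.data.frame(Titanic)   # 32 rows: Class, Sex, Age, Survived, Freq

# Full one-hot: without an intercept, every level of Sex gets its own indicator.
one_hot <- model.matrix(~ Sex - 1, data = df)
print(colnames(one_hot))       # "SexMale" "SexFemale"

# Reference coding: with an intercept, the first level (Male) is implicit.
ref_coded <- model.matrix(~ Sex, data = df)
print(colnames(ref_coded))     # "(Intercept)" "SexFemale"

# A crossed factor, comparable to the Age_Sex column, has one level per combination.
combo <- interaction(df$Age, df$Sex, sep = "_")
print(nlevels(combo))          # 4
```

With reference coding, the four-level crossed factor would contribute 3 dummies, which matches the 3 Age_Sex columns seen in the question.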
You can simulate it yourself with h2o.ifelse() and h2o.cbind(). For example:
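interact.hex <- h2o.cbind(
  Titanic.hex[, c("Class", "Survived")],
  h2o.ifelse(Titanic.hex$Age == "Adult", 1, 0),    # 1 if Adult, else 0
  h2o.ifelse(Titanic.hex$Sex == "Female", 1, 0)    # 1 if Female, else 0
)
# C1 and C10 are presumably the names the two unnamed indicator columns
# received in this session; check colnames(interact.hex) if yours differ.
interact.hex <- h2o.cbind(
  interact.hex,
  h2o.ifelse(interact.hex$C1 == 1 && interact.hex$C10 == 1, 1, 0)    # 1 only for Adult and Female
)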
But that is a bit tedious, isn't it? The columns can be renamed afterwards.
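The hand-built columns above target the same single dummy that model.matrix produces. A small base R check, offered only as a sketch (it uses plain data frames rather than H2OFrames, and the variable names adult, female and adult_female are made up for illustration):

```r
# Base R check: the interaction dummy is the product of the two indicator columns.
df <- as.data.frame(Titanic)

adult  <- as.numeric(df$Age == "Adult")    # 1 if Adult, else 0
female <- as.numeric(df$Sex == "Female")   # 1 if Female, else 0
adult_female <- adult * female             # 1 only when both indicators are 1

# Compare with the interaction column that model.matrix builds from the formula.
mm <- model.matrix(Survived ~ Age + Sex + Age:Sex, data = df)
same <- all(mm[, "AgeAdult:SexFemale"] == adult_female)
print(same)
```

Because both indicators are 0/1, multiplying them is equivalent to the "both equal 1" test in the H2O snippet.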
Posting my own workaround here, which meets my needs. I would still be happy to see a more elegant or more built-in answer, though.
# create H2OFrame and interact as in the question
Titanic.hex <- as.h2o(Titanic)
interact.hex <- h2o.cbind(
  Titanic.hex[, c("Survived", "Age", "Sex")],
  h2o.interaction(Titanic.hex,
                  factors = list(c("Age", "Sex")),
                  pairwise = TRUE,
                  max_factors = 99,
                  min_occurrence = 1)
)
# Define a function that collapses interaction levels
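# Requires the data.table package for CJ() (cross join of the level vectors)
collapse_level1_interacts <- function(df, column, col1, col2) {
  # every combination that involves the first level of col1 or of col2
  level1 <- rbind(
    data.table::CJ(h2o.levels(df[, col1])[1], h2o.levels(df[, col2])),
    data.table::CJ(h2o.levels(df[, col1]), h2o.levels(df[, col2])[1])
  )
  level1 <- paste(level1$V1, level1$V2, sep = "_")
  # merge those combinations into a single placeholder level, '00000'
  df[, column] <- h2o.ifelse(df[, column] %in% level1, '00000', df[, column])
  return(df)
}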
# Run the H2oFrame through the function
interact.hex2 <- collapse_level1_interacts(interact.hex, "Age_Sex", "Age", "Sex")
# Verify that we have only 2 levels for interaction
h2o.levels(interact.hex2$Age_Sex)
[1] "00000" "Child_Male"
# Verify that we have only 1 dummy for the interaction
interact.h2o.glm <- h2o.glm(x = 2:ncol(interact.hex2),
                            y = "Survived",
                            training_frame = interact.hex2,
                            family = 'binomial',
                            lambda = 0)
h2o.varimp(interact.h2o.glm)$names
[1] "Age.Child" "Sex.Male" "Age_Sex.Child_Male" ""
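The idea behind the collapsing step can be checked without H2O. The sketch below repeats it in base R under the assumption that "first level" means the first element of levels(); in base R that is Child and Male, so the surviving level is Adult_Female rather than Child_Male as in the H2O output above. The helper column name Age_Sex is reused only for readability.

```r
# Base R sketch of the collapsing idea: every combination that touches a
# reference level is merged into one placeholder level.
df <- as.data.frame(Titanic)

combo  <- paste(df$Age, df$Sex, sep = "_")
is_ref <- df$Age == levels(df$Age)[1] | df$Sex == levels(df$Sex)[1]
df$Age_Sex <- factor(ifelse(is_ref, "00000", combo))
print(levels(df$Age_Sex))   # two levels remain

# The collapsed factor adds exactly one dummy, and the resulting design matrix
# has the same entries as the one built from the Age:Sex formula term.
mm_collapsed <- model.matrix(~ Age + Sex + Age_Sex, data = df)
mm_formula   <- model.matrix(~ Age + Sex + Age:Sex, data = df)
identical_entries <- all(unname(mm_collapsed) == unname(mm_formula))
print(identical_entries)
```

Which combination survives depends only on the level ordering, so the fitted model is the same up to the choice of reference levels.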