
How to include interactions of factors with h2o.interaction and h2o.glm in R h2o package

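I want to use h2o.glm to fit a logistic regression that includes some interactions between factors. However, simply calling h2o.interaction and then h2o.glm puts too many dummy variables into the regression. Here is a reproducible example.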

# model.matrix function in R returns a matrix 
# with the intercept, 1 dummy for Age, 1 dummy for Sex, and 1 dummy for Age:Sex
colnames(model.matrix(Survived ~ Age + Sex + Age:Sex, data = Titanic))
[1] "(Intercept)"        "AgeAdult"           "SexFemale"          "AgeAdult:SexFemale"

# create an H2OFrame with the interaction of Age and Sex as a factor
library(h2o)
h2o.init()
Titanic.hex <- as.h2o(Titanic)
interact.hex <- h2o.cbind(Titanic.hex[,c("Survived","Age","Sex")]
                          ,h2o.interaction(Titanic.hex
                          ,factors = list(c("Age", "Sex"))
                          ,pairwise = T
                          ,max_factors = 99
                          ,min_occurrence = 1))

# Age_Sex interaction column has 4 levels
h2o.levels(interact.hex$Age_Sex)
[1] "Child_Male"   "Child_Female" "Adult_Male"   "Adult_Female"

# Because Age_Sex interaction column has 4 levels 
# we end up with 3 dummies to represent Age:Sex
interact.h2o.glm <- h2o.glm(2:ncol(interact.hex)
                            ,"Survived"
                            ,interact.hex
                            ,family = 'binomial'
                            ,lambda = 0)
h2o.varimp(interact.h2o.glm)$names
[1] "Age_Sex.Child_Female" "Age_Sex.Adult_Male"   "Age_Sex.Adult_Female" "Sex.Male"            
[5] "Age.Child"            ""

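What is a good way to build interactions between factors in h2o so that h2o.glm behaves like model.matrix? In the example above, I would like to see only 1 dummy variable for the interaction between Age and Sex, not 3.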

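Background: what you are seeing is one-hot encoding. Linear models can only deal with numbers, not with categories. (The same is true of deep learning.) So a boolean variable is created for each category, i.e. for each factor level. For example, gender_male will be 1 if the sex is male and 0 otherwise, and gender_female will be 1 if the sex is female and 0 otherwise. When you add interactions, you get one boolean for each possible combination of categories.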
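To make the difference concrete, here is a minimal base R sketch (no H2O involved) that contrasts full one-hot encoding with the reference-level coding that model.matrix uses by default. The toy data frame and its column names are made up for illustration.

```r
# Toy data, invented for this illustration
df <- data.frame(
  gender = factor(c("male", "female", "female", "male")),
  age    = factor(c("adult", "child", "adult", "child"))
)

# Full one-hot encoding: one indicator column per level (intercept removed)
one_hot <- model.matrix(~ gender - 1, data = df)
colnames(one_hot)

# Reference-level (treatment) coding: the first level is absorbed by the intercept
ref_coded <- model.matrix(~ gender, data = df)
colnames(ref_coded)

# One-hot encoding of an interaction: one column per combination of levels
pair_one_hot <- model.matrix(~ interaction(age, gender) - 1, data = df)
ncol(pair_one_hot)  # 2 levels x 2 levels = 4 columns
```

With two 2-level factors, the one-hot interaction has 4 columns, while a model with an intercept and both main effects needs only 1 extra column for the interaction. That is the gap the question is asking about.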

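H2O's deep learning algorithm takes use_all_factor_levels as an argument, and it defaults to true. If you set it to false, one level of each factor is left implicit, acting as the reference level. For a two-level factor that means you get just one column, e.g. 0 for male and 1 for female. That would give the reduction in fields you are expecting.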

Unfortunately, h2o.glm() does not have that option currently, and as far as I can tell neither does h2o.interaction().

You could simulate it yourself using h2o.ifelse() and h2o.cbind(). For example:

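# 0/1 indicators for the non-reference level of each factor
interact.hex <- h2o.cbind(
  Titanic.hex[,c("Class","Survived")],
  h2o.ifelse(Titanic.hex$Age == "Adult", 1, 0),
  h2o.ifelse(Titanic.hex$Sex == "Female", 1, 0)
)
# Interaction dummy: 1 only when both indicators are 1.
# C1 and C10 are the auto-generated names of the two indicator columns as
# given in this answer; check names(interact.hex) before relying on them.
interact.hex <- h2o.cbind(
  interact.hex,
  h2o.ifelse(interact.hex$C1 == 1 && interact.hex$C10 == 1, 1, 0)
)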

But that is a bit tedious, isn't it? The columns can be renamed afterwards.
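As a sanity check on the idea, the following base R sketch (it does not use H2O) shows that a hand-built "both indicators are 1" column is exactly the single interaction dummy that model.matrix produces. The names titanic_df, mm and manual are introduced here for illustration only.

```r
# The Titanic table ships with base R; convert it to a data frame of factors
titanic_df <- as.data.frame(Titanic)

# Design matrix with main effects and the Age:Sex interaction
mm <- model.matrix(Survived ~ Age + Sex + Age:Sex, data = titanic_df)

# Hand-built interaction dummy: 1 only for adult females
manual <- as.numeric(titanic_df$Age == "Adult" & titanic_df$Sex == "Female")

# The two columns agree row by row
all(mm[, "AgeAdult:SexFemale"] == manual)
```

The same product-of-indicators construction is what the h2o.ifelse() calls above are emulating on an H2OFrame.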

Posting my own workaround here, which does what I need. However, I would still be happy to see a more elegant or more built-in answer.

# create H2OFrame and interact as in the question
Titanic.hex <- as.h2o(Titanic)
interact.hex <- h2o.cbind(Titanic.hex[,c("Survived","Age","Sex")]
                          ,h2o.interaction(Titanic.hex
                          ,factors = list(c("Age", "Sex"))
                          ,pairwise = T
                          ,max_factors = 99
                          ,min_occurrence = 1))

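# Define a function that collapses interaction levels
# (requires the data.table package for CJ(), a cross join of its arguments)
collapse_level1_interacts <- function(df, column, col1, col2){
  # every combination that involves the first level of either factor
  level1 <- rbind(
    data.table::CJ(h2o.levels(df[,col1])[1], h2o.levels(df[,col2]))
    ,data.table::CJ(h2o.levels(df[,col1]), h2o.levels(df[,col2])[1]))
  level1 <- paste(level1$V1, level1$V2, sep='_')
  # merge those combinations into a single reference level, '00000'
  df[,column] <- h2o.ifelse(df[,column] %in% level1, '00000', df[,column])
  return(df)
}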

# Run the H2OFrame through the function
interact.hex2 <- collapse_level1_interacts(interact.hex, "Age_Sex", "Age", "Sex")

# Verify that we have only 2 levels for interaction
h2o.levels(interact.hex2$Age_Sex)
[1] "00000"      "Child_Male"

# Verify that we have only 1 dummy for the interaction
interact.h2o.glm <- h2o.glm(2:ncol(interact.hex2)
                            ,"Survived"
                            ,interact.hex2
                            ,family = 'binomial'
                            ,lambda = 0)
h2o.varimp(interact.h2o.glm)$names
[1] "Age.Child"          "Sex.Male"           "Age_Sex.Child_Male" ""
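The collapsing logic itself does not depend on H2O. Below is a minimal base R sketch of the same idea on plain factors, which can be run without a cluster. The function name collapse_reference_levels and the toy vectors are hypothetical, and the factor levels are set explicitly to Adult/Child and Female/Male on the assumption that this mirrors the level order implied by the output above.

```r
# Collapse every interaction level that involves the first (reference)
# level of either parent factor into one placeholder level.
collapse_reference_levels <- function(interaction, f1, f2,
                                      sep = "_", ref_label = "00000") {
  ref1 <- levels(f1)[1]
  ref2 <- levels(f2)[1]
  # all combinations containing a reference level
  combos <- unique(c(paste(ref1, levels(f2), sep = sep),
                     paste(levels(f1), ref2, sep = sep)))
  out <- as.character(interaction)
  out[out %in% combos] <- ref_label
  factor(out)
}

# Toy data with explicit level order (assumed, see the note above)
age <- factor(c("Child", "Adult", "Child", "Adult"), levels = c("Adult", "Child"))
sex <- factor(c("Male", "Male", "Female", "Female"), levels = c("Female", "Male"))
age_sex <- factor(paste(age, sex, sep = "_"))

collapsed <- collapse_reference_levels(age_sex, age, sex)
nlevels(collapsed)   # 2: the placeholder plus one real combination
```

Only the combination in which neither factor is at its reference level survives as its own level, so a downstream one-hot encoder that drops one level per factor is left with a single interaction dummy.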
