
Efficient and generic recoding of categorical variables in R

Imagine a data.table like this:

library(data.table)
DT = data.table(values=c('call', NA, 'letter', 'call', 'e-mail', 'phone'))
print(DT)

   values
1:   call
2:   <NA>
3: letter
4:   call
5: e-mail
6:  phone

I wish to recode the values by the following mapping:

mappings = list(
  'by_phone' = c('call', 'phone'),
  'by_web' = c('e-mail', 'web-meeting')
)

I.e., I want to transform call into by_phone, etc. NA should be recoded as missing, and values not covered by the provided mapping should be recoded as other.

For this particular data table I could simply solve my problem with the following:

recode_group <- function(values, mappings){
  ifelse(values %in% unlist(mappings[1]), names(mappings)[1], 
         ifelse(values %in% unlist(mappings[2]), names(mappings)[2], 
                ifelse(is.na(values), 'missing', 'other')
         )
    )
}
DT[, recoded_group:=recode_group(values, mappings)]
print(DT)

   values recoded_group
1:   call      by_phone
2:   <NA>       missing
3: letter         other
4:   call      by_phone
5: e-mail        by_web
6:  phone      by_phone

But I am looking for an efficient and generic recode_group functionality. Any suggestions?

Here's an option with an update-join approach:

DT[stack(mappings), on = "values", recoded_group := ind]
DT[is.na(values), recoded_group := "missing"]

DT
#   values recoded_group
#1:   call      by_phone
#2:     NA       missing
#3: letter            NA
#4:   call      by_phone
#5: e-mail        by_web
#6:  phone      by_phone
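
Note that in the output above letter is still NA, whereas the question asks for other. One possible way to close that gap and make the approach reusable is to wrap the update-join in a small function. This is only a sketch: the name recode_by_join and its arguments (col, missing_label, other_label, new_col) are made up for illustration and are not part of the original answer or of data.table. It also builds the lookup table by hand instead of using stack(), so that the new column is character rather than factor.

```r
library(data.table)

# Hypothetical helper (not from the original answer): recode one column of DT
# via an update-join against a flattened lookup table, then label the rest.
recode_by_join <- function(DT, col, mappings,
                           missing_label = "missing",
                           other_label = "other",
                           new_col = "recoded_group") {
  # Flatten the named list into a lookup table: original value -> group name
  lookup <- data.table(
    value = unlist(mappings, use.names = FALSE),
    group = rep(names(mappings), lengths(mappings))
  )
  setnames(lookup, "value", col)

  # Start from an all-NA character column so reruns do not keep stale labels
  DT[, (new_col) := NA_character_]

  # Update-join: rows of DT that match the lookup get their group by reference
  DT[lookup, on = col, (new_col) := i.group]

  # True NAs in the source column become missing_label
  DT[is.na(DT[[col]]), (new_col) := missing_label]

  # Anything still unlabelled was not covered by the mapping
  DT[is.na(DT[[new_col]]), (new_col) := other_label]

  DT[]
}

DT <- data.table(values = c('call', NA, 'letter', 'call', 'e-mail', 'phone'))
mappings <- list(
  'by_phone' = c('call', 'phone'),
  'by_web' = c('e-mail', 'web-meeting')
)
recode_by_join(DT, "values", mappings)
```

Because the join and the two follow-up assignments all use :=, DT is modified by reference and the function works for any number of groups in mappings. The sketch assumes that col is not itself named group and that each value appears in at most one group.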
