Create a summarizing variable for multiple columns in data.table (R)

I have the following data.table:

library(data.table)

dt <- data.table(id=c(1,2,2,2,3,3,4),
                 date=c("2019-09-13", "2018-12-06", "2017-12-14", "2018-02-08", "2015-12-06", "2012-12-14", "2011-02-08"),
                 variable_1=c("a","b",NA,NA,"b","c",NA),
                 variable_2=c(NA,NA,"a",NA,"a","c",NA),
                 variable_3=c(NA,NA,NA,"b","c","c",NA))
dt
   id       date variable_1 variable_2 variable_3
1:  1 2019-09-13          a       <NA>       <NA>
2:  2 2018-12-06          b       <NA>       <NA>
3:  2 2017-12-14       <NA>          a       <NA>
4:  2 2018-02-08       <NA>       <NA>          b
5:  3 2015-12-06          b          a          c
6:  3 2012-12-14          c          c          c
7:  4 2011-02-08       <NA>       <NA>       <NA>

I want to create a variable y that summarizes all of these columns. Every row that has at least one non-NA value among the variable_* columns should get 0. Every row in which all of the variable_* columns are NA should get 1. Like this:

   id       date variable_1 variable_2 variable_3 y
1:  1 2019-09-13          a       <NA>       <NA> 0
2:  2 2018-12-06          b       <NA>       <NA> 0
3:  2 2017-12-14       <NA>          a       <NA> 0
4:  2 2018-02-08       <NA>       <NA>          b 0
5:  3 2015-12-06          b          a          c 0
6:  3 2012-12-14          c          c          c 0
7:  4 2011-02-08       <NA>       <NA>       <NA> 1

In the original data.table I am looking at 22 variables out of 830 in total, so I would prefer not to check each of variable_1 to variable_22 separately. Is there a way to do this in data.table?

dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = patterns("^variable_")]
#    id       date variable_1 variable_2 variable_3 y
# 1:  1 2019-09-13          a       <NA>       <NA> 0
# 2:  2 2018-12-06          b       <NA>       <NA> 0
# 3:  2 2017-12-14       <NA>          a       <NA> 0
# 4:  2 2018-02-08       <NA>       <NA>          b 0
# 5:  3 2015-12-06          b          a          c 0
# 6:  3 2012-12-14          c          c          c 0
# 7:  4 2011-02-08       <NA>       <NA>       <NA> 1
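Since the real table has 830 columns, a regex such as "^variable_" could in principle match more than the 22 columns of interest. As a hedged alternative, here is a minimal sketch that passes an explicit character vector to .SDcols instead. The name cols is only illustrative, and the sketch assumes the columns really are named variable_1 through variable_N (3 in this toy data, 22 in the real one).

```r
library(data.table)

dt <- data.table(id = c(1, 2, 2, 2, 3, 3, 4),
                 variable_1 = c("a", "b", NA, NA, "b", "c", NA),
                 variable_2 = c(NA, NA, "a", NA, "a", "c", NA),
                 variable_3 = c(NA, NA, NA, "b", "c", "c", NA))

# Illustrative helper vector (not from the original post): build the exact
# column names. For the real data this would be paste0("variable_", 1:22).
cols <- paste0("variable_", 1:3)

# Same expression as above, but restricted to exactly the named columns.
dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = cols]
```

The result should match the patterns() version whenever the pattern and the explicit vector select the same columns.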

Walk-through:

  • .SDcols = patterns(...) defines which columns are available as .SD in the j component. It does not remove or select columns for the output; it only determines the ones that are referenced internally.
  • !is.na(.SD) returns a logical matrix with the same dimensions as .SD, where TRUE marks a value that is not NA.
  • rowSums(...) returns the count of non-NA values in each row.
  • Counting the non-NA values (rather than the NA values) means we don't need to know how many columns are being processed: a row is entirely NA exactly when that count is 0, which is what allows me to use == 0L.
  • +(...) is a shorthand trick for converting a logical vector to integer 0/1.
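To make the steps concrete, the one-liner can be unrolled into intermediate objects. This is only an illustrative sketch: the names sd, non_na, counts, all_na and y are introduced here for inspection and are not part of the answer's code.

```r
library(data.table)

dt <- data.table(id = c(1, 2, 2, 2, 3, 3, 4),
                 variable_1 = c("a", "b", NA, NA, "b", "c", NA),
                 variable_2 = c(NA, NA, "a", NA, "a", "c", NA),
                 variable_3 = c(NA, NA, NA, "b", "c", "c", NA))

# Step 1: the subset that .SD refers to inside j.
sd <- dt[, .SD, .SDcols = patterns("^variable_")]

# Step 2: logical matrix, TRUE where a value is present (not NA).
non_na <- !is.na(sd)

# Step 3: number of non-NA values per row.
counts <- rowSums(non_na)

# Step 4: a row is "all NA" exactly when it has zero non-NA values.
all_na <- counts == 0L

# Step 5: unary plus coerces TRUE/FALSE to 1/0.
y <- +all_na
```

Printing counts shows why the comparison against 0 is independent of the number of columns: only the last row has no non-NA values at all.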
