Create a summary variable for multiple columns in a data.table (R)

Question

I have the following data.table:

library(data.table)
dt <- data.table(id=c(1,2,2,2,3,3,4),
                 date=c("2019-09-13", "2018-12-06", "2017-12-14", "2018-02-08", "2015-12-06", "2012-12-14", "2011-02-08"),
                 variable_1=c("a","b",NA,NA,"b","c",NA),
                 variable_2=c(NA,NA,"a",NA,"a","c",NA),
                 variable_3=c(NA,NA,NA,"b","c","c",NA))
dt
   id       date variable_1 variable_2 variable_3
1:  1 2019-09-13          a       <NA>       <NA>
2:  2 2018-12-06          b       <NA>       <NA>
3:  2 2017-12-14       <NA>          a       <NA>
4:  2 2018-02-08       <NA>       <NA>          b
5:  3 2015-12-06          b          a          c
6:  3 2012-12-14          c          c          c
7:  4 2011-02-08       <NA>       <NA>       <NA>

I want to create a variable y that summarizes all of the variable columns. Every row that has at least one non-NA value among the variables should be 0. Every row that has only NA across all the variables should be 1. Like this:

   id       date variable_1 variable_2 variable_3 y
1:  1 2019-09-13          a       <NA>       <NA> 0
2:  2 2018-12-06          b       <NA>       <NA> 0
3:  2 2017-12-14       <NA>          a       <NA> 0
4:  2 2018-02-08       <NA>       <NA>          b 0
5:  3 2015-12-06          b          a          c 0
6:  3 2012-12-14          c          c          c 0
7:  4 2011-02-08       <NA>       <NA>       <NA> 1

In the original data.table I am looking at 22 variables out of 830 variables in total, so I would prefer not to check each variable from _1 to _22 separately. Is there a way to do this in data.table?

Answer 1

dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = patterns("^variable_")]
#    id       date variable_1 variable_2 variable_3 y
# 1:  1 2019-09-13          a       <NA>       <NA> 0
# 2:  2 2018-12-06          b       <NA>       <NA> 0
# 3:  2 2017-12-14       <NA>          a       <NA> 0
# 4:  2 2018-02-08       <NA>       <NA>          b 0
# 5:  3 2015-12-06          b          a          c 0
# 6:  3 2012-12-14          c          c          c 0
# 7:  4 2011-02-08       <NA>       <NA>       <NA> 1
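With 830 columns, a prefix pattern might also match columns that should not count. As a minimal sketch, assuming the 22 target columns are named variable_1 to variable_22, the names can be built explicitly and passed to .SDcols. The toy table, the variable_other column and the name target_cols below are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical): variable_other shares the prefix but should be ignored
dt <- data.table(id = 1:3,
                 variable_1 = c("a", NA, NA),
                 variable_2 = c(NA, NA, "b"),
                 variable_3 = c(NA, NA, NA_character_),
                 variable_other = c("x", "y", "z"))

# Build the exact column names; use 1:22 for the real data
target_cols <- paste0("variable_", 1:3)

# y is 1 only when every target column is NA in that row
dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = target_cols]
dt
```

A stricter regular expression such as patterns("^variable_[0-9]+$") would likely serve the same purpose while keeping the pattern-based style.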

Walk-through:

.SDcols=patterns(...) defines the columns to be processed as .SD in the j component. This does not remove or select columns for the output; it only determines which ones are referenced internally.
is.na(.SD) returns a logical matrix with the same dimensions as .SD, indicating whether each value is NA; the leading ! inverts it, so TRUE marks a non-NA value.
rowSums(...) then returns the count of non-NA values in each row.
Using the inverted logic of "count the number of non-NA values in a row", we don't need to care about the number of columns being processed; this is what allows me to use == 0L.
+(...) is a shorthand trick for converting logical to integer 0/1.
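The walk-through can be reproduced one step at a time outside of j. This is only an illustrative sketch on a small made-up table; the intermediate names (cols, na_mat, non_na_n, all_na, y_steps) are hypothetical and not part of the answer.

```r
library(data.table)

# Small made-up table for illustration
dt <- data.table(id = c(1, 2, 2, 3),
                 variable_1 = c("a", NA, "b", NA),
                 variable_2 = c(NA, NA, "c", NA))

# Same column selection that patterns("^variable_") performs
cols <- grep("^variable_", names(dt), value = TRUE)

na_mat   <- is.na(dt[, ..cols])  # logical matrix: TRUE where a value is NA
non_na_n <- rowSums(!na_mat)     # number of non-NA values in each row
all_na   <- non_na_n == 0L       # TRUE only when every selected column is NA
y_steps  <- +all_na              # unary plus converts TRUE/FALSE to 1/0

# The one-liner from the answer gives the same result
dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = patterns("^variable_")]
stopifnot(all(dt$y == y_steps))
```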

Answer 1: 3, accepted, 2020-08-06 22:36:55