
data.table: calculate sums by two variables and add observations for "empty" groups

Sorry for the bad title. I am trying to achieve the following: I have a data.table dt with two categorical variables "a" and "b". As the counts below show, a has 5 unique values and b has 3. Some combinations of the two, e.g. (a = 1, b = 3), do not occur in the data.

library(data.table) 
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- rnorm(10)

dt <- data.table(a = a, b = b, y = y)
dt[order(a, b), .N, by = c("a", "b")]

#   a b N
#1: 1 1 2
#2: 1 2 1
#3: 2 2 1
#4: 2 3 1
#5: 3 1 1
#6: 3 2 1
#7: 3 3 1
#8: 4 1 1
#9: 5 2 1
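
To see which combinations are absent, one option is an anti-join against the full grid of observed values. This is a minimal sketch on hypothetical toy data (toy, all_combos and missing_combos are illustrative names, not objects from the question):

```r
library(data.table)

# Hypothetical toy data (not the question's dt): (a = 1, b = 2) never occurs
toy <- data.table(a = c(1L, 1L, 2L, 2L),
                  b = c(1L, 1L, 1L, 2L),
                  y = c(0.5, 1.5, 2, 3))

# All combinations of the observed values of a and b
all_combos <- CJ(a = toy$a, b = toy$b, unique = TRUE)

# Anti-join: keep only the combinations that have no matching row in toy
missing_combos <- all_combos[!toy, on = c("a", "b")]
missing_combos
```

Applied to dt, the same pattern should list the six (a, b) pairs that the count table does not contain.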

If I simply sum y by "a" and "b", groups such as (a = 1, b = 3) are silently dropped from the result:

group_sum <- dt[, lapply(.SD, sum), by = c("a", "b")]
group_sum

#   a b          y
#1: 1 1 -0.7702614
#2: 4 1 -0.2894616
#3: 1 2 -0.2992151
#4: 2 2 -0.4115108
#5: 5 2  0.2522234
#6: 3 2 -0.8919211
#7: 2 3  0.4356833
#8: 3 1 -1.2375384
#9: 3 3 -0.2242679

Is there a built-in way in data.table to "keep" such missing groups and assign them either 0 or NA?

One way to achieve my goal would be to create a grid of all combinations and merge it with the sums in a second step:

grid <- expand.grid(a = unique(dt$a), b = unique(dt$b)) # all 15 (a, b) combinations
setDT(grid)

res <- merge(grid, group_sum, by = c("a", "b"), all.x = TRUE)
head(res)

#   a b          y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3         NA
#4: 2 1         NA
#5: 2 2 -0.4115108
#6: 2 3  0.4356833
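
If 0 is wanted instead of NA, the merged table can be patched in place afterwards. A minimal sketch of the same grid-and-merge idea on hypothetical toy data (all toy_* names are illustrative):

```r
library(data.table)

# Hypothetical toy data: the combination (a = 1, b = 2) is absent
toy <- data.table(a = c(1L, 1L, 2L, 2L),
                  b = c(1L, 1L, 1L, 2L),
                  y = c(0.5, 1.5, 2, 3))

# Sums for the groups that exist
toy_sum <- toy[, .(y = sum(y)), by = .(a, b)]

# Grid of all combinations of the observed values
toy_grid <- expand.grid(a = unique(toy$a), b = unique(toy$b))
setDT(toy_grid)

# Left join: unmatched combinations get y = NA
toy_res <- merge(toy_grid, toy_sum, by = c("a", "b"), all.x = TRUE)

# Replace the NA of the empty groups with 0, by reference
toy_res[is.na(y), y := 0]
toy_res
```

This treats every NA as an empty group, so it only fits when y itself has no missing values.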

One way of going about this is to do a keyed cross-join with the CJ() function and then use by = .EACHI to indicate that j should be evaluated for every row in i.

library(data.table)

set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- rnorm(10)

dt <- data.table(a = a, b = b, y = y)
setkeyv(dt, c("a", "b"))

dt[CJ(a, b, unique = TRUE), lapply(.SD, sum), by = .EACHI]
#>     a b          y
#>  1: 1 1 -0.7702614
#>  2: 1 2 -0.2992151
#>  3: 1 3         NA
#>  4: 2 1         NA
#>  5: 2 2 -0.4115108
#>  6: 2 3  0.4356833
#>  7: 3 1 -1.2375384
#>  8: 3 2 -0.8919211
#>  9: 3 3 -0.2242679
#> 10: 4 1 -0.2894616
#> 11: 4 2         NA
#> 12: 4 3         NA
#> 13: 5 1         NA
#> 14: 5 2  0.2522234
#> 15: 5 3         NA

Created on 2020-10-03 by the reprex package (v0.3.0)

If you want to skip the key-setting step, you can set the on argument instead:

dt <- data.table(a = a, b = b, y = y) # Set no key
dt[CJ(a, b, unique = TRUE), lapply(.SD, sum), by = .EACHI, on = c("a", "b")]
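
The question also asks about getting 0 rather than NA. One possible sketch, again on hypothetical toy data (toy and toy_zero are illustrative names): with by = .EACHI, .N is 0 for rows of i that match nothing, and sum(y, na.rm = TRUE) turns the single NA of an unmatched row into 0.

```r
library(data.table)

# Hypothetical toy data: the combination (a = 1, b = 2) is absent
toy <- data.table(a = c(1L, 1L, 2L, 2L),
                  b = c(1L, 1L, 1L, 2L),
                  y = c(0.5, 1.5, 2, 3))

# Join toy to every (a, b) combination and aggregate once per row of i
toy_zero <- toy[CJ(a = toy$a, b = toy$b, unique = TRUE),
                .(y = sum(y, na.rm = TRUE), # 0 for combinations with no rows
                  n = .N),                  # number of matched rows, 0 if none
                by = .EACHI, on = c("a", "b")]
toy_zero
```

Note that na.rm = TRUE would also hide genuine NA values in y. Keeping the n column makes it possible to tell an empty group from a group whose values sum to 0.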

You can also use dplyr and tidyr with the complete() function. Summarise by group first and then complete the result, since complete() works within groups and so cannot add combinations of the grouping columns themselves:

library(dplyr)
library(tidyr)
dt %>%
  group_by(a, b) %>%
  summarise(y = sum(y), .groups = "drop") %>%
  complete(a, b)

This returns one row for each of the 15 combinations of a and b, with the same sums as the data.table result above and NA for the six combinations that do not occur in the data.
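
If 0 is preferred over NA in the tidyverse version, complete() accepts a fill argument. A minimal sketch on hypothetical toy data (toy and toy_filled are illustrative names), summarising before completing:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the combination (a = 1, b = 2) is absent
toy <- tibble(a = c(1L, 1L, 2L, 2L),
              b = c(1L, 1L, 1L, 2L),
              y = c(0.5, 1.5, 2, 3))

toy_filled <- toy %>%
  group_by(a, b) %>%
  summarise(y = sum(y), .groups = "drop") %>% # one row per existing group
  complete(a, b, fill = list(y = 0))          # add missing combinations with y = 0
toy_filled
```

Leaving out fill keeps the default behaviour, where the added rows get NA.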
