Using data.table for a conditional sum within groups with lapply

I have a data.table where each row is an event with a start date and an end date, and the number of days between start and end varies. For each event, I am trying to count how many other events have already ended by the time it begins. I can do this using lapply, but when I try to do it per group with data.table's by argument, I don't get the expected output. Sample code below:

library(data.table)

DT <- data.table(
  start = as.Date(c("2018-07-01","2018-07-03","2018-07-06","2018-07-08","2018-07-12","2018-07-15")),
  end = as.Date(c("2018-07-10","2018-07-04","2018-07-09","2018-07-20","2018-07-14","2018-07-27")),
  group_id = c("a", "a", "a", "b", "b", "b"))

# This produces the expected output (0,0,1,1,3,4):
lapply(DT$start, function(x) sum(x > DT$end))

# This also works using data.table:
DT[, count := lapply(DT$start, function(x) sum(x > DT$end))]

# However, I don't get the expected output (0,0,1,0,0,1) when I attempt to do this by group_id
DT[, count_by_group := lapply(DT$start, function(x) sum(x > DT$end)), by = group_id]

This gives the following output, where count_by_group is not the expected result:

        start        end group_id count count_by_group
1: 2018-07-01 2018-07-10        a     0              0
2: 2018-07-03 2018-07-04        a     0              0
3: 2018-07-06 2018-07-09        a     1              0
4: 2018-07-08 2018-07-20        b     1              0
5: 2018-07-12 2018-07-14        b     3              0
6: 2018-07-15 2018-07-27        b     4              0

Can someone help me understand how by changes the behavior? I've also tried different versions of the .SD feature, but wasn't able to get that to work either.

unlist()

Wrapping the lapply() call in unlist() works as well:

DT[, count_by_group := unlist(lapply(start, function(x) sum(x > end))), by = group_id]

Non-equi join

Alternatively, this can be solved by aggregating in a non-equi self join:

DT[, count_by_group := DT[DT, on = .(group_id, end < start), .N, by = .EACHI]$N]
DT
        start        end group_id count_by_group
1: 2018-07-01 2018-07-10        a              0
2: 2018-07-03 2018-07-04        a              0
3: 2018-07-06 2018-07-09        a              1
4: 2018-07-08 2018-07-20        b              0
5: 2018-07-12 2018-07-14        b              0
6: 2018-07-15 2018-07-27        b              1
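To see what the non-equi join computes before it is assigned back, the intermediate result can be inspected on its own. The following is a minimal sketch that rebuilds the sample data from the question; the name `counts` is introduced here only for illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  start = as.Date(c("2018-07-01","2018-07-03","2018-07-06","2018-07-08","2018-07-12","2018-07-15")),
  end = as.Date(c("2018-07-10","2018-07-04","2018-07-09","2018-07-20","2018-07-14","2018-07-27")),
  group_id = c("a", "a", "a", "b", "b", "b"))

# Self join: for each row of the DT passed as i (inside the brackets),
# match rows of the outer DT in the same group whose end date is
# strictly before that row's start date.
# by = .EACHI evaluates .N once per row of i, i.e. it counts the matches.
counts <- DT[DT, on = .(group_id, end < start), .N, by = .EACHI]
print(counts)

# The result has one row per original row, in the original row order
stopifnot(nrow(counts) == nrow(DT))
```

Because the result has exactly one row per row of i, its N column can be assigned directly with :=.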

Benchmark

The non-equi join is also the fastest method for cases with more than a few hundred rows:

library(bench)
bm <- press(
  n_grp = c(2L, 5L, 10L),
  n_row = 10^(2:4),
  {
    set.seed(1L)
    DT = data.table(
      group_id = sample.int(n_grp, n_row, TRUE),
      start = as.Date("2018-07-01") + rpois(n_row, 20L))
    DT[, end := start + rpois(n_row, 10L)]
    setorder(DT, group_id, start, end)
    mark(
      unlist = copy(DT)[, count_by_group := unlist(lapply(start, function(x) sum(x > end))), by = group_id],
      sapply = copy(DT)[, count_by_group := sapply(start, function(x) sum(x > end)), by = group_id],
      vapply = copy(DT)[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id],
      nej = copy(DT)[, count_by_group := DT[DT, on = .(group_id, end < start), .N, by = .EACHI]$N]
    )
  }
)
ggplot2::autoplot(bm)

For 10000 rows, the non-equi join is about 10 times faster than the other methods.

As DT is being updated, copy() is used to create a fresh, unmodified copy of DT for each benchmark run.
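The reason copy() matters here is that := modifies a data.table by reference. A small sketch of that behavior, using hypothetical tables A, B and C that are not part of the original benchmark:

```r
library(data.table)

A <- data.table(x = 1:3)
B <- A          # plain assignment: B refers to the same table as A
C <- copy(A)    # copy(): C is an independent table

# := adds the column in place, so it is visible through B as well
A[, y := x * 2L]

print("y" %in% names(B))
print("y" %in% names(C))
```

Without copy(), every benchmark expression would operate on a table that an earlier expression had already modified.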

DT[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id]

To refer to start and end by group, we need to leave the DT$ prefix out: DT$start always returns the full column, regardless of by.
We use vapply() rather than lapply() because if the right hand side of := is a list, it is interpreted as a list of columns. Since only one column is expected, only the first element, a 0, is taken into account and recycled.
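The scoping difference can be checked directly. The sketch below rebuilds the sample data from the question; the name `sizes` and its column names are made up for illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  start = as.Date(c("2018-07-01","2018-07-03","2018-07-06","2018-07-08","2018-07-12","2018-07-15")),
  end = as.Date(c("2018-07-10","2018-07-04","2018-07-09","2018-07-20","2018-07-14","2018-07-27")),
  group_id = c("a", "a", "a", "b", "b", "b"))

# Inside j with by, a bare column name sees only the current group,
# whereas DT$start always returns the full column.
sizes <- DT[, .(n_in_group = length(start), n_full = length(DT$start)), by = group_id]
print(sizes)

# vapply() returns a plain integer vector with one value per row of the group,
# so := treats it as a single column rather than a list of columns.
DT[, count_by_group := vapply(start, function(x) sum(x > end), integer(1)), by = group_id]
print(DT$count_by_group)
```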
