Most recent date by id, data.table R

For each id, I need to get the most recent date on which the variable reason equals y.

Here is a sample of the data:

    id group      start        end reason       prom
 1:  1     a 2009-01-01 2009-12-31      x 2016-08-15
 2:  1     a 2010-01-01 2010-12-31      x 2016-08-15
 3:  1     b 2010-01-01 2010-12-31      x 2016-08-15
 4:  1     b 2011-01-01 2011-12-31      x 2016-08-15
 5:  1     b 2012-01-01 2012-12-31      x 2016-08-15
 6:  1     a 2012-01-01 2012-12-31      x 2016-08-15
 7:  1     a 2013-01-01 2013-02-14      x 2016-08-15
 8:  1     a 2013-02-15 2013-05-31      x 2016-08-15
 9:  1     a 2013-06-01 2013-12-31      y 2016-08-15
10:  1     a 2014-01-01 2014-12-31      x 2016-08-15
11:  1     a 2015-01-01 2015-12-31      x 2016-08-15
12:  1     a 2016-01-01 2016-08-14      x 2016-08-15
13:  1     a 2016-08-15 2016-12-31      y 2016-08-15
14:  1     a 2017-01-01 2017-12-31      x 2016-08-15
15:  1     a 2018-01-01 2018-12-31      x 2016-08-15
16:  1     a 2019-01-01 9999-12-31      x 2016-08-15
17:  2     a 2009-01-01 2009-12-31      x 2016-08-15
18:  2     a 2010-01-01 2010-12-31      x 2016-08-15
19:  2     a 2011-01-01 2011-01-14      x 2016-08-15
20:  2     a 2011-01-15 2011-12-31      y 2016-08-15
21:  2     a 2012-01-01 2012-12-31      x 2016-08-15
22:  2     a 2013-01-01 2013-07-14      x 2016-08-15
23:  2     a 2013-07-15 2013-12-31      y 2016-08-15
24:  2     a 2014-01-01 2014-12-31      x 2016-08-15
25:  2     a 2015-01-01 2015-12-31      x 2016-08-15

What I have tried:

setDT(b)[, prom := last(as.Date(b$start[b$reason == "y"]), order_by = b$start[b$reason == "y"]), by = .(id)]

and,

setDT(b)[, prom := max(as.Date(b$start[b$reason == "y"])), by =.(id)]

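As you can see, I cannot get the result by id. When id is 1, prom should be 2016-08-15, and when id is 2 it should be 2013-07-15.

Any hints about what I am doing wrong would be appreciated.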

The following should work: (1) convert start to a Date column, and (2) define prom as the last date in start where reason == "y", grouped by id.

library(data.table)

setDT(b)[, start := as.Date(start)][, prom := last(start[reason == "y"]), by = "id"][]
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#>     id group      start        end reason       prom
#>  1:  1     a 2009-01-01 2009-12-31      x 2016-08-15
#>  2:  1     a 2010-01-01 2010-12-31      x 2016-08-15
#>  3:  1     b 2010-01-01 2010-12-31      x 2016-08-15
#>  4:  1     b 2011-01-01 2011-12-31      x 2016-08-15
#>  5:  1     b 2012-01-01 2012-12-31      x 2016-08-15
#>  6:  1     a 2012-01-01 2012-12-31      x 2016-08-15
#>  7:  1     a 2013-01-01 2013-02-14      x 2016-08-15
#>  8:  1     a 2013-02-15 2013-05-31      x 2016-08-15
#>  9:  1     a 2013-06-01 2013-12-31      y 2016-08-15
#> 10:  1     a 2014-01-01 2014-12-31      x 2016-08-15
#> 11:  1     a 2015-01-01 2015-12-31      x 2016-08-15
#> 12:  1     a 2016-01-01 2016-08-14      x 2016-08-15
#> 13:  1     a 2016-08-15 2016-12-31      y 2016-08-15
#> 14:  1     a 2017-01-01 2017-12-31      x 2016-08-15
#> 15:  1     a 2018-01-01 2018-12-31      x 2016-08-15
#> 16:  1     a 2019-01-01 9999-12-31      x 2016-08-15
#> 17:  2     a 2009-01-01 2009-12-31      x 2013-07-15
#> 18:  2     a 2010-01-01 2010-12-31      x 2013-07-15
#> 19:  2     a 2011-01-01 2011-01-14      x 2013-07-15
#> 20:  2     a 2011-01-15 2011-12-31      y 2013-07-15
#> 21:  2     a 2012-01-01 2012-12-31      x 2013-07-15
#> 22:  2     a 2013-01-01 2013-07-14      x 2013-07-15
#> 23:  2     a 2013-07-15 2013-12-31      y 2013-07-15
#> 24:  2     a 2014-01-01 2014-12-31      x 2013-07-15
#> 25:  2     a 2015-01-01 2015-12-31      x 2013-07-15
#>     id group      start        end reason       prom

Note: if the dates in the start column were not already sorted (as they are in the sample data set), we could use:

setDT(b)[, start := as.Date(start)][order(start), prom := last(start[reason == "y"]), by = "id"]

Data

b <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    group = c("a", "a", "b", "b", "b", "a", "a", "a", "a", "a", 
    "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", 
    "a", "a", "a"), start = c("2009-01-01", "2010-01-01", "2010-01-01", 
    "2011-01-01", "2012-01-01", "2012-01-01", "2013-01-01", "2013-02-15", 
    "2013-06-01", "2014-01-01", "2015-01-01", "2016-01-01", "2016-08-15", 
    "2017-01-01", "2018-01-01", "2019-01-01", "2009-01-01", "2010-01-01", 
    "2011-01-01", "2011-01-15", "2012-01-01", "2013-01-01", "2013-07-15", 
    "2014-01-01", "2015-01-01"), end = c("2009-12-31", "2010-12-31", 
    "2010-12-31", "2011-12-31", "2012-12-31", "2012-12-31", "2013-02-14", 
    "2013-05-31", "2013-12-31", "2014-12-31", "2015-12-31", "2016-08-14", 
    "2016-12-31", "2017-12-31", "2018-12-31", "9999-12-31", "2009-12-31", 
    "2010-12-31", "2011-01-14", "2011-12-31", "2012-12-31", "2013-07-14", 
    "2013-12-31", "2014-12-31", "2015-12-31"), reason = c("x", 
    "x", "x", "x", "x", "x", "x", "x", "y", "x", "x", "x", "y", 
    "x", "x", "x", "x", "x", "x", "y", "x", "x", "y", "x", "x"
    )), row.names = c(NA, -25L), class = "data.frame")
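
As a side note on why the attempts in the question return the same date for every id: b$start and b$reason refer to the whole columns of b, so the by = grouping has no effect on them, whereas bare column names are evaluated within each group. A minimal sketch of the difference, using a small made-up table (toy and its values are illustrative, not the data above):

```r
library(data.table)

# Made-up table for illustration: two ids, each with two "y" rows.
toy <- data.table(
  id     = c(1L, 1L, 1L, 2L, 2L, 2L),
  start  = as.Date(c("2013-06-01", "2016-08-15", "2017-01-01",
                     "2011-01-15", "2013-07-15", "2014-01-01")),
  reason = c("y", "y", "x", "y", "y", "x")
)

# toy$start is the full column, so every group receives the overall maximum.
toy[, wrong := max(toy$start[toy$reason == "y"]), by = id]

# Bare column names are subset to the current group before j is evaluated.
toy[, right := max(start[reason == "y"]), by = id]

unique(toy[, .(id, wrong, right)])
```

Here wrong is 2016-08-15 for both ids, while right is 2016-08-15 for id 1 and 2013-07-15 for id 2.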

For each id, we can take the start dates where reason == 'y' and then take their max:

library(data.table)
# df is assumed to be the sample data as a data.table, e.g. df <- as.data.table(b)
df[, .(max = max(start[reason == 'y'])), by = id]

#   id        max
#1:  1 2016-08-15
#2:  2 2013-07-15

If you want to add it as a new column to the existing data frame:

df[, max := max(start[reason == 'y']), by = id]

Using dplyr, we can do this:

library(dplyr)
df %>%
  group_by(id) %>%
  mutate(max_date = max(start[reason == 'y']))

Make sure the start column is of class "Date".
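
A minimal sketch of that check, assuming start arrives as character strings in ISO format (as in the sample data). The table toy and its values are made up for illustration:

```r
library(data.table)

# Made-up table; start is stored as character here.
toy <- data.table(
  id     = c(1L, 1L, 2L, 2L),
  start  = c("2013-06-01", "2016-08-15", "2011-01-15", "2013-07-15"),
  reason = c("y", "y", "y", "x")
)

# Convert the column once, by reference, and confirm its class before aggregating.
toy[, start := as.Date(start)]
stopifnot(inherits(toy$start, "Date"))

# Latest start date with reason == "y", repeated on every row of the id.
toy[, max_date := max(start[reason == "y"]), by = id]
```

With this toy data, max_date is 2016-08-15 for id 1 and 2011-01-15 for id 2, because the 2013-07-15 row of id 2 has reason "x".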

I usually do this in two steps:

dt[reason == "y", max.date := max(start), by = id]
dt[, max.date := max(max.date, na.rm = TRUE), by = id]

That is because I did not know you could filter one column based on another inside j, the way @Ronak did.
