Most recent date by id, data.table R
I need to get the most recent date, by id, where the variable reason equals y.
Here is a sample of the data:
id group start end reason prom
1: 1 a 2009-01-01 2009-12-31 x 2016-08-15
2: 1 a 2010-01-01 2010-12-31 x 2016-08-15
3: 1 b 2010-01-01 2010-12-31 x 2016-08-15
4: 1 b 2011-01-01 2011-12-31 x 2016-08-15
5: 1 b 2012-01-01 2012-12-31 x 2016-08-15
6: 1 a 2012-01-01 2012-12-31 x 2016-08-15
7: 1 a 2013-01-01 2013-02-14 x 2016-08-15
8: 1 a 2013-02-15 2013-05-31 x 2016-08-15
9: 1 a 2013-06-01 2013-12-31 y 2016-08-15
10: 1 a 2014-01-01 2014-12-31 x 2016-08-15
11: 1 a 2015-01-01 2015-12-31 x 2016-08-15
12: 1 a 2016-01-01 2016-08-14 x 2016-08-15
13: 1 a 2016-08-15 2016-12-31 y 2016-08-15
14: 1 a 2017-01-01 2017-12-31 x 2016-08-15
15: 1 a 2018-01-01 2018-12-31 x 2016-08-15
16: 1 a 2019-01-01 9999-12-31 x 2016-08-15
17: 2 a 2009-01-01 2009-12-31 x 2016-08-15
18: 2 a 2010-01-01 2010-12-31 x 2016-08-15
19: 2 a 2011-01-01 2011-01-14 x 2016-08-15
20: 2 a 2011-01-15 2011-12-31 y 2016-08-15
21: 2 a 2012-01-01 2012-12-31 x 2016-08-15
22: 2 a 2013-01-01 2013-07-14 x 2016-08-15
23: 2 a 2013-07-15 2013-12-31 y 2016-08-15
24: 2 a 2014-01-01 2014-12-31 x 2016-08-15
25: 2 a 2015-01-01 2015-12-31 x 2016-08-15
What I have tried:
setDT(b)[, prom := last(as.Date(b$start[b$reason == "y"]), order_by = b$start[b$reason == "y"]), by = .(id)]
and,
setDT(b)[, prom := max(as.Date(b$start[b$reason == "y"])), by =.(id)]
As you can see, I can't get the result by id. When id is 1, prom should be 2016-08-15, and when id is 2 it should be 2013-07-15.
Any hints about what I'm doing wrong would be appreciated.
The following should work: (1) convert start to a Date column, and (2) define prom as the last date in start where reason == "y", grouped by id.
library(data.table)
setDT(b)[, start := as.Date(start)][, prom := last(start[reason == "y"]), by = "id"][]
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
#> id group start end reason prom
#> 1: 1 a 2009-01-01 2009-12-31 x 2016-08-15
#> 2: 1 a 2010-01-01 2010-12-31 x 2016-08-15
#> 3: 1 b 2010-01-01 2010-12-31 x 2016-08-15
#> 4: 1 b 2011-01-01 2011-12-31 x 2016-08-15
#> 5: 1 b 2012-01-01 2012-12-31 x 2016-08-15
#> 6: 1 a 2012-01-01 2012-12-31 x 2016-08-15
#> 7: 1 a 2013-01-01 2013-02-14 x 2016-08-15
#> 8: 1 a 2013-02-15 2013-05-31 x 2016-08-15
#> 9: 1 a 2013-06-01 2013-12-31 y 2016-08-15
#> 10: 1 a 2014-01-01 2014-12-31 x 2016-08-15
#> 11: 1 a 2015-01-01 2015-12-31 x 2016-08-15
#> 12: 1 a 2016-01-01 2016-08-14 x 2016-08-15
#> 13: 1 a 2016-08-15 2016-12-31 y 2016-08-15
#> 14: 1 a 2017-01-01 2017-12-31 x 2016-08-15
#> 15: 1 a 2018-01-01 2018-12-31 x 2016-08-15
#> 16: 1 a 2019-01-01 9999-12-31 x 2016-08-15
#> 17: 2 a 2009-01-01 2009-12-31 x 2013-07-15
#> 18: 2 a 2010-01-01 2010-12-31 x 2013-07-15
#> 19: 2 a 2011-01-01 2011-01-14 x 2013-07-15
#> 20: 2 a 2011-01-15 2011-12-31 y 2013-07-15
#> 21: 2 a 2012-01-01 2012-12-31 x 2013-07-15
#> 22: 2 a 2013-01-01 2013-07-14 x 2013-07-15
#> 23: 2 a 2013-07-15 2013-12-31 y 2013-07-15
#> 24: 2 a 2014-01-01 2014-12-31 x 2013-07-15
#> 25: 2 a 2015-01-01 2015-12-31 x 2013-07-15
#> id group start end reason prom
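A likely reason the attempts in the question return the same date for every id: inside [ ... , by = ], an expression such as b$start refers to the whole column of the full table, not to the rows of the current group, so the grouping has no effect on the value. A minimal sketch on a small made-up table (toy, wrong and right are hypothetical names, not from the question):

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  id     = c(1L, 1L, 2L, 2L),
  start  = as.Date(c("2013-06-01", "2016-08-15", "2011-01-15", "2013-07-15")),
  reason = c("y", "y", "y", "y")
)

# toy$start is the full column, so every group sees all four dates
wrong <- toy[, .(prom = max(toy$start[toy$reason == "y"])), by = id]

# bare column names are evaluated within each group
right <- toy[, .(prom = max(start[reason == "y"])), by = id]

print(wrong)  # same date for both ids
print(right)  # one date per id
```

Dropping the b$ prefix is the only change needed to make the grouping take effect.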
Note: if the dates in the start column are not already sorted (as they are in the example dataset), we can use:
setDT(b)[, start := as.Date(start)][order(start), prom := last(start[reason == "y"]), by = "id"]
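To illustrate why the ordering matters, here is a small sketch on made-up data whose start dates are deliberately out of order (toy, prom_unsorted and prom_sorted are hypothetical names). It assumes only data.table is attached, so last() is data.table's version:

```r
library(data.table)

# Hypothetical toy data with start dates deliberately out of order
toy <- data.table(
  id     = c(1L, 1L, 2L, 2L),
  start  = c("2016-08-15", "2013-06-01", "2013-07-15", "2011-01-15"),
  reason = c("y", "y", "y", "y")
)
toy[, start := as.Date(start)]

# last() on unsorted rows picks the last row, not the latest date
toy[, prom_unsorted := last(start[reason == "y"]), by = "id"]

# ordering in i first makes last() see the dates in ascending order
toy[order(start), prom_sorted := last(start[reason == "y"]), by = "id"]

print(toy)
```

Using max() instead of last() would sidestep the ordering question altogether, at the cost of no longer meaning "last row".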
Data
b <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
group = c("a", "a", "b", "b", "b", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a"), start = c("2009-01-01", "2010-01-01", "2010-01-01",
"2011-01-01", "2012-01-01", "2012-01-01", "2013-01-01", "2013-02-15",
"2013-06-01", "2014-01-01", "2015-01-01", "2016-01-01", "2016-08-15",
"2017-01-01", "2018-01-01", "2019-01-01", "2009-01-01", "2010-01-01",
"2011-01-01", "2011-01-15", "2012-01-01", "2013-01-01", "2013-07-15",
"2014-01-01", "2015-01-01"), end = c("2009-12-31", "2010-12-31",
"2010-12-31", "2011-12-31", "2012-12-31", "2012-12-31", "2013-02-14",
"2013-05-31", "2013-12-31", "2014-12-31", "2015-12-31", "2016-08-14",
"2016-12-31", "2017-12-31", "2018-12-31", "9999-12-31", "2009-12-31",
"2010-12-31", "2011-01-14", "2011-12-31", "2012-12-31", "2013-07-14",
"2013-12-31", "2014-12-31", "2015-12-31"), reason = c("x",
"x", "x", "x", "x", "x", "x", "x", "y", "x", "x", "x", "y",
"x", "x", "x", "x", "x", "x", "y", "x", "x", "y", "x", "x"
)), row.names = c(NA, -25L), class = "data.frame")
For each id, we can get the corresponding start dates where reason == 'y' and take the max.
library(data.table)
df[, (max = max(start[reason == 'y'])), by = id]
# id V1
#1: 1 2016-08-15
#2: 2 2013-07-15
If you want to add a new column to the current data frame,
df[, max := max(start[reason == 'y']), by = id]
Using dplyr, we can do
library(dplyr)
df %>%
group_by(id) %>%
mutate(max_date = max(start[reason == 'y']))
Make sure the start column is of class "Date".
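The snippets above assume that start is already a Date and that every id has at least one row with reason == 'y'. A hedged sketch of how both assumptions could be handled in data.table; toy and latest_y are hypothetical names, and the NA fallback is one possible choice, not part of the original answer:

```r
library(data.table)

# Hypothetical toy data: id 3 has no row with reason == "y"
toy <- data.table(
  id     = c(1L, 1L, 2L, 3L),
  start  = c("2013-06-01", "2016-08-15", "2013-07-15", "2014-01-01"),
  reason = c("y", "y", "y", "x")
)

# Convert the character column to Date before comparing
toy[, start := as.Date(start)]
stopifnot(inherits(toy$start, "Date"))

# max() on an empty selection warns and gives a non-finite result,
# so fall back to a missing Date explicitly for ids without any "y" row
toy[, latest_y := if (any(reason == "y")) max(start[reason == "y"]) else as.Date(NA),
    by = id]

print(toy)
```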
I'm used to doing this in two steps:
dt[reason == "y", max.date := max(start), by = id]
dt[, max.date := max(max.date, na.rm = TRUE), by = id]
That's because I didn't know you could filter one column by another the way @Ronak did.
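For anyone following the two-step version, this sketch on made-up data shows what the column looks like between the two steps (the table contents and step1_na are hypothetical, added only for illustration):

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  id     = c(1L, 1L, 1L, 2L, 2L),
  start  = as.Date(c("2013-06-01", "2014-01-01", "2016-08-15",
                     "2013-07-15", "2014-01-01")),
  reason = c("y", "x", "y", "y", "x")
)

# Step 1: only rows with reason == "y" are assigned; the other rows stay NA
dt[reason == "y", max.date := max(start), by = id]
step1_na <- dt[, sum(is.na(max.date))]

# Step 2: spread the per-id value to every row of that id
dt[, max.date := max(max.date, na.rm = TRUE), by = id]

print(dt)
```

Note that step 2 would warn for an id that has no reason == "y" rows at all, since every value in that group would be NA.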