R: min and max dates by ID, with multiple status changes within an ID

Question

I have an animal tracking dataset that looks like this:

 Id         Start       Stop          Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out
 78122      5/15/1988   6/18/1988     In
 78122      6/19/1988   1/12/1989     In
 78122      1/13/1989   2/23/1990     In
 78122      2/24/1990   6/15/1991     Out
 78122      6/16/1991   2/11/1993     Out
 78122      2/12/1993   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/22/1994   1/25/1996     Out
 78122      1/26/1996   11/13/2001    In
 78122      11/14/2001  11/19/2001    In
 78122      11/20/2001  9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

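This animal was born in 1919 and has moved in and out of its home territory several times. I want to build a dataset that summarizes the min(Start) and max(Stop) dates for each consecutive run of the same status.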

For example, three rows show that the animal was inside the territory between 1/18/1966 and 1/23/1974.

 Id         Start       Stop          Status
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In

This information should be collapsed into one row with min(Start) and max(Stop), as shown here:

 Id         MinStart    MaxStop       Status
 78122      1/18/1966   1/23/1974     In

Similarly, four rows show that the animal was outside the territory between 1/24/1974 and 5/14/1988.

 Id         Start       Stop          Status
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out

This information should be collapsed into one row with min(Start) and max(Stop), as shown here:

 Id         MinStart    MaxStop       Status
 78122      1/24/1974   5/14/1988     Out

The same applies to the other In and Out statuses. The final dataset should look like this:

 Id         MinStart    MaxStop       Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   1/23/1974     In
 78122      1/24/1974   5/14/1988     Out
 78122      5/15/1988   2/23/1990     In
 78122      2/24/1990   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/28/1994   1/25/1996     Out
 78122      1/26/1996   9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

Any suggestions on how to rearrange this dataset according to the criteria above would be very helpful. So far I have tried:

 test1 <- testcase %>% 
          group_by(ID,Status) %>% 
          summarize(MinStart  = min(Start), MaxStop= max(Stop))

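But this does not work. It produces a single min start and max stop date across all In rows together, and another across all Out rows together, which is not what I want.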

Answer 1

You need some run-length encoding. For convenience I'll use data.table::rleid, but you can use a base R version if you prefer:

library(dplyr)
library(data.table)
testcase %>% 
  group_by(Id, RLE = rleid(Status)) %>%
  arrange(Start) %>%
  dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups:   Id [1]
     Id   RLE Start      Stop       Status
  <int> <int> <date>     <date>     <chr> 
1 78122     1 1919-10-12 1919-10-12 Birth 
2 78122     2 1966-01-18 1974-01-23 In    
3 78122     3 1974-01-24 1988-05-14 Out   
4 78122     4 1988-05-15 1990-02-23 In    
5 78122     5 1990-02-24 1994-05-03 Out   
6 78122     6 1994-05-04 1994-07-27 In    
7 78122     7 1994-07-22 1996-01-25 Out   
8 78122     8 1996-01-26 2009-09-01 In    
9 78122     9 2009-09-26 2009-09-26 Death

Note that I converted your dates to class Date, a step I'll leave to you. Otherwise they cannot be sorted correctly.
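As a minimal sketch of that conversion step, assuming the dates are stored as month/day/year strings as in the question, base R's as.Date() with an explicit format should be enough. The name `testcase` follows the question; the three rows are copied from its table.

```r
# Three rows copied from the question's table, dates still stored as strings
testcase <- data.frame(
  Id     = c(78122L, 78122L, 78122L),
  Start  = c("9/9/1972", "1/24/1974", "10/23/1975"),
  Stop   = c("1/23/1974", "10/22/1975", "5/4/1979"),
  Status = c("In", "Out", "Out")
)

# As character strings the dates compare alphabetically, not chronologically
max(testcase$Start)   # "9/9/1972", because "9" sorts after "1"

# Convert both date columns to class Date using the month/day/year format
date_cols <- c("Start", "Stop")
testcase[date_cols] <- lapply(testcase[date_cols], as.Date, format = "%m/%d/%Y")

max(testcase$Start)   # "1975-10-23", now a chronological maximum
```

After the conversion, min() and max() compare the values chronologically.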

Here is the group_by call without data.table:

...
  group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...
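To illustrate what the base R expression computes, here is a small sketch on five rows copied from the question, with the dates already converted. The names `excerpt`, `run_id`, and `res` are only illustrative, and the split/lapply summary is a base R stand-in for the dplyr pipeline, not the answer's own code.

```r
# Five rows from the question (dates already converted to class Date)
excerpt <- data.frame(
  Id     = 78122L,
  Start  = as.Date(c("1966-01-18", "1972-02-03", "1972-09-09", "1974-01-24", "1988-05-15")),
  Stop   = as.Date(c("1972-02-02", "1972-09-08", "1974-01-23", "1975-10-22", "1988-06-18")),
  Status = c("In", "In", "In", "Out", "In")
)

# rle() finds runs of identical consecutive values; repeating each run's
# index by its length gives one id per row that changes with every status change
run_id <- with(rle(excerpt$Status), rep(seq_along(lengths), lengths))
run_id   # 1 1 1 2 3

# Summarise each run separately, so the two "In" periods stay apart
res <- do.call(rbind, lapply(split(excerpt, run_id), function(g) {
  data.frame(Id = g$Id[1], MinStart = min(g$Start),
             MaxStop = max(g$Stop), Status = g$Status[1])
}))
res
```

With more than one Id, the split would need to use both Id and the run id, which is what group_by(Id, RLE = ...) does in the dplyr version.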

Answer 2

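One approach is to parse the dates and coerce them to numeric with sapply, so that range can be used on them later. Then, inside ave, we use mapply on the rle result so that a counter increases by 1 at every status change; pasted together with Id and Status, it forms the key x. Now we can simply aggregate the ranges over Id and x. Subsetting the columns already gives us the result; we only need to convert it back with as.Date and cbind the suffix of x, extracted with gsub, to it.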

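# Parse the dates; sapply() drops the Date class, leaving day counts that range() can handle
d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
# Run counter: increases by 1 at every status change (seq_along is safe for a single run)
f <- function(x) {r <- rle(x)$lengths; unlist(mapply(rep, seq_along(r), r))}
# Key combining Id, run number and Status
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
# range() per key; keep Id, x, min(Start) and max(Stop)
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
# Convert back to Date, then recover Status from the last word of x
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\\s", "", r$x))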
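
The numeric round trip that this approach relies on can be seen in isolation. The following is a small sketch using two dates from the question; the variable names are illustrative only.

```r
# Two date strings taken from the question
x <- c("1/18/1966", "1/23/1974")

# sapply() simplifies its result and drops the Date class,
# leaving plain day counts relative to 1970-01-01
num <- sapply(x, function(s) as.Date(s, "%m/%d/%Y"), USE.NAMES = FALSE)
class(num)   # "numeric"

# range() works on the numbers; as.Date() with an origin restores the dates
back <- as.Date(range(num), origin = "1970-01-01")
back         # "1966-01-18" "1974-01-23"
```

This is why the code converts columns 3 and 4 back with as.Date(origin = "1970-01-01") after aggregating.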

Result

r[order(r$Id), ]
#       Id   MinStart    MaxStop Status
# 1  78122 1919-10-12 1919-10-12  Birth
# 2  78122 1966-01-18 1974-01-23     In
# 3  78122 1974-01-24 1988-05-14    Out
# 4  78122 1988-05-15 1990-02-23     In
# 5  78122 1990-02-24 1994-05-03    Out
# 6  78122 1994-05-04 1994-07-27     In
# 7  78122 1994-07-22 1996-01-25    Out
# 8  78122 1996-01-26 2009-09-01     In
# 9  78122 2009-09-26 2009-09-26  Death
# 10 78123 1919-10-12 1919-10-12  Birth
# 11 78123 1966-01-18 1974-01-23     In
# 12 78123 1974-01-24 1988-05-14    Out
# 13 78123 1988-05-15 1990-02-23     In
# 14 78123 1990-02-24 1994-05-03    Out
# 15 78123 1994-05-04 1994-07-27     In
# 16 78123 1994-07-22 1996-01-25    Out
# 17 78123 1996-01-26 2009-09-01     In
# 18 78123 2009-09-26 2009-09-26  Death

Data:

Note: for demonstration purposes, the data frame is duplicated, with Id incremented by one in the copy.

d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth", 
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In", 
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death", 
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In", 
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In", 
"Death")), class = "data.frame", row.names = c(NA, -40L))

Answer 3

For example, you can use insurancerating::reduce():

library(insurancerating)
library(dplyr)
library(lubridate)

d %>% 
  mutate(across(c(Start, Stop), lubridate::mdy)) %>%
  insurancerating::reduce(begin = Start, end = Stop, Id, Status)

#       Id Status index      Start       Stop
# 1  78122  Birth     0 1919-10-12 1919-10-12
# 2  78122  Death     0 2009-09-26 2009-09-26
# 3  78122     In     0 1966-01-18 1974-01-23
# 4  78122     In     1 1988-05-15 1990-02-23
# 5  78122     In     2 1994-05-04 1994-07-27
# 6  78122     In     3 1996-01-26 2009-09-01
# 7  78122    Out     0 1974-01-24 1988-05-14
# 8  78122    Out     1 1990-02-24 1994-05-03
# 9  78122    Out     2 1994-07-22 1996-01-25
# 10 78123  Birth     0 1919-10-12 1919-10-12
# 11 78123  Death     0 2009-09-26 2009-09-26
# 12 78123     In     0 1966-01-18 1974-01-23
# 13 78123     In     1 1988-05-15 1990-02-23
# 14 78123     In     2 1994-05-04 1994-07-27
# 15 78123     In     3 1996-01-26 2009-09-01
# 16 78123    Out     0 1974-01-24 1988-05-14
# 17 78123    Out     1 1990-02-24 1994-05-03
# 18 78123    Out     2 1994-07-22 1996-01-25

Note: d is the data given by @jay.sf

Answer 1 (score 1, accepted): 2021-01-06 18:49:22
Answer 2 (score 0): 2021-01-06 19:56:51
Answer 3 (score 0): 2021-01-06 20:19:59
