R: min and max dates by Id across multiple status changes within an Id

I have an animal-tracking dataset that looks like this:

 Id         Start       Stop          Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out
 78122      5/15/1988   6/18/1988     In
 78122      6/19/1988   1/12/1989     In
 78122      1/13/1989   2/23/1990     In
 78122      2/24/1990   6/15/1991     Out
 78122      6/16/1991   2/11/1993     Out
 78122      2/12/1993   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/22/1994   1/25/1996     Out
 78122      1/26/1996   11/13/2001    In
 78122      11/14/2001  11/19/2001    In
 78122      11/20/2001  9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

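This animal was born in 1919 but moved in and out of its home territory many times. I would like to create a dataset like the one below, summarizing the min(Start) and max(Stop) dates by status.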

For example, three rows show that the animal was inside the territory between 1/18/1966 and 1/23/1974.

 Id         Start       Stop          Status
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In

This information should be collapsed into one row with min(Start) and max(Stop), as shown here:

 Id         MinStart    MaxStop       Status
 78122      1/18/1966   1/23/1974     In

Similarly, four rows show that the animal was outside the territory between 1/24/1974 and 5/14/1988.

 Id         Start       Stop          Status
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out

This information should be collapsed into one row with min(Start) and max(Stop), as shown here:

 Id         MinStart    MaxStop       Status
 78122      1/24/1974   5/14/1988     Out

The same applies to the other In and Out statuses. The final dataset should look like this:

 Id         MinStart    MaxStop       Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   1/23/1974     In
 78122      1/24/1974   5/14/1988     Out
 78122      5/15/1988   2/23/1990     In
 78122      2/24/1990   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/28/1994   1/25/1996     Out
 78122      1/26/1996   9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

Any suggestions on how to rearrange this dataset according to the criteria above would be very helpful. So far I have tried:

 test1 <- testcase %>% 
          group_by(Id, Status) %>% 
          summarize(MinStart = min(Start), MaxStop = max(Stop))

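But this does not seem to work. It just computes a single minimum Start and maximum Stop date across all In statuses together and all Out statuses together, which is not correct.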
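The problem can be reproduced with a small base R sketch. The data frame `d0` below is a made-up three-row stand-in for the table above, with dates already parsed. Grouping on Status alone merges the two separate In periods into one span that also covers the Out period between them.

```r
# Hypothetical three-row stand-in: In, then Out, then In again
d0 <- data.frame(
  Status = c("In", "Out", "In"),
  Start  = as.Date(c("1966-01-18", "1974-01-24", "1988-05-15")),
  Stop   = as.Date(c("1974-01-23", "1988-05-14", "1990-02-23"))
)

# Grouping by Status only: every "In" row lands in the same group
min_start <- tapply(d0$Start, d0$Status, function(x) format(min(x)))
max_stop  <- tapply(d0$Stop,  d0$Status, function(x) format(max(x)))

# The single "In" group runs from the first In start to the last In stop
min_start[["In"]]
max_stop[["In"]]
```

What is needed instead is a grouping key that changes every time the status changes, so that non-adjacent runs of the same status stay separate.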

You need some kind of run-length encoding. For convenience I will use data.table::rleid, but you can use a base R version if needed:

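library(data.table)  # for rleid()
library(dplyr)

testcase %>% 
  # rleid() increases by 1 each time Status changes, so every run is its own group
  group_by(Id, RLE = rleid(Status)) %>%
  arrange(Start) %>%
  dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))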
# A tibble: 9 x 5
# Groups:   Id [1]
     Id   RLE Start      Stop       Status
  <int> <int> <date>     <date>     <chr> 
1 78122     1 1919-10-12 1919-10-12 Birth 
2 78122     2 1966-01-18 1974-01-23 In    
3 78122     3 1974-01-24 1988-05-14 Out   
4 78122     4 1988-05-15 1990-02-23 In    
5 78122     5 1990-02-24 1994-05-03 Out   
6 78122     6 1994-05-04 1994-07-27 In    
7 78122     7 1994-07-22 1996-01-25 Out   
8 78122     8 1996-01-26 2009-09-01 In    
9 78122     9 2009-09-26 2009-09-26 Death 

Note that I converted your dates to class Date, a step I will leave to you. Otherwise they will not sort correctly.
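A minimal sketch of that conversion in base R, assuming the columns hold month/day/year strings as in the question. The two-row `testcase` here is a made-up stand-in, not the full data.

```r
# Hypothetical stand-in with two rows taken from the question's table
testcase <- data.frame(
  Id     = c(78122L, 78122L),
  Start  = c("9/9/1972", "10/23/1975"),
  Stop   = c("1/23/1974", "5/4/1979"),
  Status = c("In", "Out")
)

# As character strings the values are compared text-wise, not chronologically
min_as_text <- min(testcase$Start)   # "10/23/1975", although 1972 is earlier

# Parse the month/day/year strings into Date objects
testcase$Start <- as.Date(testcase$Start, format = "%m/%d/%Y")
testcase$Stop  <- as.Date(testcase$Stop,  format = "%m/%d/%Y")

min_as_date <- min(testcase$Start)   # 1972-09-09
```

Once the columns are Dates, min(), max() and arrange() all order them chronologically.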

Here is the group_by call without data.table:

...
  group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...
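To see what the base R expression produces, here is a small standalone sketch on a made-up status vector (the names `status`, `runs` and `run_id` are only for illustration):

```r
# Example status sequence with three runs: In, Out, In
status <- c("In", "In", "Out", "Out", "Out", "In")

# rle() reports the length of each run of identical consecutive values
runs <- rle(status)

# Repeat each run's index by its length to get one run id per row
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 1 2 2 2 3
```

The second In run gets id 3 rather than 1, which is what keeps it apart from the first In run when grouping.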

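One approach is to parse the dates and, using sapply, coerce them to numeric so that range can be used on them later. Then, inside ave, we use rle to build a variable x that grows by 1 each time the status changes. We can now simply aggregate the ranges over Id and x. Subsetting the columns already gives us the result; we only need to convert the dates back with as.Date and cbind them with the status, which is extracted from the suffix of x using gsub.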

# parse the dates and keep them numeric so that range() works inside aggregate()
d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
# run counter: grows by 1 each time the value changes
f <- function(x) {r <- rle(x)$lengths; rep(seq_along(r), r)}
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
# min/max per Id and run; keep Id, x, min(Start) and max(Stop)
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\\s", "", r$x))
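The do.call(data.frame, ...) step and the column subset can look cryptic. As a small sketch on made-up data (`df`, `agg` and `flat` are illustrative names), aggregate with FUN=range stores the minimum and maximum together in one matrix column, and do.call(data.frame, ...) spreads that matrix into separate columns:

```r
# Toy data: group "a" has values 3 and 1, group "b" has the single value 5
df <- data.frame(g = c("a", "a", "b"), v = c(3, 1, 5))

# range() returns c(min, max), so the aggregated v is a two-column matrix
agg <- aggregate(v ~ g, df, FUN = range)
ncol(agg)        # 2: g plus one matrix column
is.matrix(agg$v)

# Flatten the matrix column into ordinary columns: g, min, max
flat <- do.call(data.frame, agg)
ncol(flat)       # 3
```

With two aggregated variables, as in the answer, flattening yields a min and a max column for each, which is why only the minimum of Start and the maximum of Stop are kept by the subset.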

Result

r[order(r$Id), ]
#       Id   MinStart    MaxStop Status
# 1  78122 1919-10-12 1919-10-12  Birth
# 2  78122 1966-01-18 1974-01-23     In
# 3  78122 1974-01-24 1988-05-14    Out
# 4  78122 1988-05-15 1990-02-23     In
# 5  78122 1990-02-24 1994-05-03    Out
# 6  78122 1994-05-04 1994-07-27     In
# 7  78122 1994-07-22 1996-01-25    Out
# 8  78122 1996-01-26 2009-09-01     In
# 9  78122 2009-09-26 2009-09-26  Death
# 10 78123 1919-10-12 1919-10-12  Birth
# 11 78123 1966-01-18 1974-01-23     In
# 12 78123 1974-01-24 1988-05-14    Out
# 13 78123 1988-05-15 1990-02-23     In
# 14 78123 1990-02-24 1994-05-03    Out
# 15 78123 1994-05-04 1994-07-27     In
# 16 78123 1994-07-22 1996-01-25    Out
# 17 78123 1996-01-26 2009-09-01     In
# 18 78123 2009-09-26 2009-09-26  Death

Data:

Note: for demonstration purposes the data frame is doubled, with Id incremented by one in the second copy.

d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth", 
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In", 
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death", 
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In", 
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In", 
"Death")), class = "data.frame", row.names = c(NA, -40L))

For example, you can use insurancerating::reduce():

library(insurancerating)
library(dplyr)
library(lubridate)

d %>% 
  mutate(across(c(Start, Stop), lubridate::mdy)) %>%
  insurancerating::reduce(begin = Start, end = Stop, Id, Status)

#       Id Status index      Start       Stop
# 1  78122  Birth     0 1919-10-12 1919-10-12
# 2  78122  Death     0 2009-09-26 2009-09-26
# 3  78122     In     0 1966-01-18 1974-01-23
# 4  78122     In     1 1988-05-15 1990-02-23
# 5  78122     In     2 1994-05-04 1994-07-27
# 6  78122     In     3 1996-01-26 2009-09-01
# 7  78122    Out     0 1974-01-24 1988-05-14
# 8  78122    Out     1 1990-02-24 1994-05-03
# 9  78122    Out     2 1994-07-22 1996-01-25
# 10 78123  Birth     0 1919-10-12 1919-10-12
# 11 78123  Death     0 2009-09-26 2009-09-26
# 12 78123     In     0 1966-01-18 1974-01-23
# 13 78123     In     1 1988-05-15 1990-02-23
# 14 78123     In     2 1994-05-04 1994-07-27
# 15 78123     In     3 1996-01-26 2009-09-01
# 16 78123    Out     0 1974-01-24 1988-05-14
# 17 78123    Out     1 1990-02-24 1994-05-03
# 18 78123    Out     2 1994-07-22 1996-01-25

Note: d is the data given by @jay.sf.
