
How can I create a new dataframe in R that combines the first date and last date available for each ID?

For instance, suppose I have the following dataframe:

ID<-c("A", "A", "B", "B", "B", "C")
StartDate<-as.Date(c("2018-01-01", "2019-02-05", "2016-04-18", "2020-03-03", "2021-12-13", "2014-03-03"), "%Y-%m-%d")
TermDate<-as.Date(c("2018-02-01", NA, "2016-05-18", "2020-04-03", "2021-12-15", "2014-04-03"), "%Y-%m-%d")
df<-data.frame(ID=ID, StartDate=StartDate, TermDate=TermDate)

  ID  StartDate   TermDate
1  A 2018-01-01 2018-02-01
2  A 2019-02-05       <NA>
3  B 2016-04-18 2016-05-18
4  B 2020-03-03 2020-04-03
5  B 2021-12-13 2021-12-15
6  C 2014-03-03 2014-04-03

What I'm ultimately trying to get is the following:


  ID  StartDate   TermDate
1  A 2018-01-01       <NA>
2  B 2016-04-18 2021-12-15
3  C 2014-03-03 2014-04-03

There are functions first and last in dplyr and data.table that could help here.

library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(StartDate = first(StartDate), 
            TermDate = last(TermDate))

#  ID    StartDate  TermDate  
#* <chr> <date>     <date>    
#1 A     2018-01-01 NA        
#2 B     2016-04-18 2021-12-15
#3 C     2014-03-03 2014-04-03

With data.table:

library(data.table)
setDT(df)[, .(StartDate = first(StartDate), TermDate = last(TermDate)), ID]

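Using min and max instead of first and last removes the need to sort the data first, in case it is not already ordered by date within each ID. Note that max returns NA when any TermDate in the group is NA, which is the desired result here.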

df %>% group_by(ID) %>%
  summarise(StartDate = min(StartDate),
            TermDate = max(TermDate))

# A tibble: 3 x 3
  ID    StartDate  TermDate  
* <chr> <date>     <date>    
1 A     2018-01-01 NA        
2 B     2016-04-18 2021-12-15
3 C     2014-03-03 2014-04-03
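One caveat, sketched here on hypothetical data (ID "D" and the column names last_term, max_term and max_term_narm are not part of the question): min/max and first/last can disagree when an NA sits in a row other than the last one. Whether NA should win depends on what a missing TermDate means in your data, so treat this as an illustration rather than a recommendation.

```r
library(dplyr)

# Hypothetical group: the first spell has no TermDate, the later one is closed
toy <- data.frame(
  ID = c("D", "D"),
  StartDate = as.Date(c("2015-01-01", "2016-01-01")),
  TermDate = as.Date(c(NA, "2016-06-01"))
)

cmp <- toy %>%
  group_by(ID) %>%
  summarise(last_term = last(TermDate),                   # value in the last row
            max_term = max(TermDate),                     # NA, because one value is NA
            max_term_narm = max(TermDate, na.rm = TRUE))  # latest non-missing date
```

Here last_term and max_term_narm are both 2016-06-01, while max_term is NA.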

first and last depend on row order. To see why that matters, suppose the rows for ID A are not sorted by date, like this:

> df
  ID  StartDate   TermDate
1  A 2019-02-05       <NA>
2  A 2018-01-01 2018-02-01
3  B 2016-04-18 2016-05-18
4  B 2020-03-03 2020-04-03
5  B 2021-12-13 2021-12-15
6  C 2014-03-03 2014-04-03

df %>% group_by(ID) %>%
  summarise(StartDate = first(StartDate),
            TermDate = last(TermDate))

# A tibble: 3 x 3
  ID    StartDate  TermDate  
* <chr> <date>     <date>    
1 A     2019-02-05 2018-02-01
2 B     2016-04-18 2021-12-15
3 C     2014-03-03 2014-04-03
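In the output above, ID A gets the later StartDate and the earlier TermDate, because first and last simply take the first and last rows of each group. If you prefer to keep first and last, one option is to sort within each ID beforehand. This is a minimal sketch, assuming StartDate is the right column to order by; df_unsorted and result_sorted are names made up for the illustration.

```r
library(dplyr)

# Same rows as the unsorted example above
df_unsorted <- data.frame(
  ID = c("A", "A", "B", "B", "B", "C"),
  StartDate = as.Date(c("2019-02-05", "2018-01-01", "2016-04-18",
                        "2020-03-03", "2021-12-13", "2014-03-03")),
  TermDate = as.Date(c(NA, "2018-02-01", "2016-05-18",
                       "2020-04-03", "2021-12-15", "2014-04-03"))
)

result_sorted <- df_unsorted %>%
  arrange(ID, StartDate) %>%   # put each ID's rows in chronological order
  group_by(ID) %>%
  summarise(StartDate = first(StartDate),
            TermDate = last(TermDate))
```

After sorting, ID A again gets StartDate 2018-01-01 and a missing TermDate.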

We can also index each column directly, taking element 1 and element n() within each group:

library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(StartDate = StartDate[1], 
            TermDate = TermDate[n()])

Another data.table option

setDT(df)[
  ,
  as.list(
    setNames(
      data.frame(.SD)[cbind(c(1, .N), c(1, 2))],
      names(.SD)
    )
  ), ID
]

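gives the following (the date columns come back as character here, since indexing a data.frame with a matrix goes through as.matrix):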

   ID  StartDate   TermDate
1:  A 2018-01-01       <NA>
2:  B 2016-04-18 2021-12-15
3:  C 2014-03-03 2014-04-03
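A simpler data.table sketch, in the same spirit as the dplyr `[1]` / `[n()]` variant, indexes each column directly with `.N` and keeps both columns as Date. The names dt and res are only for illustration, and the data is rebuilt from the question so the snippet runs on its own.

```r
library(data.table)

# Data from the question, built directly as a data.table
dt <- data.table(
  ID = c("A", "A", "B", "B", "B", "C"),
  StartDate = as.Date(c("2018-01-01", "2019-02-05", "2016-04-18",
                        "2020-03-03", "2021-12-13", "2014-03-03")),
  TermDate = as.Date(c("2018-02-01", NA, "2016-05-18",
                       "2020-04-03", "2021-12-15", "2014-04-03"))
)

# .N is the number of rows in the current group, so TermDate[.N] is its last value
res <- dt[, .(StartDate = StartDate[1L], TermDate = TermDate[.N]), by = ID]
```

Like first and last, this relies on the rows already being in date order within each ID.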
