
A way to recode consecutive Start and Stop Dates in Long format to one vector

I have long-format data where people stay in a location for multiple weeks. Some entries represent a single stay, while others represent consecutive stays where the person "re-upped" their registration.

I want to identify a way to re-code the data such that each row only represents one stay per person, collapsing the single stays with multiple entries into one row.

I'd like to do this by pulling the true start and stop date into a single row per instance.

The issue is that we have no way of grouping these stays other than checking whether the previous end date equals the next start date. Both the number of true stays and the number of entries per stay vary widely between individuals.

This is an example of what the data looks like:

ID   Start_Date     End_Date
1     05/06/18       05/10/18
1     05/10/18       05/14/18  
1     05/14/18       05/25/18
1     06/28/19       07/02/19
1     07/02/19       07/08/19
2     04/20/18       04/23/18
2     07/20/18       07/25/18 
2     07/26/18       07/30/18 
3     05/14/17       05/29/17

I want it to look like:

ID    Start_Date     End_Date
1      05/06/18      05/25/18
1      06/28/19      07/08/19
2      04/20/18      04/23/18
2      07/20/18      07/30/18
3      05/14/17      05/29/17

I am open to using R or SPSS to solve this - I have been dabbling with both but keep getting stuck, especially because I have some missing end dates.

I tried to do it all in one aggregate() call, but it got a bit messy. It is easier to split() the data and then lapply() over the pieces.

rr <- read.table(text="
   ID   Start_Date     End_Date
    1     05/06/18       05/10/18
    1     05/10/18       05/14/18  
    1     05/14/18       05/25/18
    1     06/28/19       07/02/19
    1     07/02/19       07/08/19
    2     04/20/18       04/23/18
    2     07/20/18       07/25/18 
    2     07/26/18       07/30/18 
    3     05/14/17       05/29/17", 
    stringsAsFactors=FALSE, header=TRUE)

# Convert to Date class
rr[,2:3] <- lapply(rr[,2:3], as.Date, format="%m/%d/%y")

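# Group rows whose start date equals the previous row's end date
consec <- cumsum(c(FALSE, head(rr[,3], -1) - tail(rr[,2], -1) != 0))

# Alternatively, allow a gap of 0 or 1 day between the previous end date and
# the next start date. This overwrites the assignment above and is the version
# that yields the output shown below (ID 2's 07/25/18 and 07/26/18 rows merge).
# With a missing end date the %in% test is FALSE, so the next row starts a new
# group; the strict version above would instead propagate NA through cumsum().
consec <- cumsum(c(FALSE, !(tail(rr[,2], -1) - head(rr[,3], -1)) %in% c(0, 1)))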

# Combine with ID
consec <- paste(rr$ID, consec, sep=".")

# Split rows by group
sp <- split(rr, consec)

# Take the top-left and bottom-right value of each data.frame fragment
rrl <- lapply(sp, 
  function(x) {
      data.frame(ID=x[1, 1], Start_Date=x[1, 2], End_Date=x[nrow(x), 3])
  }
)

# Rejoin vertically
rr2 <- do.call(rbind, rrl)
rr2
#     ID Start_Date   End_Date
# 1.0  1 2018-05-06 2018-05-25
# 1.1  1 2019-06-28 2019-07-08
# 2.2  2 2018-04-20 2018-04-23
# 2.3  2 2018-07-20 2018-07-30
# 3.4  3 2017-05-14 2017-05-29
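
The code above assumes the rows are already sorted by ID and start date, and the question mentions missing end dates. The following is only a sketch of one way to cover both points, not part of the original answer. The `stays` data, the column names `prev_end`, `same_id`, `gap`, `continues` and `stay_no`, and the rule that a missing end date closes the current stay are all assumptions made for illustration; adjust the rule if a missing end date should mean something else in your data.

```r
# Hypothetical data: same layout as the question, but unsorted and with
# one missing End_Date.
stays <- read.table(text="
ID Start_Date End_Date
2 07/26/18 07/30/18
1 05/06/18 05/10/18
1 05/10/18 NA
1 05/14/18 05/25/18
2 07/20/18 07/25/18
2 04/20/18 04/23/18",
  header=TRUE, stringsAsFactors=FALSE)

# Convert to Date class; the NA entry stays NA
stays[, 2:3] <- lapply(stays[, 2:3], as.Date, format="%m/%d/%y")

# Sort so that each person's rows are adjacent and in time order
stays <- stays[order(stays$ID, stays$Start_Date), ]

# End date of the previous row, and whether that row is the same person
prev_end <- c(as.Date(NA), head(stays$End_Date, -1))
same_id  <- c(FALSE, tail(stays$ID, -1) == head(stays$ID, -1))

# Gap in days between this start and the previous end (NA if end is missing)
gap <- as.numeric(stays$Start_Date - prev_end)

# A row continues the previous stay only for the same person with a
# known gap of 0 or 1 day (assumption: a missing end date breaks the chain)
continues <- same_id & !is.na(gap) & gap %in% c(0, 1)

# Every row that does not continue a stay starts a new one
stay_no <- cumsum(!continues)

# First start date and last end date of each stay
out <- do.call(rbind, lapply(split(stays, stay_no), function(x) {
  data.frame(ID = x$ID[1],
             Start_Date = x$Start_Date[1],
             End_Date = x$End_Date[nrow(x)])
}))
rownames(out) <- NULL
out
```

Because `same_id` is part of the test, a stay can never chain across two different people, so the separate paste() step with ID is not needed here. With this sample data the result has four rows, and the first stay for ID 1 keeps a missing End_Date because its last entry has none.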
