
Finding the duration between start and end flags in a data frame column, splitting it across days if it exceeds one day

I have a dataframe like this:

   timestamp           Status
05-01-2020  12:07:08    0
05-01-2020  12:36:05    1
05-01-2020  23:45:02    0
05-01-2020  13:44:33    1
06-01-2020  01:07:08    1
06-01-2020  10:23:05    1
06-01-2020  12:11:08    1
06-01-2020  22:06:12    1
07-01-2020  00:01:05    0
07-01-2020  02:17:09    1
07-01-2020  12:36:05    1
07-01-2020  12:07:08    1
07-01-2020  12:36:05    1
07-01-2020  12:36:05    0
08-01-2020  12:36:05    1
08-01-2020  12:36:05    0
08-01-2020  12:36:05    0
09-01-2020  12:36:05    1
09-01-2020  12:07:08    0
09-01-2020  12:36:05    1
11-01-2020  12:07:08    0
11-01-2020  12:36:05    1

I am trying to find the duration between each 1,0 pair, but in my data the statuses can come in any order: 1 and 0 may alternate one by one, or many 1s may be followed by a 0, and so on. The first occurrence of 1 is the start and the first occurrence of 0 after it is the end, regardless of how many 1s lie in between. If the start (1) is on one day and the end (0) is on a later day, I want to cut the duration at each day boundary, provided the dates are consecutive (like 1, 2, 3, 4).

I can calculate the duration in the straightforward case where the 1 and the 0 are on the same date. When they are on two different dates, I can also calculate the difference between the occurrence of 1 and 23:59:59 on the first day, and similarly from 00:00:00 to the occurrence of 0 on the second day.

For example, take a pair of rows like this:

07-01-2020  21:26:05    1
08-01-2020  02:33:45    0

These two fall on different dates, so instead of taking the difference directly I want to cut it in two. On the first day (07-01-2020) the duration runs from 21:26:05 to 23:59:59, and on the second day from 00:00:00 to 02:33:45. This should repeat for any number of consecutive dates (like 7, 8, 9, 10, etc.).
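The split described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `split_by_day` is a hypothetical helper name (it is not part of the question or of any package), timestamps are parsed as day-month-year in UTC, and each full day is closed at 23:59:59 as in the question.

```r
# Hypothetical helper (not from the original post): split one interval
# into per-day pieces, closing each day at 23:59:59 as the question does.
split_by_day <- function(start, end, tz = "UTC") {
  days <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")
  from <- as.POSIXct(paste(days, "00:00:00"), tz = tz)
  to   <- as.POSIXct(paste(days, "23:59:59"), tz = tz)
  from[1] <- start         # the first piece begins at the actual start
  to[length(to)] <- end    # the last piece stops at the actual end
  data.frame(from = from, to = to,
             mins = as.numeric(difftime(to, from, units = "mins")))
}

# Assumed input format: day-month-year, as in the question's data.
fmt   <- "%d-%m-%Y %H:%M:%S"
start <- as.POSIXct("07-01-2020 21:26:05", format = fmt, tz = "UTC")
end   <- as.POSIXct("08-01-2020 02:33:45", format = fmt, tz = "UTC")
pieces <- split_by_day(start, end)
print(pieces)
```

Because the pieces are built from a sequence of dates, the same call should also cover intervals that span three or more consecutive days.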

But if I have data like this:

07-01-2020  21:26:05    1
08-01-2020  02:33:45    1
09-01-2020  21:26:05    1
11-01-2020  02:33:45    1

I have to cut as follows (after the 9th comes the 11th, so the continuity is broken):

07-01-2020  21:26:05 to  07-01-2020  23:59:59
08-01-2020  00:00:00 to  08-01-2020  02:33:45
08-01-2020  02:33:45 to  08-01-2020  23:59:59
09-01-2020  00:00:00 to  09-01-2020  21:26:05
09-01-2020  21:26:05 to  09-01-2020  23:59:59
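One possible way to detect where the continuity breaks is to label blocks of consecutive calendar dates. The sketch below is an assumption-laden illustration, not the asker's code: the variable names `ts`, `day` and `block` are made up here, and timestamps are taken to be day-month-year in UTC.

```r
# The four rows from the example above (assumed day-month-year, UTC).
ts <- as.POSIXct(c("07-01-2020 21:26:05", "08-01-2020 02:33:45",
                   "09-01-2020 21:26:05", "11-01-2020 02:33:45"),
                 format = "%d-%m-%Y %H:%M:%S", tz = "UTC")
day <- as.Date(ts, tz = "UTC")

# A new block starts whenever the gap to the previous row is more than one day.
block <- cumsum(c(TRUE, diff(as.numeric(day)) > 1))

print(data.frame(timestamp = ts, block = block))
```

Rows on the 7th, 8th and 9th share one block, while the 11th starts a new one, so any per-day cutting can then be done within each block separately.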

Conditions like this:

07-01-2020  21:26:05    1
07-01-2020  22:33:45    1
07-01-2020  23:31:51    1
07-01-2020  23:48:33    0
07-01-2020  23:57:12    0

are the same as:

 07-01-2020  21:26:05    1
  07-01-2020  23:48:33    0
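The reduction above (keep the first 1 of a run and the first 0 that follows) can be sketched in base R by keeping only the rows where `Status` changes. This is a minimal illustration, assuming the data frame is sorted by time and has the columns `timestamp` and `Status` from the question; `is_run_start` and `collapsed` are names introduced here.

```r
# The five rows from the example above (assumed day-month-year, UTC).
df <- data.frame(
  timestamp = as.POSIXct(c("07-01-2020 21:26:05", "07-01-2020 22:33:45",
                           "07-01-2020 23:31:51", "07-01-2020 23:48:33",
                           "07-01-2020 23:57:12"),
                         format = "%d-%m-%Y %H:%M:%S", tz = "UTC"),
  Status = c(1, 1, 1, 0, 0)
)

# Keep the first row, plus every row whose Status differs from the row before it.
is_run_start <- c(TRUE, diff(df$Status) != 0)
collapsed <- df[is_run_start, ]
print(collapsed)
```

Repeated 1s and repeated 0s are dropped, leaving only the start at 21:26:05 and the end at 23:48:33.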

And conditions like this:

07-01-2020  21:26:05    1
07-01-2020  22:33:45    1
07-01-2020  23:31:51    1
08-01-2020  03:48:33    0
08-01-2020  03:57:12    0

is the same as:

  07-01-2020  21:26:05   to  07-01-2020  23:59:59
  08-01-2020  00:00:00   to  08-01-2020  03:48:33

I tried an ifelse condition in data.table and was able to do the first split, from x to 23:59:59 on the first day, but none of the other conditions work.

 library(data.table)

 # Attempt: minutes since midnight for a 0 whose previous row is a 1 on another day,
 # minutes until 23:59:59 for a 1 whose next row is a 0 on another day,
 # otherwise minutes to the next row.
 df[, difference := ifelse(
   Status == 0 & shift(Status, type = "lag") == 1 &
     as.Date(timestamp) != shift(as.Date(timestamp), type = "lag"),
   as.numeric(timestamp - as.POSIXct(paste0(as.Date(timestamp), " ", "00:00:00"), tz = "UTC"), units = "mins"),
   ifelse(
     Status == 1 & shift(Status, type = "lead") == 0 &
       as.Date(timestamp) != shift(as.Date(timestamp), type = "lead"),
     as.numeric(as.POSIXct(paste0(as.Date(timestamp), " ", "23:59:59"), tz = "UTC") - timestamp, units = "mins"),
     as.numeric(shift(timestamp, type = "lead") - timestamp, units = "mins")
   )
 )]
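
library(tidyverse)
# Assumes df has a POSIXct column `timestamp` and a 0/1 column `ind`
# (the column called Status in the question).

# Non-daily split: 
df %>% 
  # start a new group at every 0
  mutate(grp = cumsum(ifelse(ind == 0, 1, 0))) %>% 
  group_by(grp) %>% 
  # keep only the first 0 and the first 1 within each group
  filter(!(duplicated(ind))) %>% 
  ungroup() %>% 
  mutate(duration = difftime(timestamp, lag(timestamp), units = "hours"))


# Daily split: 
df %>% 
  # as.Date() on a POSIXct takes a time zone, not a format, as its second argument
  group_by(grp1 = as.Date(timestamp)) %>% 
  # keep only the first 0 and the first 1 on each day
  filter(!duplicated(ind)) %>% 
  ungroup() %>% 
  mutate(grp = cumsum(ifelse(ind == 0, 1, 0))) %>% 
  group_by(grp, grp1) %>% 
  mutate(duration = difftime(timestamp, lag(timestamp), units = "hours")) %>% 
  ungroup()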

Let

A = data.frame(timestamp = as.POSIXct(c("2020-07-01 21:26:05",
                                        "2020-07-02 02:33:45",
                                        "2020-07-02 10:33:45",
                                        "2020-07-03 15:33:45",
                                        "2020-07-04 02:33:45")),
               ind = c(0, 1, 1, 0, 1))

> A
            timestamp ind
1 2020-07-01 21:26:05   0
2 2020-07-02 02:33:45   1
3 2020-07-02 10:33:45   1
4 2020-07-03 15:33:45   0
5 2020-07-04 02:33:45   1

be toy data for this example. Then the following code gives you the time between the first occurrences of successive runs of 0s and 1s.

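library(dplyr)

A %>%
  # Diff is non-zero exactly where ind switches between 0 and 1
  mutate(Diff = ind - lag(ind)) %>% 
  # keep the first row and every row where ind changes
  filter(is.na(Diff) | Diff != 0) %>% 
  mutate(Timedist = timestamp - lag(timestamp)) %>%
  select(-Diff)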

with output

            timestamp ind   Timedist
1 2020-07-01 21:26:05   0    NA hours
2 2020-07-02 02:33:45   1   5.1 hours
3 2020-07-03 15:33:45   0  37.0 hours
4 2020-07-04 02:33:45   1  11.0 hours


 