简体   繁体   中英

Grouping if consecutive values meet condition

With the following table dat , my objective is to group by user_id and mobile_id only where there is a continuous sequence of values where difftime > - 600 . The sequence must be consecutive in created_at , and given a rank. Each separate group would be assigned an incremental value, For example :

> dat
            created_at user_id mobile_id   status difftime
1  2019-01-02 22:01:38 1227604     68409 finished      \\N
2  2019-01-03 04:08:29 1227604     68409 finished     -366
3  2019-01-03 15:16:38 1227604     68409  timeout     -668
4  2019-01-04 00:34:40 1227604     68409   failed     -558
5  2019-01-04 00:27:37 1227605     68453   failed      \\N
6  2019-01-04 00:35:56 1227605     68453 finished       -8
7  2019-01-04 01:39:52 1227605     68453 finished      -63
8  2019-01-04 02:05:53 1227605     68453  timeout      -26
9  2019-01-04 02:17:17 1227605     68453  timeout      -11
10 2019-01-04 16:51:39 1227605     68453  timeout     -874

Would create an output of

> output
            created_at user_id mobile_id   status difftime group rank
1  2019-01-02 22:01:38 1227604     68409 finished      \\N    NA   NA
2  2019-01-03 04:08:29 1227604     68409 finished     -366     1    1
3  2019-01-03 15:16:38 1227604     68409  timeout     -668    NA   NA
4  2019-01-04 00:34:40 1227604     68409   failed     -558     2    1
5  2019-01-04 00:27:37 1227605     68453   failed      \\N    NA   NA
6  2019-01-04 00:35:56 1227605     68453 finished       -8     3    1
7  2019-01-04 01:39:52 1227605     68453 finished      -63     3    2
8  2019-01-04 02:05:53 1227605     68453  timeout      -26     3    3
9  2019-01-04 02:17:17 1227605     68453  timeout      -11     3    4
10 2019-01-04 16:51:39 1227605     68453  timeout     -874    NA   NA

I am not sure where to begin, beyond a simple grouping in dplyr . How would one go about assigning a group and rank ?

dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at) %>%
  filter(difftime > -600)

The data:

> dput(dat)
structure(list(created_at = structure(c(1546466498.138, 1546488509.218, 
1546528598.628, 1546562080.81, 1546561657.567, 1546562156.632, 
1546565992.788, 1546567553.811, 1546568237.325, 1546620699.964
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), user_id = c(1227604, 
1227604, 1227604, 1227604, 1227605, 1227605, 1227605, 1227605, 
1227605, 1227605), mobile_id = c(68409L, 68409L, 68409L, 68409L, 
68453L, 68453L, 68453L, 68453L, 68453L, 68453L), status = c("finished", 
"finished", "timeout", "failed", "failed", "finished", "finished", 
"timeout", "timeout", "timeout"), difftime = c(NA, -366, -668, 
-558, NA, -8, -63, -26, -11, -874), group = c(NA, 1, NA, 2, NA, 
3, 3, 3, 3, NA), rank = c(NA, 1, NA, 1, NA, 1, 2, 3, 4, NA)), row.names = c(NA, 
-10L), class = "data.frame")

You can use cumsum to define a variable that increases when the observations are not consecutive based on created_at within the same group. By grouping on this new variable, too, it is easy to create the rank indices:

library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
  group_by(user_id, mobile_id) %>% 
  arrange(created_at, .by_group = TRUE) %>% ## grouped arrange
  mutate(d = cumsum(replace_na(difftime < -600, 0))) %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(rank = row_number()-1) ## rank id

Then the easiest way to create group indices is with dplyr::group_indices :

dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)

I'm not sure why you would want to set the first instances of the indicators to NA but you can do it based on the values of rank .

> mutate(dat2, group = ifelse(rank == 0, NA, group),
+        rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups:   user_id, mobile_id, d [4]
   created_at           user_id mobile_id status   difftime group rank     d
   <dttm>                 <dbl>     <int> <chr>       <dbl> <int> <dbl> <dbl>
 1 2019-01-02 22:01:38 1227604.     68409 finished      NA     NA   NA     0.
 2 2019-01-03 04:08:29 1227604.     68409 finished    -366.     1    1.    0.
 3 2019-01-03 15:16:38 1227604.     68409 timeout     -668.    NA   NA     1.
 4 2019-01-04 00:34:40 1227604.     68409 failed      -558.     2    1.    1.
 5 2019-01-04 00:27:37 1227605.     68453 failed        NA     NA   NA     0.
 6 2019-01-04 00:35:56 1227605.     68453 finished      -8.     3    1.    0.
 7 2019-01-04 01:39:52 1227605.     68453 finished     -63.     3    2.    0.
 8 2019-01-04 02:05:53 1227605.     68453 timeout      -26.     3    3.    0.
 9 2019-01-04 02:17:17 1227605.     68453 timeout      -11.     3    4.    0.
10 2019-01-04 16:51:39 1227605.     68453 timeout     -874.    NA   NA     1.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM