
Grouping if consecutive values meet condition

With the following table dat, my objective is to group by user_id and mobile_id, but only over continuous sequences of rows where difftime > -600. The sequence must be consecutive in created_at, and each row in it is given a rank. Each separate group is assigned an incremental value. For example:

> dat
            created_at user_id mobile_id   status difftime
1  2019-01-02 22:01:38 1227604     68409 finished      \\N
2  2019-01-03 04:08:29 1227604     68409 finished     -366
3  2019-01-03 15:16:38 1227604     68409  timeout     -668
4  2019-01-04 00:34:40 1227604     68409   failed     -558
5  2019-01-04 00:27:37 1227605     68453   failed      \\N
6  2019-01-04 00:35:56 1227605     68453 finished       -8
7  2019-01-04 01:39:52 1227605     68453 finished      -63
8  2019-01-04 02:05:53 1227605     68453  timeout      -26
9  2019-01-04 02:17:17 1227605     68453  timeout      -11
10 2019-01-04 16:51:39 1227605     68453  timeout     -874

would create this output:

> output
            created_at user_id mobile_id   status difftime group rank
1  2019-01-02 22:01:38 1227604     68409 finished      \\N    NA   NA
2  2019-01-03 04:08:29 1227604     68409 finished     -366     1    1
3  2019-01-03 15:16:38 1227604     68409  timeout     -668    NA   NA
4  2019-01-04 00:34:40 1227604     68409   failed     -558     2    1
5  2019-01-04 00:27:37 1227605     68453   failed      \\N    NA   NA
6  2019-01-04 00:35:56 1227605     68453 finished       -8     3    1
7  2019-01-04 01:39:52 1227605     68453 finished      -63     3    2
8  2019-01-04 02:05:53 1227605     68453  timeout      -26     3    3
9  2019-01-04 02:17:17 1227605     68453  timeout      -11     3    4
10 2019-01-04 16:51:39 1227605     68453  timeout     -874    NA   NA

I am not sure where to begin beyond a simple grouping in dplyr. How would one go about assigning a group and a rank?

dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at) %>%
  filter(difftime > -600)

The data:

> dput(dat)
structure(list(created_at = structure(c(1546466498.138, 1546488509.218, 
1546528598.628, 1546562080.81, 1546561657.567, 1546562156.632, 
1546565992.788, 1546567553.811, 1546568237.325, 1546620699.964
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), user_id = c(1227604, 
1227604, 1227604, 1227604, 1227605, 1227605, 1227605, 1227605, 
1227605, 1227605), mobile_id = c(68409L, 68409L, 68409L, 68409L, 
68453L, 68453L, 68453L, 68453L, 68453L, 68453L), status = c("finished", 
"finished", "timeout", "failed", "failed", "finished", "finished", 
"timeout", "timeout", "timeout"), difftime = c(NA, -366, -668, 
-558, NA, -8, -63, -26, -11, -874), group = c(NA, 1, NA, 2, NA, 
3, 3, 3, 3, NA), rank = c(NA, 1, NA, 1, NA, 1, 2, 3, 4, NA)), row.names = c(NA, 
-10L), class = "data.frame")

You can use cumsum to define a variable that increases whenever a row breaks the sequence (difftime < -600), with rows ordered by created_at within the same user_id and mobile_id. Grouping on this new variable as well makes it easy to create the rank indices:

library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
  group_by(user_id, mobile_id) %>% 
  arrange(created_at, .by_group = TRUE) %>% ## grouped arrange
  mutate(d = cumsum(replace_na(difftime < -600, FALSE))) %>% ## NA (first row) counts as "no break"
  group_by(user_id, mobile_id, d) %>%
  mutate(rank = row_number()-1) ## rank id
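
To see why the cumsum step works, here is a minimal base R sketch on one user's difftime values (copied from rows 5 to 10 of dat). The names is_break, d and rank0 are only illustrative and do not appear in the answer's code.

```r
## difftime values for user 1227605, already ordered by created_at
difftime <- c(NA, -8, -63, -26, -11, -874)

## TRUE where the gap breaks the sequence; NA (the first row) counts as "no break"
is_break <- !is.na(difftime) & difftime < -600

## the running count of breaks gives every consecutive run its own id
d <- cumsum(is_break)

## zero-based position inside each run, like row_number() - 1 after group_by(d)
rank0 <- ave(seq_along(d), d, FUN = seq_along) - 1

d
rank0
```

Every row that breaks the sequence bumps d by one, so it becomes the first row (rank 0) of a new run, and the rows that follow it keep the same d until the next break.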

Then the easiest way to create group indices is with dplyr::group_indices:

dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)
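
Note that in dplyr 1.0.0 and later, passing grouping columns to group_indices() through ... is deprecated. One possible equivalent, assuming dplyr >= 1.0.0, is cur_group_id() inside a grouped mutate. The data frame toy below is a made-up stand-in for dat2, not the original data.

```r
library(dplyr)

## toy stand-in for dat2 (hypothetical values): run id d is already computed
toy <- data.frame(
  user_id   = c(1, 1, 1, 2, 2),
  mobile_id = c(10L, 10L, 10L, 20L, 20L),
  d         = c(0L, 0L, 1L, 0L, 0L)
)

toy <- toy %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(group = cur_group_id()) %>% ## one integer per distinct key combination
  ungroup()

toy$group
```

The ids follow the sorted order of the grouping keys, which is the same numbering that group_indices produces for these columns.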

I'm not sure why you would want to set the first instances of the indicators to NA, but you can do it based on the values of rank.

> mutate(dat2, group = ifelse(rank == 0, NA, group),
+        rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups:   user_id, mobile_id, d [4]
   created_at           user_id mobile_id status   difftime group rank     d
   <dttm>                 <dbl>     <int> <chr>       <dbl> <int> <dbl> <dbl>
 1 2019-01-02 22:01:38 1227604.     68409 finished      NA     NA   NA     0.
 2 2019-01-03 04:08:29 1227604.     68409 finished    -366.     1    1.    0.
 3 2019-01-03 15:16:38 1227604.     68409 timeout     -668.    NA   NA     1.
 4 2019-01-04 00:34:40 1227604.     68409 failed      -558.     2    1.    1.
 5 2019-01-04 00:27:37 1227605.     68453 failed        NA     NA   NA     0.
 6 2019-01-04 00:35:56 1227605.     68453 finished      -8.     3    1.    0.
 7 2019-01-04 01:39:52 1227605.     68453 finished     -63.     3    2.    0.
 8 2019-01-04 02:05:53 1227605.     68453 timeout      -26.     3    3.    0.
 9 2019-01-04 02:17:17 1227605.     68453 timeout      -11.     3    4.    0.
10 2019-01-04 16:51:39 1227605.     68453 timeout     -874.    NA   NA     1.
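
One caveat: the ids are assigned before the first rows are set to NA, so a run that consists of a single break row (like row 10 here) uses up an id that never shows in the result. If such a row falls in the middle of the data, the visible ids will have a gap. If strictly incremental ids matter, a possible fix is to renumber the surviving ids. This is sketched in base R on a made-up vector g, where id 2 is assumed to have belonged to a lone break row.

```r
## hypothetical group ids after blanking: id 2 was used by a lone break row
g <- c(NA, 1, NA, NA, 3, 3, NA, 4)

## map the surviving ids onto 1, 2, 3, ... in order; NA stays NA
g_dense <- match(g, sort(unique(g)))

g_dense
```

sort() drops the NA by default, so match() only sees the ids that are still in use and returns NA for the blanked rows.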
