Grouping if consecutive values meet condition
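Using the table dat below, my goal is to group by user_id and mobile_id, but only within runs of consecutive rows where difftime > -600. The rows in a run must be consecutive in created_at, and each row should get a rank within its run. Each separate group would be assigned an incrementing value, so that: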
> dat
created_at user_id mobile_id status difftime
1 2019-01-02 22:01:38 1227604 68409 finished \\N
2 2019-01-03 04:08:29 1227604 68409 finished -366
3 2019-01-03 15:16:38 1227604 68409 timeout -668
4 2019-01-04 00:34:40 1227604 68409 failed -558
5 2019-01-04 00:27:37 1227605 68453 failed \\N
6 2019-01-04 00:35:56 1227605 68453 finished -8
7 2019-01-04 01:39:52 1227605 68453 finished -63
8 2019-01-04 02:05:53 1227605 68453 timeout -26
9 2019-01-04 02:17:17 1227605 68453 timeout -11
10 2019-01-04 16:51:39 1227605 68453 timeout -874
would produce the output:
> output
created_at user_id mobile_id status difftime group rank
1 2019-01-02 22:01:38 1227604 68409 finished \\N NA NA
2 2019-01-03 04:08:29 1227604 68409 finished -366 1 1
3 2019-01-03 15:16:38 1227604 68409 timeout -668 NA NA
4 2019-01-04 00:34:40 1227604 68409 failed -558 2 1
5 2019-01-04 00:27:37 1227605 68453 failed \\N NA NA
6 2019-01-04 00:35:56 1227605 68453 finished -8 3 1
7 2019-01-04 01:39:52 1227605 68453 finished -63 3 2
8 2019-01-04 02:05:53 1227605 68453 timeout -26 3 3
9 2019-01-04 02:17:17 1227605 68453 timeout -11 3 4
10 2019-01-04 16:51:39 1227605 68453 timeout -874 NA NA
Beyond a simple grouping in dplyr, I'm not sure where to start. How can I assign a group and a rank?
dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at) %>%
  filter(difftime > -600)
Data:
> dput(dat)
structure(list(created_at = structure(c(1546466498.138, 1546488509.218,
1546528598.628, 1546562080.81, 1546561657.567, 1546562156.632,
1546565992.788, 1546567553.811, 1546568237.325, 1546620699.964
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), user_id = c(1227604,
1227604, 1227604, 1227604, 1227605, 1227605, 1227605, 1227605,
1227605, 1227605), mobile_id = c(68409L, 68409L, 68409L, 68409L,
68453L, 68453L, 68453L, 68453L, 68453L, 68453L), status = c("finished",
"finished", "timeout", "failed", "failed", "finished", "finished",
"timeout", "timeout", "timeout"), difftime = c(NA, -366, -668,
-558, NA, -8, -63, -26, -11, -874), group = c(NA, 1, NA, 2, NA,
3, 3, 3, 3, NA), rank = c(NA, 1, NA, 1, NA, 1, 2, 3, 4, NA)), row.names = c(NA,
-10L), class = "data.frame")
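You can use cumsum to define a variable that increments whenever an observation is not consecutive, based on created_at, within the same user_id/mobile_id group (that is, whenever difftime < -600). Grouping by this new variable then makes it easy to create the rank index as well: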
library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at, .by_group = TRUE) %>% ## grouped arrange
  mutate(d = cumsum(replace_na(difftime < -600, 0))) %>% ## break counter; NA counts as no break
  group_by(user_id, mobile_id, d) %>%
  mutate(rank = row_number()-1) ## rank id, 0 for the first row of each run
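To see what the cumsum step does on its own, here is a minimal base-R sketch using the difftime values of user 1227605 from the question. The names x, is_break and d are only for illustration, and the NA in the first row is treated as "no break", as in the pipeline above.

```r
# difftime values for user 1227605, in created_at order (taken from the question's data)
x <- c(NA, -8, -63, -26, -11, -874)

# TRUE wherever the gap is beyond the threshold; the leading NA counts as "no break"
is_break <- !is.na(x) & x < -600

# Each TRUE bumps the running counter, so rows between two breaks share one value
d <- cumsum(is_break)
d
# 0 0 0 0 0 1
```

Rows 1 to 5 share d = 0 and the last row starts a new run with d = 1, which is why grouping by d separates the runs.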
Then, the simplest way to create the group index is to use dplyr::group_indices:
dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)
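As far as I know, passing grouping columns to group_indices() is deprecated from dplyr 1.0.0 onwards. A possible alternative, sketched here under that assumption, is cur_group_id() inside mutate(). The data frame toy is a made-up stand-in for dat2 with the same column roles, not the question's data.

```r
library(dplyr)

# Hypothetical stand-in for dat2: two users, with d as the break counter
toy <- data.frame(
  user_id   = c(1, 1, 1, 2, 2),
  mobile_id = c(10L, 10L, 10L, 20L, 20L),
  d         = c(0, 0, 1, 0, 1)
)

toy <- toy %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(group = cur_group_id()) %>%  # one integer id per (user_id, mobile_id, d) combination
  ungroup()

toy$group
# 1 1 2 3 4
```

The ids follow the sorted order of the grouping keys, so they should line up with what group_indices gives for the same three columns.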
I'm not sure why you want the first instance of each index set to NA, but that can be done based on the value of rank:
> mutate(dat2, group = ifelse(rank == 0, NA, group),
+ rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups: user_id, mobile_id, d [4]
created_at user_id mobile_id status difftime group rank d
<dttm> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 2019-01-02 22:01:38 1227604. 68409 finished NA NA NA 0.
2 2019-01-03 04:08:29 1227604. 68409 finished -366. 1 1. 0.
3 2019-01-03 15:16:38 1227604. 68409 timeout -668. NA NA 1.
4 2019-01-04 00:34:40 1227604. 68409 failed -558. 2 1. 1.
5 2019-01-04 00:27:37 1227605. 68453 failed NA NA NA 0.
6 2019-01-04 00:35:56 1227605. 68453 finished -8. 3 1. 0.
7 2019-01-04 01:39:52 1227605. 68453 finished -63. 3 2. 0.
8 2019-01-04 02:05:53 1227605. 68453 timeout -26. 3 3. 0.
9 2019-01-04 02:17:17 1227605. 68453 timeout -11. 3 4. 0.
10 2019-01-04 16:51:39 1227605. 68453 timeout -874. NA NA 1.