Grouping if consecutive values meet condition
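Using the table dat below, my goal is to group by user_id and mobile_id, but only within runs of consecutive values where difftime > -600. A run must be consecutive in created_at, and each row in it should get a rank. Each separate run is assigned an incrementing group value, for example: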
> dat
created_at user_id mobile_id status difftime
1 2019-01-02 22:01:38 1227604 68409 finished \\N
2 2019-01-03 04:08:29 1227604 68409 finished -366
3 2019-01-03 15:16:38 1227604 68409 timeout -668
4 2019-01-04 00:34:40 1227604 68409 failed -558
5 2019-01-04 00:27:37 1227605 68453 failed \\N
6 2019-01-04 00:35:56 1227605 68453 finished -8
7 2019-01-04 01:39:52 1227605 68453 finished -63
8 2019-01-04 02:05:53 1227605 68453 timeout -26
9 2019-01-04 02:17:17 1227605 68453 timeout -11
10 2019-01-04 16:51:39 1227605 68453 timeout -874
which would produce the output:
> output
created_at user_id mobile_id status difftime group rank
1 2019-01-02 22:01:38 1227604 68409 finished \\N NA NA
2 2019-01-03 04:08:29 1227604 68409 finished -366 1 1
3 2019-01-03 15:16:38 1227604 68409 timeout -668 NA NA
4 2019-01-04 00:34:40 1227604 68409 failed -558 2 1
5 2019-01-04 00:27:37 1227605 68453 failed \\N NA NA
6 2019-01-04 00:35:56 1227605 68453 finished -8 3 1
7 2019-01-04 01:39:52 1227605 68453 finished -63 3 2
8 2019-01-04 02:05:53 1227605 68453 timeout -26 3 3
9 2019-01-04 02:17:17 1227605 68453 timeout -11 3 4
10 2019-01-04 16:51:39 1227605 68453 timeout -874 NA NA
Beyond a simple grouping in dplyr, I'm not sure where to start. How can I assign a group and a rank?
dat %>%
group_by(user_id, mobile_id) %>%
arrange(created_at) %>%
filter(difftime > -600)
Data:
> dput(dat)
structure(list(created_at = structure(c(1546466498.138, 1546488509.218,
1546528598.628, 1546562080.81, 1546561657.567, 1546562156.632,
1546565992.788, 1546567553.811, 1546568237.325, 1546620699.964
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), user_id = c(1227604,
1227604, 1227604, 1227604, 1227605, 1227605, 1227605, 1227605,
1227605, 1227605), mobile_id = c(68409L, 68409L, 68409L, 68409L,
68453L, 68453L, 68453L, 68453L, 68453L, 68453L), status = c("finished",
"finished", "timeout", "failed", "failed", "finished", "finished",
"timeout", "timeout", "timeout"), difftime = c(NA, -366, -668,
-558, NA, -8, -63, -26, -11, -874), group = c(NA, 1, NA, 2, NA,
3, 3, 3, 3, NA), rank = c(NA, 1, NA, 1, NA, 1, 2, 3, 4, NA)), row.names = c(NA,
-10L), class = "data.frame")
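You can use cumsum to define a variable that increments whenever an observation breaks the run, that is, whenever its difftime falls below -600 within the same user_id and mobile_id group ordered by created_at. Grouping by this new variable then makes the rank index easy to create: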
library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
group_by(user_id, mobile_id) %>%
arrange(created_at, .by_group = TRUE) %>% ## grouped arrange
mutate(d = cumsum(replace_na(difftime < -600, FALSE))) %>% ## run counter; NA (first row) does not start a new run
group_by(user_id, mobile_id, d) %>%
mutate(rank = row_number()-1) ## 0-based position within each run
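To make the cumsum step concrete, here is a minimal base R sketch on the difftime values of a single user/device pair, taken from the rows for user 1227605 above. The names gaps, breaks, run_id and pos are illustrative only and are not part of the answer's code.

```r
# difftime values for one user/device pair, already ordered by created_at
gaps <- c(NA, -8, -63, -26, -11, -874)

# TRUE where the gap breaks the run; the leading NA counts as "no break"
breaks <- !is.na(gaps) & gaps < -600

# Running count of breaks: constant within a run, +1 at every break
run_id <- cumsum(breaks)
print(run_id)
# [1] 0 0 0 0 0 1

# 0-based position within each run (the same idea as row_number() - 1)
pos <- ave(seq_along(run_id), run_id, FUN = seq_along) - 1
print(pos)
# [1] 0 1 2 3 4 0
```

The row that breaks a run gets position 0, which is what later allows it to be blanked out as NA.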
Then the simplest way to create the group index is dplyr::group_indices:
dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)
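As far as I know, passing column names directly to group_indices() was deprecated in dplyr 1.0.0. On such versions a possible alternative is to group first and call cur_group_id(). The sketch below assumes dplyr >= 1.0.0 and uses a small made-up tibble, toy, as a stand-in for dat2.

```r
library(dplyr)

# Hypothetical stand-in for dat2: grouping keys plus the run counter d
toy <- tibble(
  user_id   = c(1, 1, 1, 2, 2),
  mobile_id = c(10L, 10L, 10L, 20L, 20L),
  d         = c(0L, 0L, 1L, 0L, 1L)
)

# Group on the same three keys, then number the groups in key order
toy <- toy %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(group = cur_group_id()) %>%
  ungroup()

print(toy$group)
# [1] 1 1 2 3 4
```

The same group_by(...) %>% mutate(group = cur_group_id()) step should apply to dat2 unchanged, since it is already grouped by user_id, mobile_id and d.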
I'm not sure why you want the first instance of each index set to NA, but that can be done based on the value of rank:
> mutate(dat2, group = ifelse(rank == 0, NA, group),
+ rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups: user_id, mobile_id, d [4]
created_at user_id mobile_id status difftime group rank d
<dttm> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 2019-01-02 22:01:38 1227604. 68409 finished NA NA NA 0.
2 2019-01-03 04:08:29 1227604. 68409 finished -366. 1 1. 0.
3 2019-01-03 15:16:38 1227604. 68409 timeout -668. NA NA 1.
4 2019-01-04 00:34:40 1227604. 68409 failed -558. 2 1. 1.
5 2019-01-04 00:27:37 1227605. 68453 failed NA NA NA 0.
6 2019-01-04 00:35:56 1227605. 68453 finished -8. 3 1. 0.
7 2019-01-04 01:39:52 1227605. 68453 finished -63. 3 2. 0.
8 2019-01-04 02:05:53 1227605. 68453 timeout -26. 3 3. 0.
9 2019-01-04 02:17:17 1227605. 68453 timeout -11. 3 4. 0.
10 2019-01-04 16:51:39 1227605. 68453 timeout -874. NA NA 1.