Group by consecutive values in R
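I have a dataset from a support ticketing system that logs every click an agent makes when triaging and responding to customer requests. The system assigns a new hist_id to each click, but an agent will click on several fields, triggering multiple rows in the table for what they would consider a single "interaction".
My goal is to calculate the handle time for each interaction by taking the difference between the first and last modify_time values in each group.
I'm currently stuck because agents interact with a case multiple times throughout the day.
Here is a sample data frame: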
hist_id <- c(1234, 2345, 3456, 4567, 5678, 6789, 7890)
case_id <- c(1, 1, 1, 1, 1, 1, 1)
agent_name <- c("John", "John", "John", "Paul", "Paul", "John", "John")
modify_time <- as.POSIXct(c(1510095120, 1510095180, 1510095240, 1510098600, 1510098720, 1510135200, 1510135320), origin = "1970-01-01")
df <- data.frame(hist_id, case_id, agent_name, modify_time)
Using a group by on case_id and agent_name groups all matching rows together, as expected:
library(dplyr)

df %>%
  group_by(case_id, agent_name) %>%
  mutate(first = first(modify_time),
         last = last(modify_time),
         diff = min(difftime(last, first)))
Which gives me this:
# A tibble: 7 x 7
# Groups:   case_id, agent_name [2]
  hist_id case_id agent_name modify_time         first               last                diff
    <dbl>   <dbl> <fctr>     <dttm>              <dttm>              <dttm>              <time>
1    1234       1 John       2017-11-07 16:52:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
2    2345       1 John       2017-11-07 16:53:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
3    3456       1 John       2017-11-07 16:54:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
4    4567       1 Paul       2017-11-07 17:50:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
5    5678       1 Paul       2017-11-07 17:52:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
6    6789       1 John       2017-11-08 04:00:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
7    7890       1 John       2017-11-08 04:02:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
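For John, this returns the first and last modify_time across all of his rows, which lumps his two separate sessions into one 40200-second interaction. What I need instead is to group consecutive matches on case_id and agent_name, so that Paul's interaction in between is taken into account. There are three interactions logged here: one from John, one from Paul, and another from John.
The desired output would look like this: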
# A tibble: 7 x 7
# Groups:   case_id, agent_name [2]
  hist_id case_id agent_name modify_time         first               last                diff
    <dbl>   <dbl> <fctr>     <dttm>              <dttm>              <dttm>              <time>
1    1234       1 John       2017-11-07 16:52:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
2    2345       1 John       2017-11-07 16:53:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
3    3456       1 John       2017-11-07 16:54:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
4    4567       1 Paul       2017-11-07 17:50:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
5    5678       1 Paul       2017-11-07 17:52:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
6    6789       1 John       2017-11-08 04:00:00 2017-11-08 04:00:00 2017-11-08 04:02:00 120 secs
7    7890       1 John       2017-11-08 04:02:00 2017-11-08 04:00:00 2017-11-08 04:02:00 120 secs
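Here is a tidyverse approach that partitions the groups by a processing cluster identity (prcl_id) in addition to case_id and agent_name:
Arrange all clicks in order by hist_id. Each time the hist_id sequence encounters a transition to a new agent_name, generate a new id flag. Take the cumsum of these flags within each case_id and agent_name to get a prcl_id for each of that agent's processing blocks on the case. prcl_id on its own is not unique across agents, so it identifies an interaction only together with case_id and agent_name. Using all three ids, you can run the mutations of your choice within the desired partition.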
library(dplyr)

df %>%
  arrange(hist_id) %>%  # to ensure there are no wrinkles in the click order
  mutate(ag_chg_flg = ifelse(lag(agent_name) != agent_name, 1, 0) %>%
           coalesce(0)  # the first row has no predecessor (NA), so treat it as "no change"
  ) %>%
  group_by(case_id, agent_name) %>%
  mutate(prcl_id = cumsum(ag_chg_flg) + 1) %>%  # generate the processing cluster id (starting at 1)
  group_by(case_id, agent_name, prcl_id) %>%    # group by the complete composite id
  mutate(first = first(modify_time),
         last = last(modify_time),
         diff = min(difftime(last, first))
  )
Which gives you:
# A tibble: 7 x 9
# Groups:   case_id, agent_name, prcl_id [3]
  hist_id case_id agent_name modify_time         ag_chg_flg prcl_id first               last                diff
    <dbl>   <dbl> <fctr>     <dttm>                   <dbl>   <dbl> <dttm>              <dttm>              <time>
1    1234       1 John       2017-11-07 14:52:00          0       1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
2    2345       1 John       2017-11-07 14:53:00          0       1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
3    3456       1 John       2017-11-07 14:54:00          0       1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
4    4567       1 Paul       2017-11-07 15:50:00          1       2 2017-11-07 15:50:00 2017-11-07 15:52:00 2 mins
5    5678       1 Paul       2017-11-07 15:52:00          0       2 2017-11-07 15:50:00 2017-11-07 15:52:00 2 mins
6    6789       1 John       2017-11-08 02:00:00          1       2 2017-11-08 02:00:00 2017-11-08 02:02:00 2 mins
7    7890       1 John       2017-11-08 02:02:00          0       2 2017-11-08 02:00:00 2017-11-08 02:02:00 2 mins
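For comparison, the same "consecutive run" idea can be sketched in base R with rle(), which yields a single id per interaction instead of a composite key. This is only a minimal sketch under assumptions, not part of the approach above: it assumes that sorting by hist_id puts the clicks in chronological order, and the names key, run_id and handle_secs are made up for illustration.

```r
# Sample data from the question (time zone fixed to UTC for reproducibility).
df <- data.frame(
  hist_id = c(1234, 2345, 3456, 4567, 5678, 6789, 7890),
  case_id = c(1, 1, 1, 1, 1, 1, 1),
  agent_name = c("John", "John", "John", "Paul", "Paul", "John", "John"),
  modify_time = as.POSIXct(
    c(1510095120, 1510095180, 1510095240, 1510098600,
      1510098720, 1510135200, 1510135320),
    origin = "1970-01-01", tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Assumption: hist_id order is click order.
df <- df[order(df$hist_id), ]

# One key per row; a "run" is a stretch of identical consecutive keys.
# (Hypothetical helper column; "|" is assumed not to occur in agent names.)
key <- paste(df$case_id, df$agent_name, sep = "|")
runs <- rle(key)

# Expand the run lengths into one id per row: 1 1 1 2 2 3 3 for this data.
df$run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# Handle time per run in seconds: last timestamp minus first timestamp.
secs <- as.numeric(df$modify_time)
df$handle_secs <- ave(secs, df$run_id, FUN = function(x) x[length(x)] - x[1])

print(df[, c("hist_id", "agent_name", "run_id", "handle_secs")])
```

On the sample data every run spans 120 seconds, matching the desired output in the question. Because run_id increases across the whole table, it can be used as the only grouping column in a later group_by() or aggregate() call.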