Group by consecutive values in R
I've got a dataset coming from a support ticketing system that logs each click made by an agent in classifying and responding to customer requests. The system assigns a new hist_id to each click, but an agent will click several fields, triggering several rows in the table, in what they consider a single "interaction".
My goal is to calculate a handle time for each of these interactions by taking the difference between the first and last modify_time values in each group.
I'm currently stuck because an agent will have multiple interactions with a case throughout the day.
Here's a sample dataframe:
hist_id <- c(1234, 2345, 3456, 4567, 5678, 6789, 7890)
case_id <- c(1, 1, 1, 1, 1, 1, 1)
agent_name <- c("John", "John", "John", "Paul", "Paul", "John", "John")
modify_time <- as.POSIXct(c(1510095120, 1510095180, 1510095240, 1510098600, 1510098720, 1510135200, 1510135320), origin = "1970-01-01")
df <- data.frame(hist_id, case_id, agent_name, modify_time)
Using group_by on case_id and agent_name groups all rows that match the criteria, as expected:
library(dplyr)
df %>% group_by(case_id, agent_name) %>% mutate(first = first(modify_time), last = last(modify_time), diff = min(difftime(last, first)))
Which gives me this:
# A tibble: 7 x 7
# Groups: case_id, agent_name [2]
hist_id case_id agent_name modify_time first last diff
<dbl> <dbl> <fctr> <dttm> <dttm> <dttm> <time>
1 1234 1 John 2017-11-07 16:52:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
2 2345 1 John 2017-11-07 16:53:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
3 3456 1 John 2017-11-07 16:54:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
4 4567 1 Paul 2017-11-07 17:50:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
5 5678 1 Paul 2017-11-07 17:52:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
6 6789 1 John 2017-11-08 04:00:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
7 7890 1 John 2017-11-08 04:02:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
This returns John's overall first and last modify_time values, spanning both of his sessions. However, I need to group the consecutive matches of case_id and agent_name, so that Paul's interaction in between is taken into account. So three interactions are recorded here: one from John, one from Paul, and a second by John.
Desired output would be something like this:
# A tibble: 7 x 7
# Groups: case_id, agent_name [2]
hist_id case_id agent_name modify_time first last diff
<dbl> <dbl> <fctr> <dttm> <dttm> <dttm> <time>
1 1234 1 John 2017-11-07 16:52:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
2 2345 1 John 2017-11-07 16:53:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
3 3456 1 John 2017-11-07 16:54:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
4 4567 1 Paul 2017-11-07 17:50:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
5 5678 1 Paul 2017-11-07 17:52:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
6 6789 1 John 2017-11-08 04:00:00 2017-11-08 04:00:00 2017-11-08 04:02:00 120 secs
7 7890 1 John 2017-11-08 04:02:00 2017-11-08 04:00:00 2017-11-08 04:02:00 120 secs
Here is a tidyverse approach that partitions the groups by a processing cluster identity, as well as case_id and agent_name:
With all the clicks arranged in sequence, generate a flag each time the hist_id sequence encounters a transition to a new agent_name. cumsum those flags to generate a prcl_id that is unique per case, per agent, per processing cluster. With all three ids you can then run your chosen mutations within the desired partitions.
library(dplyr)

df %>%
  arrange(hist_id) %>% # to ensure there are no wrinkles
  mutate(ag_chg_flg = ifelse(lag(agent_name) != agent_name, 1, 0) %>%
           coalesce(0) # set the first row's flag to 0 (lag() yields NA there)
  ) %>%
  group_by(case_id, agent_name) %>%
  mutate(prcl_id = cumsum(ag_chg_flg) + 1) %>% # generate the prcl_id (starting at 1)
  group_by(case_id, agent_name, prcl_id) %>% # group by the complete composite id
  mutate(first = first(modify_time),
         last = last(modify_time),
         diff = min(difftime(last, first))
  )
Which gets you:
# A tibble: 7 x 9
# Groups: case_id, agent_name, prcl_id [3]
hist_id case_id agent_name modify_time ag_chg_flg prcl_id first last diff
<dbl> <dbl> <fctr> <dttm> <dbl> <dbl> <dttm> <dttm> <time>
1 1234 1 John 2017-11-07 14:52:00 0 1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
2 2345 1 John 2017-11-07 14:53:00 0 1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
3 3456 1 John 2017-11-07 14:54:00 0 1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
4 4567 1 Paul 2017-11-07 15:50:00 1 2 2017-11-07 15:50:00 2017-11-07 15:52:00 2 mins
5 5678 1 Paul 2017-11-07 15:52:00 0 2 2017-11-07 15:50:00 2017-11-07 15:52:00 2 mins
6 6789 1 John 2017-11-08 02:00:00 1 2 2017-11-08 02:00:00 2017-11-08 02:02:00 2 mins
7 7890 1 John 2017-11-08 02:02:00 0 2 2017-11-08 02:00:00 2017-11-08 02:02:00 2 mins
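The same flag-and-cumsum idea can also be sketched in base R, which may help show what the pipeline is doing. This is only an illustrative variant, not the method above: it starts a new run whenever either case_id or agent_name differs from the previous row, and the names run_id and handle_secs are made up for this example. The tz = "UTC" argument is an assumption added so the printed times do not depend on the local time zone.

```r
# Sample data from the question (agent_name kept as character for simplicity)
df <- data.frame(
  hist_id     = c(1234, 2345, 3456, 4567, 5678, 6789, 7890),
  case_id     = c(1, 1, 1, 1, 1, 1, 1),
  agent_name  = c("John", "John", "John", "Paul", "Paul", "John", "John"),
  modify_time = as.POSIXct(c(1510095120, 1510095180, 1510095240, 1510098600,
                             1510098720, 1510135200, 1510135320),
                           origin = "1970-01-01", tz = "UTC"),  # tz is an assumption
  stringsAsFactors = FALSE
)

# Put the clicks in sequence
df <- df[order(df$hist_id), ]
n <- nrow(df)

# TRUE wherever a row differs from the previous row in case or agent
changed <- df$case_id[-1] != df$case_id[-n] |
  df$agent_name[-1] != df$agent_name[-n]

# The first row always starts a run; cumsum turns the flags into run ids
df$run_id <- cumsum(c(TRUE, changed))  # hypothetical column name

# First and last click time (in seconds) within each run
secs       <- as.numeric(df$modify_time)
first_secs <- ave(secs, df$run_id, FUN = function(x) x[1])
last_secs  <- ave(secs, df$run_id, FUN = function(x) x[length(x)])

# Handle time per interaction, repeated on every row of the run
df$handle_secs <- last_secs - first_secs  # hypothetical column name

print(df[, c("hist_id", "agent_name", "run_id", "handle_secs")])
```

On the sample data this yields run_id values 1, 1, 1, 2, 2, 3, 3 and a handle_secs of 120 on every row. Unlike prcl_id, run_id is unique on its own, so it can be used as the only grouping key.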