Using dplyr to check values of multiple rows that meet a condition (e.g. all rows where the date column falls in a specified period)
I have a dataset of event ids, event types, and event times. Each event is either a "start" or a "pause". I would like to identify "pause" events that are not followed by a "start" event within 7 days and classify these as "stops".
Here is the code for the test dataset:
library(lubridate)  # provides dmy()

test <- data.frame(id = 1:5,
                   event = c("pause",
                             "pause",
                             "start",
                             "pause",
                             "start"),
                   time = dmy("03-11-2012",
                              "05-11-2012",
                              "06-11-2012",
                              "21-11-2012",
                              "30-11-2012"))
So far, I have used lead() to check whether the following event was a "start" event AND happened within 7 days. However, I realized that sometimes a "pause" event is followed by another "pause" event and then a "start" event, all within 7 days. In that case, neither "pause" event should be considered a stop. This means that I need to check all events/rows that occurred within 7 days of the "pause" event and look for a "start" event among them.
I am looking for a function I can use within dplyr (I'll use non-dplyr solutions if I have to) that lets me check the values of multiple rows.
My solution so far uses lead(), which checks only the immediately following row:
library(dplyr)
library(lubridate)  # for days()

test2 <- test %>%
  mutate(stop = ifelse(event == "pause" &
                         !((time + days(7) > lead(time)) &
                             lead(event) == "start"),
                       "yes",
                       "no"))
This gives:
| id | event | time       | stop |
|----|-------|------------|------|
| 1  | pause | 2012-11-03 | yes  |
| 2  | pause | 2012-11-05 | no   |
| 3  | start | 2012-11-06 | no   |
| 4  | pause | 2012-11-21 | yes  |
| 5  | start | 2012-11-30 | no   |
I would like the stop value for the first "pause" to also be "no", because a "start" event occurs within 7 days of it.
If you want to do this inside a dplyr function, you can use sapply inside a mutate:
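library(dplyr)
test %>%
  mutate(stop = sapply(seq_along(time), function(i) {
    # only "pause" events can be stops
    if (event[i] != "pause") return(FALSE)
    # later "start" events; ind[1] is the earliest only if rows are sorted by time
    ind <- which(time > time[i] & event == "start")
    # no later "start" at all: reported as FALSE here
    if (length(ind) == 0) return(FALSE)
    # a stop if the next "start" is more than 7 days away
    as.numeric(difftime(time[ind[1]], time[i], units = "days")) > 7
  }))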
#> id event time stop
#> 1 1 pause 2012-11-03 FALSE
#> 2 2 pause 2012-11-05 FALSE
#> 3 3 start 2012-11-06 FALSE
#> 4 4 pause 2012-11-21 TRUE
#> 5 5 start 2012-11-30 FALSE
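One behaviour of the sapply approach that may be worth checking against your own data: a "pause" with no later "start" at all is reported as FALSE. Whether such a trailing pause should count as a stop is not settled by the question, so the sketch below only makes the choice explicit. The helper `is_stop()`, its `no_later_start` argument, and the `test_tail` data are hypothetical names introduced for illustration; the code uses base R only and assumes rows are sorted by time.

```r
# Hypothetical data: ends with a "pause" that has no later "start"
test_tail <- data.frame(
  id = 1:3,
  event = c("pause", "start", "pause"),
  time = as.Date(c("2012-11-03", "2012-11-06", "2012-11-21"))
)

# Same per-row logic as the sapply answer, with the "no later start" case
# exposed as an argument (hypothetical helper, not part of dplyr)
is_stop <- function(event, time, no_later_start = FALSE) {
  sapply(seq_along(time), function(i) {
    if (event[i] != "pause") return(FALSE)
    ind <- which(time > time[i] & event == "start")
    if (length(ind) == 0) return(no_later_start)
    as.numeric(difftime(time[ind[1]], time[i], units = "days")) > 7
  })
}

stop_default <- is_stop(test_tail$event, test_tail$time)
stop_trailing <- is_stop(test_tail$event, test_tail$time, no_later_start = TRUE)
stop_default   # FALSE FALSE FALSE
stop_trailing  # FALSE FALSE  TRUE
```

Inside a pipeline this could be called as mutate(stop = is_stop(event, time)), under the same sorting assumption.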
Although it might get slow with a large dataset, this should do the job:
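library(dplyr)
library(purrr)
library(lubridate)  # for days()

test %>%
  mutate(
    # original attempt: only looks at the next row
    stop = ifelse(event == "pause" & !((time + days(7) > lead(time)) & lead(event) == "start"),
                  "yes", "no"),
    # for each row, is there any "start" within the following 7 days?
    stop2 = ifelse(map_lgl(row_number(),
                           ~ any(event == "start" & time >= time[.x] & time <= time[.x] + days(7))),
                   "no", "yes")
  )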
# id event time stop stop2
# 1 1 pause 2012-11-03 yes no
# 2 2 pause 2012-11-05 no no
# 3 3 start 2012-11-06 no no
# 4 4 pause 2012-11-21 yes yes
# 5 5 start 2012-11-30 no no
Using row_number() and time[.x] lets you consider every row independently. For each row, we check whether there is any "start" between "now" and "in 7 days" and set the value accordingly. purrr::map_lgl loops over every row and returns a logical vector.
The slowness comes from the fact that you have to check all the rows each time you compute the value for one row.
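If the per-row scan becomes a bottleneck, one possible alternative is to compute the date of the next "start" for every row in a single backward pass and compare it with the row's own date. This is a sketch under assumptions, not the method from the answers above: it assumes the rows can be sorted by time, it treats a "pause" with no later "start" as a stop (as stop2 does), and it does not count a "start" that shares a date with the pause but sits in an earlier row. The column names `next_start` and `stop` are chosen here for illustration, and the dates are built with as.Date() so that only dplyr is needed.

```r
library(dplyr)

test <- data.frame(
  id = 1:5,
  event = c("pause", "pause", "start", "pause", "start"),
  time = as.Date(c("2012-11-03", "2012-11-05", "2012-11-06",
                   "2012-11-21", "2012-11-30"))
)

res <- test %>%
  arrange(time) %>%
  mutate(
    # day number of each "start", Inf for every other row
    start_num = ifelse(event == "start", as.numeric(time), Inf),
    # reverse cumulative minimum = day number of the next "start" at or after this row
    next_start = rev(cummin(rev(start_num))),
    # a "pause" is a stop when the next "start" is more than 7 days away (or never comes)
    stop = event == "pause" & (next_start - as.numeric(time)) > 7
  ) %>%
  select(id, event, time, stop)

res$stop
# FALSE FALSE FALSE  TRUE FALSE
```

The dates are converted to numbers because cummin() is not defined for Date objects. After sorting, the pass itself is linear in the number of rows, whereas the map_lgl version compares every row with every other row.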