Identify consecutive duplicates in R using accumulate
Let me share an example of what I'm trying to do, since the title may not be as clear as I'd like it to be.
library(dplyr)

data <- tibble(
  week = 1:10,
  name = rep("Joe", 10),
  value = c(.9, .89, .99, .98, .87, .89, .93, .92, .98, .9),
  wanted = c("Yes", "Skip", "No", "No", "Yes", "Skip", "Yes", "Skip", "No", "Yes")
)

# "Skip" a low value that directly follows another low value, "Yes" for any other low value
data <- data %>%
  mutate(my_attempt = case_when(
    week - lag(week) == 1 & value < .95 & lag(value) < .95 &
      lag(value, 2) >= .95 & !is.na(lag(value, 2)) ~ "Skip",
    week - lag(week) == 1 & value < .95 & lag(value) < .95 &
      is.na(lag(value, 2)) ~ "Skip",
    value < .95 ~ "Yes",
    TRUE ~ "No"
  ))
# week name value wanted my_attempt
# <int> <chr> <dbl> <chr> <chr>
# 1 Joe 0.9 Yes Yes
# 2 Joe 0.89 Skip Skip
# 3 Joe 0.99 No No
# 4 Joe 0.98 No No
# 5 Joe 0.87 Yes Yes
# 6 Joe 0.89 Skip Skip
# 7 Joe 0.93 Yes Yes
# 8 Joe 0.92 Skip Yes
# 9 Joe 0.98 No No
# 10 Joe 0.9 Yes Yes
I am trying to get the my_attempt column to reproduce the wanted column. I want to flag rows where the value is below a certain threshold, but there can't be two consecutive "Yes" values. My attempt works until it sees 4 or more low values in a row. In my real data some weeks may be missing, but a missing week can be treated as a "No". For example, if week 6 were missing it would still be okay for week 7 to be "Yes" (I think the first condition in my case_when takes care of this). Is there a way to do this in R? It doesn't have to use dplyr, but it would be nice if it's possible within the tidyverse.
I think you can use purrr::accumulate() here:
library(purrr)
library(dplyr)
data%>%mutate(my_attempt = ifelse(week-lag(week, default = 0)==1 &
value < .95,
'Yes', 'No')%>%
accumulate(~ifelse(.x==.y & .y=='Yes', 'Skip', .y)))
# A tibble: 10 x 5
week name value wanted my_attempt
<int> <chr> <dbl> <chr> <chr>
1 1 Joe 0.9 Yes Yes
2 2 Joe 0.89 Skip Skip
3 3 Joe 0.99 No No
4 4 Joe 0.98 No No
5 5 Joe 0.87 Yes Yes
6 6 Joe 0.89 Skip Skip
7 7 Joe 0.93 Yes Yes
8 8 Joe 0.92 Skip Skip
9 9 Joe 0.98 No No
10 10 Joe 0.9 Yes Yes
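To check the case that broke the original attempt, here is a minimal sketch of the same accumulate() step on a hypothetical run of four low values. The vector below is made up for illustration and the weeks are assumed to be consecutive. Inside the lambda, .x is the label produced for the previous row and .y is the raw flag for the current row:

```r
library(purrr)

# Hypothetical input: four consecutive values below the 0.95 threshold
value <- c(0.90, 0.89, 0.93, 0.92)

# Step 1: flag every low value as a candidate "Yes"
flag <- ifelse(value < 0.95, "Yes", "No")

# Step 2: walk left to right, turning a "Yes" that follows a kept "Yes" into "Skip"
result <- accumulate(flag, ~ ifelse(.x == .y & .y == "Yes", "Skip", .y))
result
# "Yes" "Skip" "Yes" "Skip"
```

Because "Skip" is not equal to "Yes", the low value after a skipped row is kept again, so the labels alternate however long the run is.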
Here is a simple dplyr solution:
library(dplyr)
data %>%
mutate(grp = cummax(week - lag(week, default = 0))) %>%
group_by(name, grp) %>%
mutate(my_attempt = ifelse(value < 0.95 & lag(value, default = 1) < 0.95, "Skip",
ifelse(value < 0.95 & lag(value, default = 1) >= 0.95,
"Yes", "No")))
# A tibble: 9 x 6
# Groups: name, grp [2]
week name value wanted grp my_attempt
<int> <chr> <dbl> <chr> <dbl> <chr>
1 1 Joe 0.9 Yes 1 Yes
2 2 Joe 0.89 Skip 1 Skip
3 3 Joe 0.99 No 1 No
4 4 Joe 0.98 No 1 No
5 5 Joe 0.87 Yes 1 Yes
6 7 Joe 0.93 Yes 2 Yes
7 8 Joe 0.92 Skip 2 Skip
8 9 Joe 0.98 No 2 No
9 10 Joe 0.9 Yes 2 Yes
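One thing to watch with the cummax() grouping: it starts a new group only when a gap is larger than every gap seen so far, so a second gap of the same size is not split off. A possible alternative, sketched here in base R on made-up week numbers, is to count the gaps with cumsum(). The names grp_cummax and grp_cumsum are only for this illustration:

```r
# Hypothetical weeks with two gaps of the same size (4 and 7 are missing)
week <- c(1, 2, 3, 5, 6, 8, 9)

# Grouping as in the solution above: running maximum of the week differences
grp_cummax <- cummax(week - c(0, head(week, -1)))

# Alternative: start a new group at every row that does not follow its predecessor by 1
grp_cumsum <- cumsum(c(TRUE, diff(week) != 1))

data.frame(week, grp_cummax, grp_cumsum)
```

With a single gap, as in the data set used here, both groupings split the rows the same way.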
And here is how you could do it with base::Reduce on a data set with missing week values. I first created a grouping variable grp based on the difference between consecutive week values and then split the data set on it. After that I applied the function to every chunk and bound the results together with rbind:
do.call(rbind, lapply(split(data, cummax(abs(data$week - c(0, data$week[-nrow(data)]))), data$name),
\(x){
x$my_attept <- Reduce(function(a, b) {
if(x$value[b] < 0.95 & a != "Yes") {
"Yes"
} else if(x$value[b] < 0.95 & a == "Yes") {
"Skip"
} else {
"No"
}
}, 2:nrow(x), init = ifelse(x$value[1] < 0.95, "Yes", "No"), accumulate = TRUE)
x
}))
# A tibble: 9 x 5
week name value wanted my_attept
* <int> <chr> <dbl> <chr> <chr>
1 1 Joe 0.9 Yes Yes
2 2 Joe 0.89 Skip Skip
3 3 Joe 0.99 No No
4 4 Joe 0.98 No No
5 5 Joe 0.87 Yes Yes
6 7 Joe 0.93 Yes Yes
7 8 Joe 0.92 Skip Skip
8 9 Joe 0.98 No No
9 10 Joe 0.9 Yes Yes
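The same state transition can be sketched as a small standalone helper that folds over the values themselves rather than over row indices. The names label_run and threshold are hypothetical and not part of the answer above. Iterating over value[-1] also sidesteps 2:nrow(x) counting backwards when a chunk has only one row:

```r
# Hypothetical helper: label one chunk of consecutive weeks
label_run <- function(value, threshold = 0.95) {
  # prev is the label given to the previous row, v is the current value
  step <- function(prev, v) {
    if (v >= threshold) "No" else if (prev == "Yes") "Skip" else "Yes"
  }
  first <- if (value[1] < threshold) "Yes" else "No"
  Reduce(step, value[-1], accumulate = TRUE, init = first)
}

week  <- c(1, 2, 3, 4, 5, 7, 8, 9, 10)
value <- c(0.9, 0.89, 0.99, 0.98, 0.87, 0.93, 0.92, 0.98, 0.9)

# Same grouping as above, then label each chunk and restore the original order
grp <- cummax(week - c(0, head(week, -1)))
labels <- unsplit(lapply(split(value, grp), label_run), grp)
labels
```

On this data the result should match the wanted column of the modified data set.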
In case you have missing weeks in your data, as in the modified data set here, you can use the following solution. We first group the weeks into runs of consecutive values and then apply our solution to each group:
data %>%
mutate(grp = cummax(week - lag(week, default = 0))) %>%
group_by(name, grp) %>%
mutate(my_attept = accumulate(value[-1], .init = ifelse(value[1] < 0.95, "Yes", "No"),
~ if(.y < 0.95 & .x != "Yes") {
"Yes"
} else if(.y < 0.95 & .x == "Yes") {
"Skip"
} else {
"No"
}))
# A tibble: 9 x 6
# Groups: grp [2]
week name value wanted grp my_attept
<int> <chr> <dbl> <chr> <dbl> <chr>
1 1 Joe 0.9 Yes 1 Yes
2 2 Joe 0.89 Skip 1 Skip
3 3 Joe 0.99 No 1 No
4 4 Joe 0.98 No 1 No
5 5 Joe 0.87 Yes 1 Yes
6 7 Joe 0.93 Yes 2 Yes
7 8 Joe 0.92 Skip 2 Skip
8 9 Joe 0.98 No 2 No
9 10 Joe 0.9 Yes 2 Yes
Data
structure(list(week = c(1L, 2L, 3L, 4L, 5L, 7L, 8L, 9L, 10L),
name = c("Joe", "Joe", "Joe", "Joe", "Joe", "Joe", "Joe",
"Joe", "Joe"), value = c(0.9, 0.89, 0.99, 0.98, 0.87, 0.93,
0.92, 0.98, 0.9), wanted = c("Yes", "Skip", "No", "No", "Yes",
"Yes", "Skip", "No", "Yes")), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
I would do this with a rolling-computation library such as slider, which handles missing index values well. Demonstrating on the modified data:
library(tidyverse)
data <- tibble(week=c(1:5, 7:10),name=c(rep("Joe",9)),value=c(.9,.89,.99,.98,.87,.93,.92,.98,.9),
wanted = c("Yes","Skip","No","No","Yes","Yes","Skip","No","Yes"))
data
#> # A tibble: 9 x 4
#> week name value wanted
#> <int> <chr> <dbl> <chr>
#> 1 1 Joe 0.9 Yes
#> 2 2 Joe 0.89 Skip
#> 3 3 Joe 0.99 No
#> 4 4 Joe 0.98 No
#> 5 5 Joe 0.87 Yes
#> 6 7 Joe 0.93 Yes
#> 7 8 Joe 0.92 Skip
#> 8 9 Joe 0.98 No
#> 9 10 Joe 0.9 Yes
library(slider)
data %>% group_by(name) %>%
mutate(wanted2 = case_when(value < 0.95 & slide_index_lgl(.x = value,
.i = week,
.f = ~ any(.x < 0.95),
.before = 1,
.after = -1) ~ 'skip',
value < 0.95 ~ 'yes',
TRUE ~ 'no'))
#> # A tibble: 9 x 5
#> # Groups: name [1]
#> week name value wanted wanted2
#> <int> <chr> <dbl> <chr> <chr>
#> 1 1 Joe 0.9 Yes yes
#> 2 2 Joe 0.89 Skip skip
#> 3 3 Joe 0.99 No no
#> 4 4 Joe 0.98 No no
#> 5 5 Joe 0.87 Yes yes
#> 6 7 Joe 0.93 Yes yes
#> 7 8 Joe 0.92 Skip skip
#> 8 9 Joe 0.98 No no
#> 9 10 Joe 0.9 Yes yes
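With .before = 1 and .after = -1, the window for each row should contain only the observation whose index is exactly week - 1, so a missing week gives an empty window and any() returns FALSE. The sketch below emulates that lookup in base R to make the window contents visible. It is an illustration of the idea, not slider's implementation, and prev_window and prev_low are hypothetical names:

```r
week  <- c(1, 2, 3, 4, 5, 7, 8, 9, 10)
value <- c(0.9, 0.89, 0.99, 0.98, 0.87, 0.93, 0.92, 0.98, 0.9)

# For each row, collect the values observed exactly one week earlier (possibly none)
prev_window <- lapply(week, function(w) value[week == w - 1])

# any() over an empty window is FALSE, so rows after a gap are never marked as following a low value
prev_low <- vapply(prev_window, function(x) any(x < 0.95), logical(1))

data.frame(week, value, prev_low)
```

Week 7 gets prev_low = FALSE because week 6 is absent, which is why it is labelled "yes" rather than "skip".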
It can even be done without slider, i.e. in dplyr only:
library(dplyr)
data %>% group_by(name) %>%
mutate(wanted2 = case_when(value < 0.95 & lag(value, default = 1) < 0.95 & week - 1 == lag(week, default = 0) ~ 'Skip',
value < 0.95 ~ 'Yes',
TRUE ~ 'No'))
#> # A tibble: 9 x 5
#> # Groups: name [1]
#> week name value wanted wanted2
#> <int> <chr> <dbl> <chr> <chr>
#> 1 1 Joe 0.9 Yes Yes
#> 2 2 Joe 0.89 Skip Skip
#> 3 3 Joe 0.99 No No
#> 4 4 Joe 0.98 No No
#> 5 5 Joe 0.87 Yes Yes
#> 6 7 Joe 0.93 Yes Yes
#> 7 8 Joe 0.92 Skip Skip
#> 8 9 Joe 0.98 No No
#> 9 10 Joe 0.9 Yes Yes
Created on 2021-07-25 by the reprex package (v2.0.0)
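Note that a rule which looks only at the previous row labels every low value after the first one in a run as "Skip", so a run of three or more low values may come out as Yes, Skip, Skip instead of alternating as the question asks. The sketch below, on made-up data, compares that rule with one possible vectorised alternative: number the rows within each run of consecutive low weeks and keep the odd positions. The names run_id, pos, lag_rule and parity_rule are illustrative only:

```r
# Hypothetical data: four low values in a row, then a missing week 5
week  <- c(1, 2, 3, 4, 6, 7, 8)
value <- c(0.90, 0.89, 0.93, 0.92, 0.93, 0.98, 0.90)
low   <- value < 0.95

# Rule based on the previous row only (same logic as the case_when above)
prev_low    <- c(FALSE, head(low, -1))
consecutive <- week - c(0, head(week, -1)) == 1
lag_rule <- ifelse(low & prev_low & consecutive, "Skip", ifelse(low, "Yes", "No"))

# Alternative: a new run starts whenever low/high flips or a week is missing
run_id <- cumsum(c(TRUE, diff(low) != 0 | diff(week) != 1))
pos    <- ave(seq_along(low), run_id, FUN = seq_along)
parity_rule <- ifelse(!low, "No", ifelse(pos %% 2 == 1, "Yes", "Skip"))

data.frame(week, value, lag_rule, parity_rule)
```

The two rules agree whenever no run of low values is longer than two rows, which is the case for the example data in this thread.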