Identify consecutive duplicates in R using accumulate

Let me share an example of what I'm trying to do, since the title may not be as clear as I'd like it to be.

library(dplyr)  # provides tibble(), mutate(), case_when(), lag() and %>%

data <- tibble(week=1:10,name=c(rep("Joe",10)),value=c(.9,.89,.99,.98,.87,.89,.93,.92,.98,.9),
               wanted = c("Yes","Skip","No","No","Yes","Skip","Yes","Skip","No","Yes"))

data <- data %>% mutate(my_attempt = case_when( week-lag(week)==1 & 
                                          value < .95 &
                                          lag(value) < .95 &
                                          lag(value,2) >= .95 &
                                          !is.na(lag(value,2))~ "Skip",
                                        week-lag(week)==1 & 
                                          value < .95 &
                                          lag(value) < .95 &
                                          is.na(lag(value,2))~ "Skip",
                                        value < .95 ~"Yes",
                                        TRUE ~ "No"))    

 #   week name  value wanted my_attempt
 #  <int> <chr> <dbl> <chr>  <chr>     
 #     1 Joe    0.9  Yes    Yes       
 #     2 Joe    0.89 Skip   Skip      
 #     3 Joe    0.99 No     No        
 #     4 Joe    0.98 No     No        
 #     5 Joe    0.87 Yes    Yes       
 #     6 Joe    0.89 Skip   Skip      
 #     7 Joe    0.93 Yes    Yes       
 #     8 Joe    0.92 Skip   Yes       
 #     9 Joe    0.98 No     No        
 #    10 Joe    0.9  Yes    Yes    

I am trying to get the my_attempt column to reproduce the wanted column. I want to flag rows where value is below a threshold (0.95 here), but there can't be two consecutive "Yes" values: a low value that directly follows a "Yes" should be "Skip". My attempt works until it sees 4 or more low values in a row (see week 8 above). In my real data some weeks may be missing, but a missing week can be treated as a "No". For example, if week 6 were missing, it would still be fine for week 7 to be "Yes" (I think the first line in my case_when takes care of this). Is there a way to do this in R? It doesn't have to be dplyr, but it would be nice if it's possible within the tidyverse.

I think you can use purrr::accumulate() here:

library(purrr)
library(dplyr)

data%>%mutate(my_attempt = ifelse(week-lag(week, default = 0)==1 & 
                                          value < .95,
                                  'Yes', 'No')%>%
        accumulate(~ifelse(.x==.y & .y=='Yes', 'Skip', .y)))

# A tibble: 10 x 5
    week name  value wanted my_attempt
   <int> <chr> <dbl> <chr>  <chr>     
 1     1 Joe    0.9  Yes    Yes       
 2     2 Joe    0.89 Skip   Skip      
 3     3 Joe    0.99 No     No        
 4     4 Joe    0.98 No     No        
 5     5 Joe    0.87 Yes    Yes       
 6     6 Joe    0.89 Skip   Skip      
 7     7 Joe    0.93 Yes    Yes       
 8     8 Joe    0.92 Skip   Skip      
 9     9 Joe    0.98 No     No        
10    10 Joe    0.9  Yes    Yes 
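
To see why this works, it can help to run the accumulate() step on its own. The sketch below assumes a plain character vector, flags (a hypothetical name), holding the Yes/No values that the ifelse() call produces for the ten example weeks. Inside the formula, .x is the label already emitted for the previous row and .y is the current flag, so a "Yes" directly after a "Yes" becomes "Skip", while a "Yes" after a "Skip" stays "Yes".

```r
library(purrr)

# Hypothetical input: the Yes/No flags for the ten example weeks
flags <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes")

# .x = label emitted for the previous row, .y = flag of the current row
labels <- accumulate(flags, ~ ifelse(.x == .y & .y == "Yes", "Skip", .y))

labels
# "Yes" "Skip" "No" "No" "Yes" "Skip" "Yes" "Skip" "No" "Yes"
```

One caveat, as far as I can tell: because the ifelse() flag requires week - lag(week) == 1, a low value that comes right after a missing week is flagged "No" by this approach rather than "Yes".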

Here is a simple dplyr solution:

library(dplyr)

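data %>%
  # start a new group whenever a week does not directly follow the previous row;
  # "+ 1" only makes the group ids start at 1
  mutate(grp = cumsum(week - lag(week, default = 0) != 1) + 1) %>%
  group_by(name, grp) %>%
  mutate(my_attempt = ifelse(value < 0.95 & lag(value, default = 1) < 0.95, "Skip", 
                             ifelse(value < 0.95 & lag(value, default = 1) >= 0.95, 
                                    "Yes", "No")))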

# A tibble: 9 x 6
# Groups:   name, grp [2]
   week name  value wanted   grp my_attempt
  <int> <chr> <dbl> <chr>  <dbl> <chr>     
1     1 Joe    0.9  Yes        1 Yes       
2     2 Joe    0.89 Skip       1 Skip      
3     3 Joe    0.99 No         1 No        
4     4 Joe    0.98 No         1 No        
5     5 Joe    0.87 Yes        1 Yes       
6     7 Joe    0.93 Yes        2 Yes       
7     8 Joe    0.92 Skip       2 Skip      
8     9 Joe    0.98 No         2 No        
9    10 Joe    0.9  Yes        2 Yes 

And here is how you could do it with base::Reduce on a data set with missing week values. I first create a grouping based on the difference between consecutive week values and then split the data set on that grouping variable. After that I apply the function to every chunk and bind the results with rbind:

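# run id: increases by one whenever a week does not directly follow the previous row
grp <- cumsum(c(TRUE, diff(data$week) != 1))

# split() takes several grouping factors as a list; drop = TRUE skips empty combinations
do.call(rbind, lapply(split(data, list(grp, data$name), drop = TRUE), 
                      \(x){  # the \(x) lambda syntax needs R >= 4.1
                        x$my_attept <- Reduce(function(a, b) {
                          # a = label of the previous row, b = index of the current row
                          if(x$value[b] < 0.95 && a != "Yes") {
                            "Yes"
                          } else if(x$value[b] < 0.95 && a == "Yes") {
                            "Skip"
                          } else {
                            "No"
                          }
                        }, seq_len(nrow(x))[-1], init = ifelse(x$value[1] < 0.95, "Yes", "No"), accumulate = TRUE)
                        x
                      }))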

# A tibble: 9 x 5
   week name  value wanted my_attept
* <int> <chr> <dbl> <chr>  <chr>    
1     1 Joe    0.9  Yes    Yes      
2     2 Joe    0.89 Skip   Skip     
3     3 Joe    0.99 No     No       
4     4 Joe    0.98 No     No       
5     5 Joe    0.87 Yes    Yes      
6     7 Joe    0.93 Yes    Yes      
7     8 Joe    0.92 Skip   Skip     
8     9 Joe    0.98 No     No       
9    10 Joe    0.9  Yes    Yes 
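
The grouping step is what makes the missing weeks work, so here is a minimal base R sketch of it in isolation. The vectors one_gap and two_gaps and the helper run_id are hypothetical names used only for illustration, and the helper assumes a non-empty, ascending week vector. It counts the breaks with cumsum(); for comparison, the last line shows that a cumulative maximum of the week differences stops separating runs once a second gap of the same size appears.

```r
# Hypothetical week vectors: one gap, and two gaps of equal size
one_gap  <- c(1L, 2L, 3L, 4L, 5L, 7L, 8L, 9L, 10L)
two_gaps <- c(1L, 2L, 4L, 5L, 7L, 8L)

# TRUE marks the first row and every row that does not directly follow
# its predecessor; cumsum() turns those marks into run ids
run_id <- function(week) cumsum(c(TRUE, diff(week) != 1))

run_id(one_gap)   # 1 1 1 1 1 2 2 2 2
run_id(two_gaps)  # 1 1 2 2 3 3

# Cumulative maximum of the differences, for comparison:
# weeks 4-5 and 7-8 end up in the same group
cummax(two_gaps - c(0L, head(two_gaps, -1)))  # 1 1 2 2 2 2
```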

In case you have missing weeks in your data, like the modified data set below, you can use the following solution. We first group the weeks into runs of consecutive values and then apply our solution to each group:

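library(dplyr)
library(purrr)

data %>%
  # start a new group whenever a week does not directly follow the previous row;
  # "+ 1" only makes the group ids start at 1
  mutate(grp = cumsum(week - lag(week, default = 0) != 1) + 1) %>%
  group_by(name, grp) %>%
  # .x = label of the previous row, .y = value of the current row
  mutate(my_attept = accumulate(value[-1], .init = ifelse(value[1] < 0.95, "Yes", "No"),
                                ~ if(.y < 0.95 && .x != "Yes") {
                                  "Yes"
                                } else if(.y < 0.95 && .x == "Yes") {
                                  "Skip"
                                } else {
                                  "No"
                                }))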

# A tibble: 9 x 6
# Groups:   name, grp [2]
   week name  value wanted   grp my_attept
  <int> <chr> <dbl> <chr>  <dbl> <chr>    
1     1 Joe    0.9  Yes        1 Yes      
2     2 Joe    0.89 Skip       1 Skip     
3     3 Joe    0.99 No         1 No       
4     4 Joe    0.98 No         1 No       
5     5 Joe    0.87 Yes        1 Yes      
6     7 Joe    0.93 Yes        2 Yes      
7     8 Joe    0.92 Skip       2 Skip     
8     9 Joe    0.98 No         2 No       
9    10 Joe    0.9  Yes        2 Yes 
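
For readers who prefer to see the rule spelled out, the same state machine can be sketched as an explicit base R loop. This is only an illustration under stated assumptions: label_low_runs and its arguments are hypothetical names, the weeks are assumed to be sorted, and value is assumed to contain no NA. The previous label only counts when the previous row is the week directly before the current one, which is how a missing week gets treated as "No".

```r
label_low_runs <- function(week, value, threshold = 0.95) {
  out <- character(length(value))
  for (i in seq_along(value)) {
    # Label of the directly preceding week; "No" for the first row or after a gap
    prev <- if (i > 1 && week[i] - week[i - 1] == 1) out[i - 1] else "No"
    out[i] <- if (value[i] >= threshold) {
      "No"
    } else if (prev == "Yes") {
      "Skip"
    } else {
      "Yes"
    }
  }
  out
}

# Original ten-week data, with four low values in a row (weeks 5 to 8)
full <- label_low_runs(1:10, c(.9, .89, .99, .98, .87, .89, .93, .92, .98, .9))
full
# "Yes" "Skip" "No" "No" "Yes" "Skip" "Yes" "Skip" "No" "Yes"

# Modified data with week 6 missing
gapped <- label_low_runs(c(1:5, 7:10), c(.9, .89, .99, .98, .87, .93, .92, .98, .9))
gapped
# "Yes" "Skip" "No" "No" "Yes" "Yes" "Skip" "No" "Yes"
```

Both results match the wanted column of the respective data set.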

Data

data <- structure(list(week = c(1L, 2L, 3L, 4L, 5L, 7L, 8L, 9L, 10L), 
    name = c("Joe", "Joe", "Joe", "Joe", "Joe", "Joe", "Joe", 
    "Joe", "Joe"), value = c(0.9, 0.89, 0.99, 0.98, 0.87, 0.93, 
    0.92, 0.98, 0.9), wanted = c("Yes", "Skip", "No", "No", "Yes", 
    "Yes", "Skip", "No", "Yes")), row.names = c(NA, -9L), class = c("tbl_df", 
"tbl", "data.frame"))

I would have done it with a rolling-computation library like slider, which handles missing index values well. Here it is on the modified data:

library(tidyverse)
data <- tibble(week=c(1:5, 7:10),name=c(rep("Joe",9)),value=c(.9,.89,.99,.98,.87,.93,.92,.98,.9),
               wanted = c("Yes","Skip","No","No","Yes","Yes","Skip","No","Yes"))

data
#> # A tibble: 9 x 4
#>    week name  value wanted
#>   <int> <chr> <dbl> <chr> 
#> 1     1 Joe    0.9  Yes   
#> 2     2 Joe    0.89 Skip  
#> 3     3 Joe    0.99 No    
#> 4     4 Joe    0.98 No    
#> 5     5 Joe    0.87 Yes   
#> 6     7 Joe    0.93 Yes   
#> 7     8 Joe    0.92 Skip  
#> 8     9 Joe    0.98 No    
#> 9    10 Joe    0.9  Yes
library(slider)

data %>% group_by(name) %>%
  mutate(wanted2 = case_when(value < 0.95 & slide_index_lgl(.x = value, 
                                                            .i = week, 
                                                            .f = ~  any(.x < 0.95), 
                                                            .before = 1, 
                                                            .after = -1) ~ 'skip',
                             value < 0.95 ~ 'yes',
                             TRUE ~ 'no'))
#> # A tibble: 9 x 5
#> # Groups:   name [1]
#>    week name  value wanted wanted2
#>   <int> <chr> <dbl> <chr>  <chr>  
#> 1     1 Joe    0.9  Yes    yes    
#> 2     2 Joe    0.89 Skip   skip   
#> 3     3 Joe    0.99 No     no     
#> 4     4 Joe    0.98 No     no     
#> 5     5 Joe    0.87 Yes    yes    
#> 6     7 Joe    0.93 Yes    yes    
#> 7     8 Joe    0.92 Skip   skip   
#> 8     9 Joe    0.98 No     no     
#> 9    10 Joe    0.9  Yes    yes
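
The window arguments are the subtle part of the slider call. As a small sketch (the vectors below are made up for illustration), .before = 1 together with .after = -1 makes the window for week w cover only index w - 1, so each element sees just the value from the week directly before it, and nothing at all when that week is missing:

```r
library(slider)

# Hypothetical toy data: week 4 is missing
week  <- c(1L, 2L, 3L, 5L)
value <- c(0.90, 0.89, 0.99, 0.93)

# Window bounds are [week - 1, week - 1]: only the directly preceding week
prev_values <- slide_index(value, week, ~ .x, .before = 1, .after = -1)

# Number of values in each window; 0 means no previous week is present
lengths(prev_values)
```

Because any() of an empty logical vector is FALSE, rows without a directly preceding week never match the 'skip' branch and fall through to 'yes' or 'no'.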

It can even be done without slider, i.e. in dplyr only:

library(dplyr)
data %>% group_by(name) %>%
  mutate(wanted2 = case_when(value < 0.95 & lag(value, default = 1) < 0.95 & week - 1 == lag(week, default = 0) ~ 'Skip',
                             value < 0.95 ~ 'Yes',
                             TRUE ~ 'No'))

#> # A tibble: 9 x 5
#> # Groups:   name [1]
#>    week name  value wanted wanted2
#>   <int> <chr> <dbl> <chr>  <chr>  
#> 1     1 Joe    0.9  Yes    Yes    
#> 2     2 Joe    0.89 Skip   Skip   
#> 3     3 Joe    0.99 No     No     
#> 4     4 Joe    0.98 No     No     
#> 5     5 Joe    0.87 Yes    Yes    
#> 6     7 Joe    0.93 Yes    Yes    
#> 7     8 Joe    0.92 Skip   Skip   
#> 8     9 Joe    0.98 No     No     
#> 9    10 Joe    0.9  Yes    Yes

Created on 2021-07-25 by the reprex package (v2.0.0)
