Select groups with only consecutive runs of a certain value

I have data grouped by 'id', and a column 'x' that can be "yes", "no" or NA.

I want to keep only those 'id' where 'x' (1) contains two "yes", and (2) there are no "no" values between the "yes". NA between the two "yes" is fine.

Some toy data:

data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
                   x = c(NA,'yes',NA,'yes',NA,NA,NA,NA,'yes','yes',NA,'no', 'no',NA,NA,'yes',
                       'no','yes','no','yes','no', 'yes',NA, 'no','yes', 'no'))
   id    x
1   1 <NA>
2   1  yes # 1st yes
3   1 <NA>
4   1  yes # 2nd yes, only NA between, yes is considered as consecutive -> keep group 1 
5   1 <NA>
6   1 <NA>
7   2 <NA>
8   2 <NA>
9   2  yes  # 1st yes
10  2  yes  # 2nd yes, yes is consecutive -> keep group 2  
11  2 <NA>
12  3   no
13  3  yes  # 1st yes
14  3 <NA>
15  3 <NA>
16  3  yes  # 2nd yes -> keep group 3
17  4   no
18  4  yes # 1st yes
19  4   no # "no"
20  4  yes # 2nd yes. a "no" between the two 'yes' -> remove group
21  4   no
22  5  yes  # 1st yes
23  5 <NA>
24  5   no # "no"
25  5  yes # 2nd yes. a "no" between the two 'yes' -> remove group 
26  5   no

Desired output:

1   1 <NA>
2   1  yes
3   1 <NA>
4   1  yes
5   1 <NA>
6   1 <NA>
7   2 <NA>
8   2 <NA>
9   2  yes
10  2  yes
11  2 <NA>
12  3   no
13  3  yes
14  3 <NA>
15  3 <NA>
16  3  yes

id 4 and id 5 should be removed because they do not contain two consecutive "yes" values in column 'x' within the group 'id'. NA values between the two "yes" values are ignored when deciding whether they are consecutive.

I tried using:

data1<-data %>% group_by(id) %>% 
  mutate(x_lag = lag(x), 
         is_two_yes = x == 'yes' & x_lag == 'yes') %>% 
  filter(any(is_two_yes)) %>% 
  select(-is_two_yes,-x_lag) 

This relies only on lag and lead. To me it makes sense, since you're only aiming at filtering out ids where a no sits directly between two yes (once the NA rows are ignored).

library(dplyr)

# ids in which some "no" has a "yes" directly before and after it (NA rows dropped first)
uneligible <- data %>% filter(!is.na(x)) %>% group_by(id) %>% 
  mutate(prev_x = dplyr::lag(x, default = "none"),
         next_x = dplyr::lead(x, default = "none"),
         is_uneligible = any(x == "no" & prev_x == "yes" & next_x == "yes")) %>% 
  dplyr::filter(is_uneligible) %>% 
  select(id) %>% unique 

# A tibble: 2 x 1
# Groups:   id [2]
id
<dbl>
  4
  5

result <- data %>% filter(!id %in% uneligible$id)

   id    x
1   1 <NA>
2   1  yes
3   1 <NA>
4   1  yes
5   1 <NA>
6   1 <NA>
7   2 <NA>
8   2 <NA>
9   2  yes
10  2  yes
11  2 <NA>
12  3   no
13  3   no
14  3 <NA>
15  3 <NA>
16  3  yes
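
To see what the lag/lead comparison does on a single group, here is a minimal base-R sketch. It assumes the NA rows have already been dropped and uses head()/tail() as stand-ins for dplyr::lag(x, default = "none") and dplyr::lead(x, default = "none"); the names y, prev_y, next_y and sandwiched are only for illustration.

```r
# Values of group 4 from the toy data (it contains no NA)
y <- c("no", "yes", "no", "yes", "no")

# Base-R stand-ins for lag/lead with default = "none"
prev_y <- c("none", head(y, -1))
next_y <- c(tail(y, -1), "none")

# TRUE wherever a "no" has a "yes" on both sides
sandwiched <- y == "no" & prev_y == "yes" & next_y == "yes"

print(data.frame(prev_y, y, next_y, sandwiched))
any(sandwiched)   # the group is flagged when at least one such "no" exists
```

Only the third element is flagged here: the first and last "no" each have a "yes" on one side only.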

EDIT: if you want to keep only ids with at least two yes, you can use the following.

uneligible <- data %>% filter(!is.na(x)) %>% group_by(id) %>% 
  mutate(prev_x = dplyr::lag(x, default = "none"),
         next_x = dplyr::lead(x, default = "none"),
         # flagged if a "no" sits between two "yes", or if the id has fewer than two "yes"
         is_uneligible = any(x == "no" & prev_x == "yes" & next_x == "yes") | sum(x %in% "yes") < 2) %>% 
  dplyr::filter(is_uneligible) %>% dplyr::select(id) %>% unique 
result <- data %>% filter(!id %in% uneligible$id)

This will however filter out id=3 in your example, as your dput doesn't match your data.

> result
   id    x
1   1 <NA>
2   1  yes
3   1 <NA>
4   1  yes
5   1 <NA>
6   1 <NA>
7   2 <NA>
8   2 <NA>
9   2  yes
10  2  yes
11  2 <NA>
data <- data.frame(id = rep(1:5, each = 5),
                   x = c(NA, 'yes', NA, 'yes', NA,
                         NA, NA, NA, 'yes', 'yes',
                         NA, 'no', "yes", NA, 'yes', 
                         'no', 'yes', 'no', 'yes', NA, 
                         'yes', NA, 'no','yes', 'no'))

twoYes <- function(x){
  v <- integer(length(x))   # preallocate instead of growing the vector
  cum <- 0L
  for (k in seq_along(x)){
    if (!is.na(x[k])){       # NA: skip both branches and retain the current count
      if (x[k] == "yes"){
        cum <- cum + 1L      # on "yes", add 1 to the running count
      } else if (x[k] == "no"){
        cum <- 0L            # on "no", reset the count to zero
      }
    }
    v[k] <- cum
  }
  return(v)    # v > 1 means two "yes" were met with no "no" in between
}

library(dplyr)

df <- data |> 
  group_by(id) |> 
  mutate(v = twoYes(x)) |> 
  filter(v > 1)       # keeps only the rows where the count has reached 2

unique(df$id)       # id: 1, 2, 3 have two continuous "yes"

[1] 1 2 3
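
The same running count can probably be obtained without an explicit loop. The sketch below is one possible vectorised variant in base R, not part of the approach above: it starts a new block after every "no" and takes a cumulative sum of the "yes" flags inside each block. The function name run_yes is made up for illustration.

```r
# Hypothetical vectorised counterpart of the loop-based counter
run_yes <- function(x) {
  is_yes <- !is.na(x) & x == "yes"
  # every "no" opens a new block, so the count restarts there
  block <- cumsum(!is.na(x) & x == "no")
  # cumulative number of "yes" within each block; NA rows keep the current count
  ave(as.integer(is_yes), block, FUN = cumsum)
}

run_yes(c(NA, "yes", NA, "yes", NA))        # group 1 of the data above
run_yes(c("no", "yes", "no", "yes", NA))    # group 4 of the data above
```

A group qualifies when any value returned by run_yes() is greater than 1.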

We can add a variable a that encodes yes as 2, no as 1 and NA as 0, filter out the rows where x is NA, and then apply zoo::rollsum with a window of 2. If any rolling sum equals 4, the group has two consecutive yes.

library(tidyverse)

data |>
   mutate(a = case_when(x == "yes" ~ 2 , x == "no" ~ 1 , TRUE ~ 0)) |>
   group_by(id) |> filter(a != 0) |>
   mutate(b = c(first(a) , zoo::rollsum(a , 2))) |>
   summarise(keep = any(b == 4)) |>     # one row per id: TRUE if some adjacent pair sums to 4
   filter(keep) -> gk

data |> filter(id %in% gk$id)

Output:
   id    x
1   1 <NA>
2   1  yes
3   1 <NA>
4   1  yes
5   1 <NA>
6   1 <NA>
7   2 <NA>
8   2 <NA>
9   2  yes
10  2  yes
11  2 <NA>
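
If zoo is not available, a window-2 rolling sum is just each element plus its successor, so the idea can be sketched in base R as well. This is only an illustration of the 2/1/0 encoding on one group at a time; the helper name pair_sum_check is hypothetical.

```r
# Hypothetical helper: TRUE if two "yes" are adjacent once the NAs are dropped
pair_sum_check <- function(x) {
  a <- ifelse(is.na(x), 0, ifelse(x == "yes", 2, 1))   # yes = 2, no = 1, NA = 0
  a <- a[a != 0]                                       # drop the NA rows
  pair_sums <- head(a, -1) + tail(a, -1)               # rolling sum with window 2
  any(pair_sums == 4)                                  # 4 can only come from yes + yes
}

pair_sum_check(c(NA, "yes", NA, "yes", NA, NA))      # group 1
pair_sum_check(c("no", "yes", "no", "yes", "no"))    # group 4
```

With fewer than two non-NA values the pair sums are empty and the check returns FALSE.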

I want to keep only those 'id' where 'x' (1) contains two "yes", and (2) there are no "no" values between the "yes". NA between the two "yes" is fine.

It is unclear what to do in the case of more than two "yes" values. This answer assumes that only ids with exactly 2 "yes" values are allowed. If this is not the case, simply modify the transition matrix.

Note that I modified data at row 13 ("no" -> "yes") to match the output shown in the question.

First define a state transition matrix. Here the rows are the states and the columns are the values of x that trigger a transition. The initial state is row 1 (initial). The state of each id will change depending on which x value is encountered (e.g., encountering a yes while in the initial state will cause the system to transition to state 2 (1yes), etc.).

m <- matrix(
  as.integer(c(2, 4, 3, 3,     # next state on "yes"
               1, 3, 3, 4,     # next state on "no"
               1, 2, 3, 4)),   # next state on NA
  4, 3,
  dimnames = list(c("initial", "1yes", "drop", "keep"), c("yes", "no", "NA"))
)
m
#>         yes no NA
#> initial   2  1  1
#> 1yes      4  3  2
#> drop      3  3  3
#> keep      3  4  4

Use Reduce to get the final state of each id in a grouping operation with data.table, keeping only those ids that end with 4 (keep):

library(data.table)

setDT(data)[
  , idx := match(x, c("yes", "no"), 3L)
][
  , .(x = if (Reduce(function(i, j) m[i, j], idx, 1L) == 4L) x else character(0)), id
]
#>     id    x
#>  1:  1 <NA>
#>  2:  1  yes
#>  3:  1 <NA>
#>  4:  1  yes
#>  5:  1 <NA>
#>  6:  1 <NA>
#>  7:  2 <NA>
#>  8:  2 <NA>
#>  9:  2  yes
#> 10:  2  yes
#> 11:  2 <NA>
#> 12:  3   no
#> 13:  3  yes
#> 14:  3 <NA>
#> 15:  3 <NA>
#> 16:  3  yes
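
To follow how a single id moves through the states, the same matrix can be walked with Reduce(..., accumulate = TRUE). The sketch below uses base R only and is meant as an illustration of the state machine, not a replacement for the data.table code; the helper name trace_states is made up.

```r
# Same transition matrix as above: rows are states, columns are the x values
m <- matrix(as.integer(c(2, 4, 3, 3, 1, 3, 3, 4, 1, 2, 3, 4)), 4, 3,
            dimnames = list(c("initial", "1yes", "drop", "keep"),
                            c("yes", "no", "NA")))

# Hypothetical helper: state names visited for one id, starting from "initial"
trace_states <- function(x) {
  idx <- match(x, c("yes", "no"), nomatch = 3L)    # yes = 1, no = 2, NA = 3
  states <- Reduce(function(i, j) m[i, j], idx, 1L, accumulate = TRUE)
  rownames(m)[states]
}

trace_states(c(NA, "yes", NA, "yes", NA, NA))      # id 1: ends in "keep"
trace_states(c("no", "yes", "no", "yes", "no"))    # id 4: ends in "drop"
```

The first element of each trace is the starting state, so the result is one element longer than x.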
library(dplyr)
data |>
  group_by(id) |> 
  filter(any(x[!is.na(x)] == 'yes' & lag(x[!is.na(x)]) == 'yes'))

#       id x    
#    <dbl> <chr>
#  1     1 NA   
#  2     1 yes  
#  3     1 NA   
#  4     1 yes  
#  5     1 NA   
#  6     1 NA   
#  7     2 NA   
#  8     2 NA   
#  9     2 yes  
# 10     2 yes  
# 11     2 NA 

Data:

data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
                   x = c(NA,'yes',NA,'yes',NA,NA,NA,NA,'yes','yes',NA,'no', 'no',NA,NA,'yes',
                       'no','yes','no','yes','no', 'yes',NA, 'no','yes', 'no'))
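
Since the question is about runs, the same check can also be phrased with base R's rle(). This is a sketch under the assumption that NA should simply be skipped and that a run of at least two "yes" is what qualifies a group; the helper name has_yes_run and its min_len argument are hypothetical, and id is built with rep() here, so it is an integer rather than a double.

```r
# Toy data with the same values as above
data <- data.frame(
  id = rep(1:5, times = c(6, 5, 5, 5, 5)),
  x  = c(NA, "yes", NA, "yes", NA, NA,
         NA, NA, "yes", "yes", NA,
         "no", "no", NA, NA, "yes",
         "no", "yes", "no", "yes", "no",
         "yes", NA, "no", "yes", "no")
)

# Hypothetical helper: is there a run of at least min_len "yes" once NAs are dropped?
has_yes_run <- function(x, min_len = 2) {
  r <- rle(x[!is.na(x)])
  any(r$values == "yes" & r$lengths >= min_len)
}

keep <- sapply(split(data$x, data$id), has_yes_run)   # one logical per id
keep_ids <- names(keep)[keep]
result <- data[data$id %in% keep_ids, ]
keep_ids
```

On this data only ids 1 and 2 qualify, which matches the filtered output shown above.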
