Select groups with only consecutive runs of a certain value
I have data grouped by 'id', and a column 'x' that can be "yes", "no" or NA.
I want to keep only those 'id' where 'x' (1) contains two "yes", and (2) there are no "no" values between the "yes". NA values between the two "yes" are fine.
Some toy data:
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
x = c(NA,'yes',NA,'yes',NA,NA,NA,NA,'yes','yes',NA,'no', 'no',NA,NA,'yes',
'no','yes','no','yes','no', 'yes',NA, 'no','yes', 'no'))
id x
1 1 <NA>
2 1 yes # 1st yes
3 1 <NA>
4 1 yes # 2nd yes, only NA between, yes is considered as consecutive -> keep group 1
5 1 <NA>
6 1 <NA>
7 2 <NA>
8 2 <NA>
9 2 yes # 1st yes
10 2 yes # 2nd yes, yes is consecutive -> keep group 2
11 2 <NA>
12 3 no
13 3 yes # 1st yes
14 3 <NA>
15 3 <NA>
16 3 yes # 2nd yes -> keep group 3
17 4 no
18 4 yes # 1st yes
19 4 no # "no"
20 4 yes # 2nd yes. a "no" between the two 'yes' -> remove group
21 4 no
22 5 yes # 1st yes
23 5 <NA>
24 5 no # "no"
25 5 yes # 2nd yes. a "no" between the two 'yes' -> remove group
26 5 no
Desired output:
1 1 <NA>
2 1 yes
3 1 <NA>
4 1 yes
5 1 <NA>
6 1 <NA>
7 2 <NA>
8 2 <NA>
9 2 yes
10 2 yes
11 2 <NA>
12 3 no
13 3 yes
14 3 <NA>
15 3 <NA>
16 3 yes
id 4 and id 5 should be removed because they do not meet the criterion of two consecutive "yes" values in column 'x' within a group 'id', where NA values between the two "yes" values are ignored.
I tried using:
data1<-data %>% group_by(id) %>%
mutate(x_lag = lag(x),
is_two_yes = x == 'yes' & x_lag == 'yes') %>%
filter(any(is_two_yes)) %>%
select(-is_two_yes,-x_lag)
This relies only on lag and lead. To me it makes sense, since you're only aiming at filtering out the ids where a no is preceded and followed by a yes.
uneligible <- data %>% filter(!is.na(x)) %>% group_by(id) %>%
mutate(prev_x=dplyr::lag(x, default="none"),
next_x=dplyr::lead(x, default="none"),
is_uneligible=any(x=="no"&prev_x=="yes"&next_x=="yes")) %>%
dplyr::filter(is_uneligible) %>%
select(id) %>% unique
# A tibble: 2 x 1
# Groups: id [2]
id
<dbl>
4
5
result <- data %>% filter(!id %in% uneligible$id)
id x
1 1 <NA>
2 1 yes
3 1 <NA>
4 1 yes
5 1 <NA>
6 1 <NA>
7 2 <NA>
8 2 <NA>
9 2 yes
10 2 yes
11 2 <NA>
12 3 no
13 3 no
14 3 <NA>
15 3 <NA>
16 3 yes
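The lag/lead test above can be tried on a single group without dplyr. The following is a minimal base-R sketch; the helper name no_between_yes is hypothetical, and the head/tail shifts are assumed to play the same role as lag and lead with default = "none".

```r
# Hypothetical helper: TRUE when, after dropping NA, some "no" has a "yes"
# immediately before it and a "yes" immediately after it.
no_between_yes <- function(x) {
  y <- x[!is.na(x)]
  prev_y <- c("none", head(y, -1))   # previous non-NA value
  next_y <- c(tail(y, -1), "none")   # next non-NA value
  any(y == "no" & prev_y == "yes" & next_y == "yes")
}

g1 <- c(NA, "yes", NA, "yes", NA, NA)     # id 1
g4 <- c("no", "yes", "no", "yes", "no")   # id 4
g5 <- c("yes", NA, "no", "yes", "no")     # id 5

no_between_yes(g1)   # FALSE: id 1 is not flagged
no_between_yes(g4)   # TRUE:  id 4 is flagged
no_between_yes(g5)   # TRUE:  id 5 is flagged
```

Under this sketch a run of two or more no between the yes values (for example yes, no, no, yes) is not flagged, since no single no has a yes on both sides; whether that matters depends on the real data.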
EDIT: if you want to keep only ids with at least two yes, you can use the following.
uneligible <- data %>% filter(!is.na(x)) %>% group_by(id) %>%
mutate(prev_x=dplyr::lag(x, default="none"),
next_x=dplyr::lead(x, default="none"),
is_uneligible=any(x=="no"&prev_x=="yes"&next_x=="yes")|sum(x %in% "yes")<2) %>%
dplyr::filter(is_uneligible) %>% dplyr::select(id) %>% unique
result <- data %>% filter(!id %in% uneligible$id)
This will, however, filter out id=3 in your example, as your dput doesn't match your data.
> result
id x
1 1 <NA>
2 1 yes
3 1 <NA>
4 1 yes
5 1 <NA>
6 1 <NA>
7 2 <NA>
8 2 <NA>
9 2 yes
10 2 yes
11 2 <NA>
data <- data.frame(id = rep(1:5, each = 5),
x = c(NA, 'yes', NA, 'yes', NA,
NA, NA, NA, 'yes', 'yes',
NA, 'no', "yes", NA, 'yes',
'no', 'yes', 'no', 'yes', NA,
'yes', NA, 'no','yes', 'no'))
twoYes <- function(x){
  v <- c()
  cum <- 0
  for (i in x){
    if (!is.na(i) && i == "yes"){
      cum <- cum + 1   # on "yes", increment the running count
    }else if (!is.na(i) && i == "no"){
      cum <- 0         # on "no", reset the count to zero
    }                  # on NA, keep the current count
    v <- c(v, cum)
  }
  return(v) # v > 1 means two "yes" with no "no" between them
}
library(dplyr)
df <- data |>
  group_by(id) |>
  mutate(v = twoYes(x)) |>
  filter(v > 1)
unique(df$id) # id: 1, 2, 3 have two continuous "yes"
[1] 1 2 3
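The same "running count" idea can also be expressed with run-length encoding instead of an explicit loop. This is only a sketch of an alternative, not part of the answer above; the helper name has_two_yes is hypothetical, and it uses the modified data from this answer.

```r
data <- data.frame(id = rep(1:5, each = 5),
                   x = c(NA, 'yes', NA, 'yes', NA,
                         NA, NA, NA, 'yes', 'yes',
                         NA, 'no', "yes", NA, 'yes',
                         'no', 'yes', 'no', 'yes', NA,
                         'yes', NA, 'no', 'yes', 'no'))

# Hypothetical helper: drop NA, then look for a run of "yes" of length >= 2.
has_two_yes <- function(x) {
  r <- rle(x[!is.na(x)])
  any(r$values == "yes" & r$lengths >= 2)
}

res <- tapply(data$x, data$id, has_two_yes)  # one logical per id
names(which(res))                            # ids that pass the check
```

With this data the check passes for ids 1, 2 and 3, which agrees with the loop-based result.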
We can add a variable a that encodes yes as 2, no as 1 and NA as 0, filter out the rows where a is 0 (the NA rows), and then apply zoo::rollsum with a window of 2. If any rolling sum equals 4, the group has two consecutive yes.
library(tidyverse)
data |>
mutate(a = case_when(x == "yes" ~ 2 , x == "no" ~ 1 , TRUE ~ 0)) |>
group_by(id) |> filter(a != 0) |>
mutate(b = c(first(a) , zoo::rollsum(a , 2))) |>
summarise(groups_to_keep = id[which(b == 4)]) -> gk
data |> filter(id %in% gk$groups_to_keep)
id x
1 1 <NA>
2 1 yes
3 1 <NA>
4 1 yes
5 1 <NA>
6 1 <NA>
7 2 <NA>
8 2 <NA>
9 2 yes
10 2 yes
11 2 <NA>
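To make the arithmetic concrete, the sketch below computes the sums of adjacent pairs by hand in base R, which is what a rolling sum with a window of 2 amounts to. The helper names encode and pair_sums are hypothetical; the 2/1/0 encoding is the one described above.

```r
# yes -> 2, no -> 1, NA -> 0 (same encoding as the variable a above)
encode <- function(x) ifelse(is.na(x), 0, ifelse(x == "yes", 2, 1))

pair_sums <- function(x) {
  a <- encode(x)
  a <- a[a != 0]              # drop the NA rows
  head(a, -1) + tail(a, -1)   # sums of adjacent pairs (window of 2)
}

pair_sums(c(NA, "yes", NA, "yes", NA, NA))  # id 1: 4, so two consecutive yes
pair_sums(c("no", "no", NA, NA, "yes"))     # id 3: 2 3, so no pair sums to 4
```

Only yes followed by yes gives 2 + 2 = 4; a yes next to a no gives 3. With the data as posted (row 13 is "no"), id 3 never reaches 4, which is why the result above contains only ids 1 and 2.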
I want to keep only those 'id' where 'x' (1) contains two "yes", and (2) there are no "no" values between the "yes". NA between the two "yes" is fine.
It is unclear what to do in the case of more than two "yes" values. This answer assumes that only ids with exactly 2 "yes" values are allowed. If this is not the case, simply modify the transition matrix.
Note that I modified data at row 13 ("no" -> "yes") to match the output shown in the question.
First define a state transition matrix. Here the rows are the states and the columns are the transitions. The initial state is row 1 (initial). The state of each id will change depending on what x value is encountered (e.g., encountering a yes while in the initial state will cause the system to transition to state 2 (1yes), etc.).
m <- matrix(as.integer(c(2,4,3,3,1,3,3,4,1,2,3,4)), 4, 3, dimnames = list(c("initial", "1yes", "drop", "keep"), c("yes", "no", "NA")))
m
#> yes no NA
#> initial 2 1 1
#> 1yes 4 3 2
#> drop 3 3 3
#> keep 3 4 4
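To see how the matrix is read, here is a minimal sketch that walks the x values of id 1 through m with a plain loop. The vector x1 and the variables idx, state and path are illustrative names only; the column encoding (yes = 1, no = 2, NA = 3) follows the column order of m.

```r
m <- matrix(as.integer(c(2,4,3,3,1,3,3,4,1,2,3,4)), 4, 3,
            dimnames = list(c("initial", "1yes", "drop", "keep"),
                            c("yes", "no", "NA")))

x1  <- c(NA, "yes", NA, "yes", NA, NA)           # x values of id 1
idx <- match(x1, c("yes", "no"), nomatch = 3L)   # NA falls through to column 3

state <- 1L                        # every id starts in "initial"
path  <- integer(length(idx))
for (k in seq_along(idx)) {
  state   <- m[state, idx[k]]      # look up the next state
  path[k] <- state
}
rownames(m)[path]
#> [1] "initial" "1yes"    "1yes"    "keep"    "keep"    "keep"
```

The id ends in state 4 (keep), so it is retained.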
Use Reduce to get the final state of each id in a grouping operation with data.table, keeping only those ids that end in state 4 (keep):
library(data.table)
setDT(data)[
, idx := match(x, c("yes", "no"), 3L)
][
, .(x = if (Reduce(function(i, j) m[i, j], idx, 1L) == 4L) x else character(0)), id
]
#> id x
#> 1: 1 <NA>
#> 2: 1 yes
#> 3: 1 <NA>
#> 4: 1 yes
#> 5: 1 <NA>
#> 6: 1 <NA>
#> 7: 2 <NA>
#> 8: 2 <NA>
#> 9: 2 yes
#> 10: 2 yes
#> 11: 2 <NA>
#> 12: 3 no
#> 13: 3 yes
#> 14: 3 <NA>
#> 15: 3 <NA>
#> 16: 3 yes
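As an illustration of "modify the transition matrix", the sketch below assumes that ids with more than two yes values should also be kept. That rule is an assumption, not something stated in the question. The only change is that a yes encountered in the keep state stays in keep; m2 and final_state are hypothetical names.

```r
m <- matrix(as.integer(c(2,4,3,3,1,3,3,4,1,2,3,4)), 4, 3,
            dimnames = list(c("initial", "1yes", "drop", "keep"),
                            c("yes", "no", "NA")))

m2 <- m
m2["keep", "yes"] <- 4L   # assumed rule: a third "yes" no longer drops the id

# Hypothetical helper: final state of one id under transition matrix tm
final_state <- function(x, tm) {
  idx <- match(x, c("yes", "no"), nomatch = 3L)
  Reduce(function(i, j) tm[i, j], idx, 1L)
}

x <- c("yes", NA, "yes", "yes")
final_state(x, m)    # 3 (drop) under the original matrix
final_state(x, m2)   # 4 (keep) under the modified matrix
```

Other rules, such as how to treat a no that appears after the id has reached keep, would be encoded the same way by editing the corresponding cell.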
library(dplyr)
data |>
group_by(id) |>
filter(any(x[!is.na(x)] == 'yes' & lag(x[!is.na(x)]) == 'yes'))
# id x
# <dbl> <chr>
# 1 1 NA
# 2 1 yes
# 3 1 NA
# 4 1 yes
# 5 1 NA
# 6 1 NA
# 7 2 NA
# 8 2 NA
# 9 2 yes
# 10 2 yes
# 11 2 NA
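The reason for subsetting with !is.na(x) before comparing can be checked on id 1 alone. In this sketch, shift is a hypothetical base-R stand-in that is assumed to behave like dplyr::lag with its default settings (shift by one position, pad with NA).

```r
shift <- function(v) c(NA, head(v, -1))    # previous element, NA for the first

x <- c(NA, "yes", NA, "yes", NA, NA)       # x values of id 1

# Comparing on the raw column: every "yes" has an NA right before it
raw <- any(x == "yes" & shift(x) == "yes", na.rm = TRUE)

# Comparing after dropping NA: the two "yes" become adjacent
y <- x[!is.na(x)]
clean <- any(y == "yes" & shift(y) == "yes", na.rm = TRUE)

raw     # FALSE
clean   # TRUE
```

This is why applying lag to the raw column, as in the question's attempt, misses id 1, while dropping NA first finds it.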
Data:
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
x = c(NA,'yes',NA,'yes',NA,NA,NA,NA,'yes','yes',NA,'no', 'no',NA,NA,'yes',
'no','yes','no','yes','no', 'yes',NA, 'no','yes', 'no'))