Flagging an id when similar columns have different values in R
I need to flag an id when it has different grade values across its grade columns. Here is what my sample dataset looks like:
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df$Grade <- paste0(df$grade.1, df$grade.2, df$grade.3, df$grade.4)
> df
id grade.1 grade.2 grade.3 grade.4 Grade
1 11 3 3 4 NA 334NA
2 22 4 4 4 NA 444NA
3 33 5 5 6 NA 556NA
4 44 6 NA 5 5 6NA55
5 55 7 7 7 7 7777
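When an id has different grade values across grade.1, grade.2, grade.3, and grade.4, that row needs to be flagged. An NA in any of these columns should not affect the flagging.
In other words, if the digits in the final Grade column are not all the same, that id needs to be flagged.
My desired output should look like this: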
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA Not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 Not_flagged
Any ideas? Thanks!
A possible solution:
library(tidyverse)
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df %>%
rowwise %>%
mutate(flag = if_else(length(unique(na.omit(c_across(2:5)))) == 1,
"not-flagged", "flagged")) %>% ungroup
#> # A tibble: 5 × 6
#> id grade.1 grade.2 grade.3 grade.4 flag
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 11 3 3 4 NA flagged
#> 2 22 4 4 4 NA not-flagged
#> 3 33 5 5 6 NA flagged
#> 4 44 6 NA 5 5 flagged
#> 5 55 7 7 7 7 not-flagged
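To see why the condition inside mutate works, it can help to run the inner expression on a single row by hand. The snippet below is a minimal base R sketch using the grades of id 44; the variable names row_values and distinct_values are made up for illustration.

```r
# Grades for id 44, including the missing grade.2 value
row_values <- c(6, NA, 5, 5)

# Drop the NA first, then keep one copy of each remaining value
distinct_values <- unique(na.omit(row_values))
distinct_values                 # 6 5

# More than one distinct value remains, so this row gets flagged
length(distinct_values) == 1    # FALSE
```

A row such as c(4, 4, 4, NA) leaves a single distinct value after the NA is dropped, so the same test returns TRUE and the row is not flagged.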
Using data.table::uniqueN, which counts the number of unique elements in a vector (and allows NAs to be dropped):
library(data.table)
library(dplyr)
df %>%
rowwise %>%
mutate(flag = if_else(uniqueN(c_across(2:5), na.rm = TRUE) == 1,
"not-flagged", "flagged")) %>% ungroup
Here is a base R approach.
df$flag <- c("not_flagged", "flagged")[
apply(df[-1L], 1L, \(x) length( (ux <- unique(x))[!is.na(ux)] ) > 1L) + 1L
]
Output
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
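The apply call above runs a function once per row. As an alternative sketch (not part of the original answer), the same flag can be computed in a vectorized way by comparing each row's smallest and largest non-NA grade with pmin and pmax. This assumes the grade columns are numeric and named grade.1 through grade.4; a row whose grades are all NA would end up with NA instead of a flag.

```r
df <- data.frame(id = c(11, 22, 33, 44, 55),
                 grade.1 = c(3, 4, 5, 6, 7),
                 grade.2 = c(3, 4, 5, NA, 7),
                 grade.3 = c(4, 4, 6, 5, 7),
                 grade.4 = c(NA, NA, NA, 5, 7))

# Select the grade columns by name rather than by position
grades <- df[paste0("grade.", 1:4)]

# Row-wise minimum and maximum, ignoring NAs
row_min <- do.call(pmin, c(grades, na.rm = TRUE))
row_max <- do.call(pmax, c(grades, na.rm = TRUE))

# If the extremes differ, the row holds at least two distinct grades
df$flag <- ifelse(row_min != row_max, "flagged", "not_flagged")
df
```

Selecting the columns by name also keeps the result unaffected if the data frame carries extra columns such as the Grade string from the question.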
A base R solution using rle that omits NA values.
df$flag <- apply(df[,2:5], 1, function(x)
ifelse(length(rle(x[!is.na(x)])$lengths)==1, "not_flagged", "flagged"))
df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
df <- structure(list(id = c(11, 22, 33, 44, 55), grade.1 = c(3, 4,
5, 6, 7), grade.2 = c(3, 4, 5, NA, 7), grade.3 = c(4, 4, 6, 5,
7), grade.4 = c(NA, NA, NA, 5, 7)), class = "data.frame", row.names = c(NA,
-5L))
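For readers unfamiliar with rle, the following small sketch shows what the rle-based answer above computes for one row. rle splits a vector into runs of consecutive equal values, so a row whose non-NA grades are all identical produces exactly one run. The variable names are made up for illustration.

```r
# Grades for id 44; the NA is removed before the runs are computed
x <- c(6, NA, 5, 5)
runs <- rle(x[!is.na(x)])

runs$values     # 6 5
runs$lengths    # 1 2

# Two runs means the grades differ, so the row is flagged
length(runs$lengths) == 1    # FALSE

# A row with identical grades collapses into a single run
length(rle(c(7, 7, 7, 7))$lengths) == 1    # TRUE
```

Because rle counts runs rather than distinct values, a row such as c(3, 4, 3) yields three runs, but the comparison against 1 still flags it correctly.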
n_distinct from dplyr is very helpful here. This is a version that combines pivot_longer and pivot_wider:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(id, Grade),
names_to = "name",
values_to = "value"
) %>%
group_by(id) %>%
mutate(flag = ifelse(n_distinct(value, na.rm = TRUE)==1, "Not flagged", "Flagged")) %>%
pivot_wider(
names_from = name,
values_from = value
)
id Grade flag grade.1 grade.2 grade.3 grade.4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 334NA Flagged 3 3 4 NA
2 22 444NA Not flagged 4 4 4 NA
3 33 556NA Flagged 5 5 6 NA
4 44 6NA55 Flagged 6 NA 5 5
5 55 7777 Not flagged 7 7 7 7
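As a small sketch of the building block used in the pipeline above, n_distinct can be tried on a plain vector. It assumes dplyr is installed; the vector x is made up for illustration and mirrors the grades of id 11.

```r
library(dplyr)

x <- c(3, 3, 4, NA)

# By default the NA is counted as a distinct value of its own
n_distinct(x)                  # 3

# With na.rm = TRUE only the values 3 and 4 are counted
n_distinct(x, na.rm = TRUE)    # 2
```

Since the count with na.rm = TRUE is not 1, this id would be labelled "Flagged" by the ifelse in the answer.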