Removing duplicates based on a specific category of another column
I would like to remove duplicate IDs from my data using the category column. A subset of my data is as follows:
df <- data.frame(ID=c(1,2,3,4,1,4,2),
category=c("a","b","c","d","b","a","a"))
df
ID category
1 1 a
2 2 b
3 3 c
4 4 d
5 1 b
6 4 a
7 2 a
If a duplicated ID appears in category b, I need to keep that row and remove the rows with the same ID from the other categories. If the duplicated IDs come only from categories other than b, I have no preference about which row is kept. So my desired outcome is:
ID category
1 2 b
2 3 c
3 4 d
4 1 b
I have already read this post: R: Remove duplicates from a dataframe based on categories in a column, but I could not find my answer there.
We could use arrange so that the 'b' category rows are placed at the top, and then get the distinct rows by 'ID':
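library(dplyr)
df %>%
  # category != 'b' is FALSE for 'b' rows, and FALSE sorts before TRUE
  arrange(category != 'b') %>%
  # keep the first row for each ID, retaining all other columns
  distinct(ID, .keep_all = TRUE)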
Output:
ID category
1 2 b
2 1 b
3 3 c
4 4 d
Or using base R:
df1 <- df[order(df$category != 'b'), ]
df1[!duplicated(df1$ID), ]
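To see why ordering on a logical condition works, the base R steps can be unpacked as in the sketch below. The intermediate names `key`, `ord`, and `res` are introduced here for illustration only, and `stringsAsFactors = FALSE` is set explicitly as an assumption so that the example behaves the same on older R versions.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 1, 4, 2),
                 category = c("a", "b", "c", "d", "b", "a", "a"),
                 stringsAsFactors = FALSE)

# Logical sort key: FALSE for 'b' rows, TRUE for everything else.
key <- df$category != "b"

# FALSE sorts before TRUE, and ties keep their original order,
# so the 'b' rows move to the top.
ord <- order(key)
df1 <- df[ord, ]

# duplicated() flags every occurrence of an ID after the first,
# so the 'b' row wins whenever an ID has one.
res <- df1[!duplicated(df1$ID), ]
print(res)
```

For IDs without a 'b' row, the first row in the original order is the one that survives, which fits the stated lack of preference among the other categories.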
In base R you could do:
subset(df, !category %in% category[ID %in% ID[category == 'b'] & category !='b'])
ID category
1 2 b
2 3 c
3 4 d
4 1 b
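Note that the `subset()` call above filters on category values rather than on individual rows: it drops every row whose category also occurs as a non-'b' duplicate of an ID that has a 'b' row. That gives the desired result on the sample data, but on other data it could also remove rows for IDs that have no 'b' row at all. A row-wise variant might look like the following sketch; the names `has_b`, `keep`, and `res` are hypothetical and not part of the original answers.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 1, 4, 2),
                 category = c("a", "b", "c", "d", "b", "a", "a"),
                 stringsAsFactors = FALSE)

# TRUE for every row whose ID appears at least once in category 'b'.
has_b <- df$ID %in% df$ID[df$category == "b"]

# Drop only the non-'b' rows of IDs that also have a 'b' row.
keep <- !(has_b & df$category != "b")
res <- df[keep, ]

# Remaining duplicates come from non-'b' categories only;
# keep the first occurrence of each ID.
res <- res[!duplicated(res$ID), ]
print(res)
```

Because the filter works per row, an ID that only ever appears in category a would be kept here, whereas the category-based filter would discard it whenever a is among the removed categories.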