Removing duplicates based on a specific category of another column

I would like to remove duplicate IDs from my data based on the category column. A subset of my data is as follows:

df <- data.frame(ID=c(1,2,3,4,1,4,2),
                 category=c("a","b","c","d","b","a","a"))
df

  ID category
1  1        a
2  2        b
3  3        c
4  4        d
5  1        b
6  4        a
7  2        a

If a duplicated ID has a row in category b, I need to keep that row and remove the rows with the same ID from the other categories. I have no preference when the duplicated IDs come only from categories other than b. So my desired outcome is:

  ID category
1  2        b
2  3        c
3  4        d
4  1        b

I have already read this post: "R: Remove duplicates from a dataframe based on categories in a column", but couldn't find my answer there.

We could use arrange so that the 'b' category rows are placed at the top, and then get the distinct rows by 'ID'. distinct keeps the first row it meets for each ID, so a 'b' row wins whenever one exists.

library(dplyr)
df %>%
     arrange(category != 'b') %>% 
     distinct(ID, .keep_all = TRUE)

-output

  ID category
1  2        b
2  1        b
3  3        c
4  4        d
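
The sort key works because `category != 'b'` is a logical vector and FALSE sorts before TRUE, so the 'b' rows come first while the remaining rows keep their original relative order. A small base R sketch of that key on the sample data (the names `key` and `ord` are introduced here only for illustration):

```r
df <- data.frame(ID = c(1, 2, 3, 4, 1, 4, 2),
                 category = c("a", "b", "c", "d", "b", "a", "a"))

# FALSE for rows in category 'b', TRUE for every other row
key <- df$category != "b"

# order() places FALSE before TRUE; tied rows keep their original order
ord <- order(key)

# Rows 2 and 5 (the 'b' rows) now come first
df[ord, ]
```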

Or using base R

df[order(df$category != 'b'), ] -> df1
df1[!duplicated(df1$ID), ]
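
If the preferred category is not always 'b', the same two steps could be wrapped in a function. This is only a sketch: `keep_preferred` and its `preferred` argument are hypothetical names, not part of base R or dplyr, and the columns are assumed to be called `ID` and `category` as in the question.

```r
# Hypothetical helper: keep one row per ID, preferring rows whose
# category equals `preferred` (assumes columns named ID and category).
keep_preferred <- function(data, preferred = "b") {
  # Rows in the preferred category sort first (FALSE before TRUE)
  sorted <- data[order(data$category != preferred), ]
  # duplicated() flags every occurrence of an ID after the first one
  sorted[!duplicated(sorted$ID), ]
}

df <- data.frame(ID = c(1, 2, 3, 4, 1, 4, 2),
                 category = c("a", "b", "c", "d", "b", "a", "a"))

res <- keep_preferred(df)
res
```

The original row names are preserved by the indexing, so `rownames(res) <- NULL` can be used afterwards if consecutive row numbers are wanted.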

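In base R you could do the following. Note that it drops every row whose category matches that of a non-'b' row sharing an ID with a 'b' row, so it filters by category rather than by row; it reproduces the desired rows for this sample because all the rows to be removed are in category a.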

subset(df, !category %in% category[ID %in% ID[category == 'b'] & category != 'b'])

  ID category
2  2        b
3  3        c
4  4        d
5  1        b
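
A row-by-row formulation of the same rule, which drops individual rows rather than whole categories, can be sketched as follows. It assumes the columns are named `ID` and `category` as in the question; `ids_with_b`, `kept`, and `res` are names introduced here for illustration.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 1, 4, 2),
                 category = c("a", "b", "c", "d", "b", "a", "a"))

# IDs that occur at least once in category 'b'
ids_with_b <- unique(df$ID[df$category == "b"])

# Keep the 'b' rows, plus rows whose ID never appears in category 'b'
kept <- df[df$category == "b" | !(df$ID %in% ids_with_b), ]

# Among the remaining rows, keep the first row of each ID
res <- kept[!duplicated(kept$ID), ]
res
```

On the sample data this keeps the original rows 2, 3, 4 and 5, in the same order as the desired outcome in the question.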
