
How to remove duplicates if a specific column has a value in R

I need to delete some rows in my dataset based on a given condition. Please go through the sample data below for reference.

ID  Date       Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA

My main concern is the Dur column. For IDs that have more than one record (123, 789, 852), I have to delete the rows where Dur is not NA. That means removing every row of ID 123 and the first record of IDs 789 and 852. I don't want to delete IDs that have a single record with a Dur value (564, 741), or any rows where Dur is NA.

Expected output:

ID  Date       Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA

Kindly suggest code to solve this. Thanks in advance!

One way is to keep the rows where the ID group has only one row or where Dur is NA.

This can be written in dplyr as:

library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))

#    ID Date         Dur
#  <int> <chr>      <int>
#1   564 04/04/2012     2
#2   741 01/08/2011     5
#3   789 08/01/2010    NA
#4   789 05/05/2011    NA
#5   852 03/02/2016    NA
#6   155 03/02/2008    NA
#7   155 01/01/2009    NA
#8   159 07/07/2008    NA

Using data.table:

library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]

and base R:

subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
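The base R one-liner packs two tests into a single ave() call. A minimal sketch of the same logic with each step spelled out is below; the names group_size, dur_missing and keep are illustrative, and the Date column is left out for brevity.

```r
# Same IDs and Dur values as the sample data (Date omitted for brevity)
df <- data.frame(
  ID  = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Number of rows in each ID group, repeated for every row of that group
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# TRUE where Dur is missing
dur_missing <- is.na(df$Dur)

# Keep a row if its ID occurs only once, or if its Dur is NA
keep <- group_size == 1 | dur_missing

result <- df[keep, ]
result
```

Writing the condition this way makes it easy to inspect group_size and dur_missing row by row when the result is not what you expect.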

Data

df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002", 
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011", 
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)), 
class = "data.frame", row.names = c(NA, -12L))

We can use .I in data.table:

library(data.table)
setDT(df)[df[, .I[.N == 1 | is.na(Dur)], ID]$V1]
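To make the .I idiom easier to follow, here is a rough sketch that splits it into two steps. It assumes the data.table package is installed; dt, idx and result are illustrative names, and the Date column is omitted for brevity.

```r
library(data.table)

# Same IDs and Dur values as the sample data (Date omitted for brevity)
dt <- data.table(
  ID  = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Step 1: within each ID, .I holds the row numbers of that group in the
# full table; subsetting .I by the condition keeps only the wanted rows.
# The unnamed result column gets the default name V1.
idx <- dt[, .I[.N == 1 | is.na(Dur)], by = ID]

# Step 2: use the collected row numbers to subset the original table
result <- dt[idx$V1]
result
```

Collecting row numbers first and subsetting once is often suggested as a faster alternative to .SD on large tables, though that is worth benchmarking on your own data.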

