How to remove duplicates if a specific column has a value in R
I need to delete some rows in my dataset based on a given condition. Please go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. For IDs that have more than one record (123, 789, 852), I have to delete the rows where Dur is not NA. That means removing every row of ID 123 and the first record of IDs 789 and 852. I don't want to delete IDs that have a single record with a Dur value (564, 741), or any rows where Dur is NA.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest code to solve the issue. Thanks in advance!
One way would be to select rows where the number of rows in the group is 1 or where Dur is NA.
This can be written in dplyr as:
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
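To see why this condition works, it can help to materialize its two parts as columns before filtering. The following is a minimal sketch of that idea, not part of the original answer; the column names group_size and keep are hypothetical, and the Date column is left out for brevity.

```r
library(dplyr)

# Same ID and Dur values as the sample data in the question
df <- data.frame(
  ID  = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

flags <- df %>%
  group_by(ID) %>%
  mutate(
    group_size = n(),                      # number of rows sharing this ID
    keep = group_size == 1 | is.na(Dur)    # single-row ID, or Dur is missing
  ) %>%
  ungroup()

# Keeping only the flagged rows gives the same rows as the one-line filter()
result <- flags %>% filter(keep) %>% select(ID, Dur)
print(result)
```

Inspecting flags shows that both rows of ID 123 have keep equal to FALSE, which is why that ID disappears entirely.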
Using data.table:
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R:
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
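The ave() one-liner may be easier to follow when split into separate steps. The sketch below is an illustrative breakdown rather than the answer's own code; group_size, keep and result are names introduced here for the example, and the data frame is rebuilt inline so the block runs on its own.

```r
# Sample data from the question
df <- data.frame(
  ID   = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Date = c("01/05/2000", "08/04/2002", "04/04/2012", "01/08/2011", "02/03/2009",
           "08/01/2010", "05/05/2011", "06/06/2015", "03/02/2016", "03/02/2008",
           "01/01/2009", "07/07/2008"),
  Dur  = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Step 1: for every row, count how many rows share its ID
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# Step 2: keep a row if its ID occurs once, or if its Dur is missing
keep <- group_size == 1 | is.na(df$Dur)

# Step 3: subset with the logical vector
result <- df[keep, ]
print(result)
```

Because keep is an ordinary logical vector, it can be checked row by row against the expected output before subsetting.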
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
We can use .I in data.table:
library(data.table)
# df1 is assumed to hold the same example data as df above
setDT(df1)[df1[, .I[.N == 1 | is.na(Dur)], ID]$V1]
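As a rough illustration of what the .I approach does, the inner grouped call can be run on its own: .I holds the row numbers of the current group within the whole table, so the grouped result is a set of row indices (in the auto-named column V1) that is then used to subset the table. This sketch assumes the same sample values as the question; dt and idx are hypothetical names, and the Date column is omitted.

```r
library(data.table)

# Same ID and Dur values as the sample data in the question
dt <- data.table(
  ID  = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Per ID, collect the row numbers that satisfy the condition;
# groups with no matching rows (ID 123 here) contribute nothing
idx <- dt[, .I[.N == 1 | is.na(Dur)], by = ID]
print(idx)

# Use the collected row numbers to subset the original table
result <- dt[idx$V1]
print(result)
```

Indexing once by row number avoids building a subset of .SD for every group, which is a common reason to reach for .I on larger tables.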