
R data.table remove rows where one column is duplicated if another column is NA

Here is an example data.table:

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

   col1   col2
1:    A     NA
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I want to remove rows where col1 is duplicated if col2 is NA and another row with the same col1 has a non-NA value. In other words: group by col1, then if a group has more than one row and one of them is NA, remove it. This would be the result for dt:

   col1   col2
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I tried this:

dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1]

   col1 col2
1:    A  dog
2:    B  cat
3:    C jeep
4:    D   NA

What am I missing? Thank you.

group by col1, then if group has more than one row and one of them is NA, remove it.

Use an anti-join:

dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]

   col1   col2
1:    A    dog
2:    B    cat
3:    C   jeep
4:    C porsch
5:    D     NA
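To make the anti-join easier to follow, here is a minimal sketch that splits it into its two steps on the example data. The name to_drop is introduced only for illustration; everything else comes from the one-liner above.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "B", "C", "C", "D"),
                 col2 = c(NA, "dog", "cat", "jeep", "porsch", NA))

# Step 1: for every col1 group with more than one row, emit a single row
# whose non-grouping columns are all NA. Single-row groups emit nothing.
# (to_drop is an illustrative name, not part of the original answer.)
to_drop <- dt[, if (.N > 1L) .SD[NA_integer_], by = col1]
print(to_drop)

# Step 2: anti-join. Keep only the rows of dt that have no match in
# to_drop when compared on every column.
res <- dt[!to_drop, on = names(dt)]
print(res)
```

Because group D has a single row, it never appears in to_drop, so its NA row survives the anti-join.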

Benchmark from @thela, but assuming there are no (full) dupes in the original data:

set.seed(1)
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
dt2 = unique(dt2a)

system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#    user  system elapsed 
#    0.73    0.06    0.81

system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#    user  system elapsed 
#    2.86    0.03    2.89 

system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
#    user  system elapsed 
#    0.39    0.01    0.41 

fsetequal(res, res_thela) # TRUE
fsetequal(res, res_psidom) # TRUE

I changed the approach a little for speed. With a having= argument, this might become faster and more legible.
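The "no full dupes" assumption matters because the .N > 1L test only counts rows and does not check whether any of them is non-NA. A small sketch, using a made-up table dt_dupes (hypothetical, not from the benchmark), shows where the anti-join and @thela's row-index approach diverge:

```r
library(data.table)

# Hypothetical data: group E consists of two identical all-NA rows.
dt_dupes <- data.table(col1 = c("A", "A", "E", "E"),
                       col2 = c(NA, "dog", NA, NA))

# Anti-join: any group with more than one row loses its NA rows,
# so group E disappears entirely.
res_anti <- dt_dupes[!dt_dupes[, if (.N > 1L) .SD[NA_integer_], by = col1],
                     on = names(dt_dupes)]

# Row-index approach: NA rows are dropped only when the same group also
# holds a non-NA value, so group E is kept.
drop_idx <- dt_dupes[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1
res_idx  <- dt_dupes[-drop_idx]

print(res_anti)
print(res_idx)
```

Calling unique() first, as the benchmark does, collapses such groups to a single row, after which the two approaches agree.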

You misplaced a parenthesis (maybe a typo); I suppose it should be length(col1) > 1. You also used ifelse on a scalar condition, which will not work as you expect (only the first element of the vector is picked up). If you want to remove NA values from a group when it also contains non-NA values, you can use if/else:

dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]

#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
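The difference between ifelse and if/else can be seen without data.table at all. This is a small base R sketch using the values of group C; the variable names n_test, from_ifelse and from_if are only for illustration.

```r
# Values of one group (C) from the example data
col1 <- c("C", "C")
col2 <- c("jeep", "porsch")

# length(col1 > 1) measures the comparison vector, i.e. the group size,
# so it is never zero for a non-empty group and the test is always true.
n_test <- length(col1 > 1)

# ifelse() returns a result shaped like its test. The test here has
# length 1, so only the first element of the "yes" vector is returned.
from_ifelse <- ifelse(length(col1) > 1, col2[!is.na(col2)], col2)

# if/else evaluates one branch and returns it whole.
from_if <- if (length(col1) > 1) col2[!is.na(col2)] else col2

print(n_test)       # 2
print(from_ifelse)  # "jeep"
print(from_if)      # "jeep" "porsch"
```

This is why the original attempt returned only jeep for group C.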

An attempt to find all the NA cases in groups where there is also a non-NA value, and then remove those rows:

dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
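As a sketch of what the inner call produces, the following prints the row numbers that .I collects, then shows a variant that gathers the rows to keep instead of the rows to drop. The keep-index variant is an assumption of this example rather than part of the answer above; it avoids negative indexing, and the sort() restores the original row order in case groups are not contiguous.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "B", "C", "C", "D"),
                 col2 = c(NA, "dog", "cat", "jeep", "porsch", NA))

# .I holds the row numbers of dt for the current group. The logical
# condition flags NA rows only in groups that also have a non-NA value.
drop_idx <- dt[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1
print(drop_idx)  # 1

# Illustrative variant: negate the condition to collect rows to keep.
# Every group keeps at least one row, so V1 is never empty here.
keep_idx <- dt[, .I[!(any(!is.na(col2)) & is.na(col2))], by = col1]$V1
res <- dt[sort(keep_idx)]
print(res)
```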

Seems quicker, though I'm sure someone is going to turn up with an even quicker version shortly:

set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#   user  system elapsed 
#   1.49    0.02    1.51 
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#   user  system elapsed 
#   4.49    0.04    4.54 


