R data.table: remove rows where one column is duplicated if another column is NA

Question

Here is an example data.table:

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

   col1   col2
1:    A     NA
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I want to remove rows where col1 is duplicated and col2 is NA, provided another row with the same col1 has a non-NA col2. In other words: group by col1, then if a group has more than one row and one of them is NA, remove that row. This would be the result for dt:

   col1   col2
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I tried this:

dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1]

   col1 col2
1:    A  dog
2:    B  cat
3:    C jeep
4:    D   NA

What am I missing? Thank you.

Answer 1

group by col1, then if group has more than one row and one of them is NA, remove it.

Use an anti-join:

dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]

   col1   col2
1:    A    dog
2:    B    cat
3:    C   jeep
4:    C porsch
5:    D     NA
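
To make the anti-join above easier to follow, here is a small sketch that splits it into two steps on the example table from the question. The names to_drop and res are illustrative only and not part of the original answer.

```r
library(data.table)

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
                 col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

# Inner step: for every group with more than one row, build a single row
# whose non-grouping columns are all NA (.SD[NA_integer_] indexes .SD with NA).
to_drop <- dt[, if (.N > 1L) .SD[NA_integer_], by = col1]
print(to_drop)   # the (col1, NA) pairs that are candidates for removal

# Outer step: anti-join on all columns, keeping the rows of dt that have
# no exact match in to_drop.
res <- dt[!to_drop, on = names(dt)]
print(res)
```

Group C appears in to_drop, but since dt has no (C, NA) row, nothing is removed from that group.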

Benchmark from @thela, but assuming there are no (full) dupes in the original data:

set.seed(1)
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
dt2 = unique(dt2a)

system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#    user  system elapsed 
#    0.73    0.06    0.81

system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#    user  system elapsed 
#    2.86    0.03    2.89 

system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
#    user  system elapsed 
#    0.39    0.01    0.41 

fsetequal(res, res_thela) # TRUE
fsetequal(res, res_psidom) # TRUE

I changed the approach a little for speed. If data.table had a having= argument, this might become faster and more legible.
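
The no-full-duplicates assumption in the benchmark matters because the anti-join implements "group has more than one row and one of them is NA", which is not quite the same as "NA with a non-NA value in another row". The sketch below uses a hypothetical table dup (not from the original thread) where a group consists only of repeated NA rows, and compares the anti-join with the index-based removal from Answer 3.

```r
library(data.table)

# Hypothetical data: group A has two identical all-NA rows,
# group B has one NA row and one non-NA row.
dup <- data.table(col1 = c("A", "A", "B", "B"),
                  col2 = c(NA, NA, NA, "x"))

# Anti-join: removes every (col1, NA) row in any group with more than one row,
# so both A rows are dropped even though A has no non-NA value.
res_anti <- dup[!dup[, if (.N > 1L) .SD[NA_integer_], by = col1], on = names(dup)]

# Index-based removal: drops an NA row only if its group also has a non-NA value,
# so the two A rows are kept.
res_idx <- dup[-dup[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1]

print(res_anti)
print(res_idx)
```

On data without full duplicates the two approaches agree, which is what the fsetequal checks in the benchmark confirm.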

Answer 2

You misplaced a parenthesis (maybe a typo); I suppose it should be length(col1) > 1. You also used ifelse with a scalar condition, which will not work as you expect: the result has the same length as the condition, so only the first element of the vector is picked up. If you want to remove NA values from a group when it also has non-NA values, you can use if/else:

dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]

#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
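
Both problems can be seen in isolation with plain vectors. This is a minimal base R sketch; the variable names are made up for illustration.

```r
col1 <- c("A", "A")
col2 <- c(NA, "dog", "cat")

# length(col1 > 1) measures the length of the comparison result, not the group size,
# so it is non-zero (and treated as TRUE) for any non-empty group.
wrong_cond <- length(col1 > 1)
right_cond <- length(col1) > 1

# ifelse() returns a result shaped like its test argument; with a length-one test
# only the first element of the chosen branch survives.
one_elem <- ifelse(TRUE, col2[!is.na(col2)], col2)

# if/else returns the whole chosen branch.
all_elems <- if (TRUE) col2[!is.na(col2)] else col2

print(wrong_cond)
print(right_cond)
print(one_elem)
print(all_elems)
```

This is why the attempt in the question returned only jeep for group C.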

Answer 3

An attempt to find all the NA cases in groups where there is also a non-NA value, and then remove those rows:

dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
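
As a sketch of what the one-liner does, the steps can be separated as below. One caution: in base R, negative indexing with an empty vector selects nothing rather than everything. I have not confirmed how data.table handles that case, so this variant sidesteps it by building a logical mask. The names drop_idx, keep and clean are illustrative assumptions.

```r
library(data.table)

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
                 col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

# .I holds the row numbers (in the full table) of the current group's rows.
# Keep the row numbers that are NA in a group that also has a non-NA value.
drop_idx <- dt[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1

# Logical mask instead of negative indices: TRUE for rows to keep.
keep <- !(seq_len(nrow(dt)) %in% drop_idx)
res <- dt[keep]
print(res)

# Hypothetical table with nothing to remove: the mask keeps every row.
clean <- data.table(col1 = c('A', 'B'), col2 = c('x', NA))
clean_idx <- clean[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1
res_clean <- clean[!(seq_len(nrow(clean)) %in% clean_idx)]
print(res_clean)
```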

This seems quicker, though I'm sure someone will turn up with an even quicker version shortly:

set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#   user  system elapsed 
#   1.49    0.02    1.51 
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#   user  system elapsed 
#   4.49    0.04    4.54


Answer 1
3 2017-08-08 00:45:54

Answer 2
2 2017-08-07 23:30:29

Answer 3
2 accepted 2017-08-07 23:40:41
