R data.table: remove rows where one column is duplicated if another column is NA
Here is an example data.table:
library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))
col1 col2
1: A NA
2: A dog
3: B cat
4: C jeep
5: C porsch
6: D NA
I want to remove rows where col1 is duplicated if col2 is NA and col1 has a non-NA value in another row. In other words: group by col1, then if a group has more than one row and one of them is NA, remove that row. This would be the result for dt:
col1 col2
2: A dog
3: B cat
4: C jeep
5: C porsch
6: D NA
I tried this:
dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1]
col1 col2
1: A dog
2: B cat
3: C jeep
4: D NA
What am I missing? Thank you.
group by col1, then if group has more than one row and one of them is NA, remove it.
Use an anti-join:
dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]
col1 col2
1: A dog
2: B cat
3: C jeep
4: C porsch
5: D NA
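The one-liner above can be unpacked into two steps, which may make the anti-join easier to follow. This is only a sketch on the question's example data; the intermediate name to_drop is introduced here for illustration and is not part of the original answer.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "B", "C", "C", "D"),
                 col2 = c(NA, "dog", "cat", "jeep", "porsch", NA))

# Step 1: for every col1 group with more than one row, build a single
# row whose col2 is NA. Single-row groups (B and D) contribute nothing.
to_drop <- dt[, if (.N > 1L) .SD[NA_integer_], by = col1]

# Step 2: anti-join on all columns. A row of dt is dropped only when both
# its col1 and its (NA) col2 match a row of to_drop.
res <- dt[!to_drop, on = names(dt)]
print(res)
```

Note that, as written, a group made up solely of several NA rows would lose all of them, which is why the benchmark that follows assumes there are no full duplicates.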
Benchmark from @thela, but assuming there are no (full) dupes in the original data:
set.seed(1)
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
dt2 = unique(dt2a)
system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
# user system elapsed
# 0.73 0.06 0.81
system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
# user system elapsed
# 2.86 0.03 2.89
system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
# user system elapsed
# 0.39 0.01 0.41
fsetequal(res, res_thela) # TRUE
fsetequal(res, res_psidom) # TRUE
I changed my approach a little for speed. With a having= argument, this might become faster and more legible.
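To see what the faster benchmark variant does, it can be run piece by piece on the small dt from the question. This is a sketch rather than a new method; the names multi and res_fast are hypothetical, and NA_character_ stands in for the benchmark's dt2$col2[NA_integer_] because col2 is a character column here.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "B", "C", "C", "D"),
                 col2 = c(NA, "dog", "cat", "jeep", "porsch", NA))

# Count rows per group, keep only groups with more than one row,
# and drop the helper column N so that only col1 remains.
multi <- dt[, .N, by = col1][N > 1L, !"N"]

# Add a col2 column that is NA of the same type as dt$col2.
multi[, col2 := NA_character_]

# Anti-join on both columns: removes the NA row of each multi-row group.
res_fast <- dt[!multi, on = names(dt)]
print(res_fast)
```

Counting with .N and filtering afterwards avoids evaluating an if expression and subsetting .SD once per group, which is presumably where the speed-up comes from.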
You missed a parenthesis (maybe a typo); I suppose it should be length(col1) > 1. You also used ifelse on a scalar condition, which will not work as you expect (only the first element of the vector is picked up). If you want to remove NA values from a group when it also contains non-NA values, you can use if/else:
dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]
# col1 col2
#1: A dog
#2: B cat
#3: C jeep
#4: C porsch
#5: D NA
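Both problems with the original attempt can be reproduced outside data.table. The sketch below uses plain vectors standing in for the col1 = 'C' group; the variable names are chosen here for illustration only.

```r
# Values of the col1 == "C" group from the question
col1 <- c("C", "C")
col2 <- c("jeep", "porsch")

# Misplaced parenthesis: this is the length of a comparison vector,
# not a test of the group size. Being non-zero, ifelse() treats it as TRUE.
wrong_test <- length(col1 > 1)
# Intended scalar test
right_test <- length(col1) > 1

# ifelse() returns a result shaped like its test, so a length-1 test
# keeps only the first element of the "yes" vector.
with_ifelse <- ifelse(right_test, col2[!is.na(col2)], col2)

# if/else evaluates one branch and returns it whole.
with_if <- if (right_test) col2[!is.na(col2)] else col2

print(wrong_test)
print(with_ifelse)
print(with_if)
```

This is why the attempt in the question returned only jeep for group C.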
An attempt to find all the NA cases in groups where there is also a non-NA value, and then remove those rows:
dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
# col1 col2
#1: A dog
#2: B cat
#3: C jeep
#4: C porsch
#5: D NA
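One caveat with negative indexing is worth keeping in mind: in R, subsetting with the negation of an empty index selects nothing rather than everything, and I believe the same applies to row subsetting here when no row qualifies for removal. The sketch below shows the base R behaviour and one possible workaround using setdiff; dt_clean, idx and keep_rows are hypothetical names, not part of the original answer.

```r
library(data.table)

# Base R: negating an empty index yields an empty result, not the full vector
empty_pick <- letters[-integer(0)]
print(length(empty_pick))

# Example data where no group mixes NA and non-NA values
dt_clean <- data.table(col1 = c("B", "D"), col2 = c("cat", NA))

# Row numbers to drop; empty here because nothing qualifies
idx <- dt_clean[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1

# Select the rows to keep instead of negating the rows to drop
keep_rows <- setdiff(seq_len(nrow(dt_clean)), idx)
res_safe <- dt_clean[keep_rows]
print(res_safe)
```

On the question's dt the two forms give the same result; they only differ when idx is empty.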
Seems quicker, though I'm sure someone is going to turn up with an even quicker version shortly:
set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
# user system elapsed
# 1.49 0.02 1.51
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
# user system elapsed
# 4.49 0.04 4.54