Removing groups with all NA in data.table or dplyr in R
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))
I have a tall data frame and want to remove every student ID whose 'score' values are all NA or whose 'time' values are all NA. This applies only when a column is entirely NA for that student; if only some values are NA, I want to keep all of that student's records.
Is this what you want?
library(dplyr)
dataHAVE %>%
group_by(student) %>%
filter(!all(is.na(score)))
student time score
<dbl> <dbl> <dbl>
1 1 1 7
2 1 2 9
3 1 3 5
4 3 1 NA
5 3 2 3
6 3 3 9
7 5 NA 7
8 5 2 NA
9 5 3 5
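Each student is kept only if not (!) all of its score values are NA. This checks score only; on the sample data that is enough, because student 4, the only student whose time values are all NA, also has all-NA score values.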
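Since the question asks about both columns, the same filter can be extended to time as well. The following is a minimal sketch, assuming the sample data from the question; the name result is only illustrative.

```r
library(dplyr)

dataHAVE <- data.frame(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

result <- dataHAVE %>%
  group_by(student) %>%
  # keep a student only if neither column is entirely NA within the group;
  # filter() combines comma-separated conditions with a logical AND
  filter(!all(is.na(time)), !all(is.na(score))) %>%
  ungroup()
```

Each condition evaluates to a single TRUE or FALSE per group, so a student's rows are kept or dropped together.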
Since nobody has suggested one, here is a solution using data.table:
library(data.table)
dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
Previous, incorrect code (it uses &, so it only drops students for whom both time and score are entirely NA, and student 2 is kept):
dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]
New, correct code (it uses |, so a student is dropped when either column is entirely NA):
dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]
Returns:
student time score
1: 1 1 7
2: 1 2 9
3: 1 3 5
4: 3 1 NA
5: 3 2 3
6: 3 3 9
7: 5 NA 7
8: 5 2 NA
9: 5 3 5
Updated the data.table solution following @Cole's suggestion.
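A common alternative way to write the same group filter in data.table is to return .SD from an if in j, so that groups failing the test return NULL and are dropped. This is a sketch under the assumption that the data matches the question; dt and res are illustrative names.

```r
library(data.table)

dt <- data.table(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

# For each student, return the group's rows (.SD) only when neither
# column is entirely NA; an if() without else yields NULL, which
# data.table treats as "no rows for this group"
res <- dt[, if (!(all(is.na(time)) || all(is.na(score)))) .SD, by = student]
```

Because the condition is a single value per group, the scalar operator || is sufficient here.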
Here is a base R solution using subset + ave:
dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))
or
dataWANT <- subset(dataHAVE,
!Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))
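The ave-based idea can also be split into a small helper, which may be easier to read. This is only a sketch: all_na_by_group is a hypothetical helper name, and the data is assumed to be the sample from the question.

```r
dataHAVE <- data.frame(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

# Hypothetical helper: for every row, is x entirely NA within that row's group g?
all_na_by_group <- function(x, g) {
  as.logical(ave(is.na(x), g, FUN = all))
}

# Rows belonging to students where either column is entirely NA
drop_row <- with(dataHAVE,
                 all_na_by_group(time, student) | all_na_by_group(score, student))

res <- dataHAVE[!drop_row, ]
rownames(res) <- NULL  # reset row names after subsetting
```

ave() broadcasts the per-group result back to every row of the group, which is why the result can be used directly as a row index.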
Another option:
library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]
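Create a dummy variable and filter on it. Note that this flags individual rows where both time and score are NA rather than whole groups, so student 2 (whose score values are all NA) is still present in the output: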
library("dplyr")
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataHAVE %>%
mutate(check=is.na(time)&is.na(score)) %>%
filter(check == FALSE) %>%
select(-check)
#> student time score
#> 1 1 1 7
#> 2 1 2 9
#> 3 1 3 5
#> 4 2 1 NA
#> 5 2 2 NA
#> 6 2 3 NA
#> 7 3 1 NA
#> 8 3 2 3
#> 9 3 3 9
#> 10 5 NA 7
#> 11 5 2 NA
#> 12 5 3 5
Created on 2020-02-21 by the reprex package (v0.3.0)
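To make the dummy-variable approach match the group-level rule in the question, the flag can be computed per student instead of per row. A minimal sketch, assuming the same sample data; drop_student and result are illustrative names.

```r
library(dplyr)

dataHAVE <- data.frame(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

result <- dataHAVE %>%
  group_by(student) %>%
  # one flag per student, recycled to all of that student's rows
  mutate(drop_student = all(is.na(time)) | all(is.na(score))) %>%
  ungroup() %>%
  filter(!drop_student) %>%
  select(-drop_student)
```

Keeping the flag as a column for a moment also makes it easy to inspect which students would be removed before filtering.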
A data.table solution that generalises to any number of columns:
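dataHAVE[,
# keep the group only if every column in .SD has at least one non-NA value;
# Reduce() is used because `+` accepts at most two arguments
.SD[Reduce(`+`, lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)],
by = student]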
# student time score
# 1: 1 1 7
# 2: 1 2 9
# 3: 1 3 5
# 4: 3 1 NA
# 5: 3 2 3
# 6: 3 3 9
# 7: 5 NA 7
# 8: 5 2 NA
# 9: 5 3 5
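If only some columns should be checked, one possible variation is to compute the set of students to keep first, restricting the check with .SDcols, and then subset the original table. This is a sketch under assumptions: cols, status, and keep_ids are hypothetical names, and the data is the sample from the question.

```r
library(data.table)

dt <- data.table(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

# Assumed choice: the columns whose all-NA groups should be removed
cols <- c("time", "score")

# One row per student: ok is TRUE when every checked column
# has at least one non-NA value for that student
status <- dt[, .(ok = all(sapply(.SD, function(x) any(!is.na(x))))),
             by = student, .SDcols = cols]

keep_ids <- status[ok == TRUE, student]
res <- dt[student %in% keep_ids]
```

Subsetting the original table by ID keeps any columns that were not part of the check.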