简体   繁体   English

在 Data.Table 或 R 中的 DPLYR 中删除所有 NA 的组

[英]Removing groups with all NA in Data.Table or DPLYR in R

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))



dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))

I have a tall dataframe and in that data frame I want to remove student IDS that contain NA for all 'score' or for all 'time'.我有一个高数据框,在该数据框中我想删除包含所有“分数”或所有“时间”的 NA 的学生 IDS。 This is just if it is all NA, if there are some NA then I want to keep all their records...这只是如果全部是 NA,如果有一些 NA 那么我想保留他们所有的记录......

Is this what you want?这是你想要的吗?

library(dplyr)

dataHAVE %>%
    group_by(student) %>%
    filter(!all(is.na(score)))

  student  time score
    <dbl> <dbl> <dbl>
1       1     1     7
2       1     2     9
3       1     3     5
4       3     1    NA
5       3     2     3
6       3     3     9
7       5    NA     7
8       5     2    NA
9       5     3     5

Each student is only kept if not ( ! ) all score values are NA每个student只保留如果不是( !all score值都是NA

Since nobody suggested one, here is a solution using data.table :由于没有人建议,这里是一个使用data.table的解决方案:

  library(data.table)
  dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                        "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                        "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

Edit:编辑:

Previous but wrong code:以前但错误的代码:

dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]

New and correct code:新的和正确的代码:

dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]

Returns:返回:

   student time score
1:       1    1     7
2:       1    2     9
3:       1    3     5
4:       3    1    NA
5:       3    2     3
6:       3    3     9
7:       5   NA     7
8:       5    2    NA
9:       5    3     5

Edit:编辑:

Updatet data.table solution with @Cole s suggestion...使用@Cole 的建议更新data.table解决方案...

Here is a base R solution using subset + ave这是使用subset + ave的基本 R 解决方案

dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))

or或者

dataWANT <- subset(dataHAVE,
                   !Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))

Another option:另外一个选项:

library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]

Create a dummy variable, and filter based on that创建一个虚拟变量,并根据它进行过滤

library("dplyr")

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                      "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                      "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

dataHAVE %>% 
  mutate(check=is.na(time)&is.na(score)) %>% 
  filter(check == FALSE) %>% 
  select(-check)
#>    student time score
#> 1        1    1     7
#> 2        1    2     9
#> 3        1    3     5
#> 4        2    1    NA
#> 5        2    2    NA
#> 6        2    3    NA
#> 7        3    1    NA
#> 8        3    2     3
#> 9        3    3     9
#> 10       5   NA     7
#> 11       5    2    NA
#> 12       5    3     5

Created on 2020-02-21 by the reprex package (v0.3.0)reprex 包(v0.3.0) 于 2020 年 2 月 21 日创建

data.table solution generalising to any number of columns: data.table解决方案推广到任意数量的列:

dataHAVE[, 
         .SD[do.call("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)], 
         by = student]

#    student time score
# 1:       1    1     7
# 2:       1    2     9
# 3:       1    3     5
# 4:       3    1    NA
# 5:       3    2     3
# 6:       3    3     9
# 7:       5   NA     7
# 8:       5    2    NA
# 9:       5    3     5

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM