Remove groups that are all NA in data.table or dplyr in R

Question

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))



dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))

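I have a tall data frame and want to remove every student ID whose "score" values are all NA or whose "time" values are all NA. This applies only when all of the values are NA; if only some are NA, I want to keep all of that student's records.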

Answer 1

Is this what you want?

library(dplyr)

dataHAVE %>%
    group_by(student) %>%
    filter(!all(is.na(score)))

  student  time score
    <dbl> <dbl> <dbl>
1       1     1     7
2       1     2     9
3       1     3     5
4       3     1    NA
5       3     2     3
6       3     3     9
7       5    NA     7
8       5     2    NA
9       5     3     5

Each student is kept only if not (!) all of its score values are NA.
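The filter above tests only score, which is enough for this sample because student 4 is missing both columns. A minimal sketch that also tests time, assuming the dataHAVE data frame from the question (the name result is just for illustration):

```r
library(dplyr)

dataHAVE <- data.frame(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

# Keep a student only if neither column is entirely NA within that group
result <- dataHAVE %>%
  group_by(student) %>%
  filter(!all(is.na(time)), !all(is.na(score))) %>%
  ungroup()

unique(result$student)
```

With the sample data this should leave students 1, 3 and 5.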

Answer 2

Since nobody has suggested it yet, here is a solution using data.table:

  library(data.table)
  dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                        "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                        "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

Edit:

Previous, incorrect code:

dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]

New, correct code:

dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]

Returns:

   student time score
1:       1    1     7
2:       1    2     9
3:       1    3     5
4:       3    1    NA
5:       3    2     3
6:       3    3     9
7:       5   NA     7
8:       5    2    NA
9:       5    3     5
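To see why the `&` version was wrong, it may help to run both conditions side by side. This is only an illustrative sketch on the question's data; the names dt, with_and and with_or are not from the answer.

```r
library(data.table)

dt <- data.table(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

# `&` drops a student only when BOTH columns are entirely NA
with_and <- dt[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]

# `|` drops a student when EITHER column is entirely NA
with_or <- dt[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]

unique(with_and$student)
unique(with_or$student)
```

Student 2 has valid time values but no score values, so the `&` version keeps it while the `|` version drops it, which is what the question asks for.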

Edit:

Updated the data.table solution following @Cole's suggestion...

Answer 3

Here is a base R solution using subset + ave

dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))

Or

dataWANT <- subset(dataHAVE,
                   !Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))
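The one-liners above are dense, so here is the same idea unpacked step by step. It is a sketch in base R only; the helper all_na_by_student and the variables drop and result are hypothetical names introduced for illustration.

```r
dataHAVE <- data.frame(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

# ave() applies FUN within each group and spreads the result back over
# the rows of that group, so every row gets its group's flag
all_na_by_student <- function(x, g) {
  as.logical(ave(is.na(x), g, FUN = all))
}

# A row is dropped if its student has time all NA or score all NA
drop <- all_na_by_student(dataHAVE$time, dataHAVE$student) |
  all_na_by_student(dataHAVE$score, dataHAVE$student)

result <- dataHAVE[!drop, ]
unique(result$student)
```

Computing the flag on is.na(x) rather than on x keeps the intermediate vector logical, which avoids relying on numeric 0/1 coercion.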

Answer 4

Another option:

library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]

Answer 5

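Create a dummy variable and filter on it. Note that the check here is per row, so only rows where both time and score are NA are dropped, and student 2 stays in the output.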

library("dplyr")

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                      "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                      "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

dataHAVE %>% 
  mutate(check=is.na(time)&is.na(score)) %>% 
  filter(check == FALSE) %>% 
  select(-check)
#>    student time score
#> 1        1    1     7
#> 2        1    2     9
#> 3        1    3     5
#> 4        2    1    NA
#> 5        2    2    NA
#> 6        2    3    NA
#> 7        3    1    NA
#> 8        3    2     3
#> 9        3    3     9
#> 10       5   NA     7
#> 11       5    2    NA
#> 12       5    3     5

Created on 2020-02-21 by the reprex package (v0.3.0)
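The dummy-variable idea can also be applied per student rather than per row, which matches the dataWANT shown in the question. A minimal sketch, assuming the same dataHAVE and reusing the column name check:

```r
library(dplyr)

dataHAVE <- data.frame(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5)
)

result <- dataHAVE %>%
  group_by(student) %>%
  # check is TRUE for every row of a student with an all-NA column
  mutate(check = all(is.na(time)) | all(is.na(score))) %>%
  ungroup() %>%
  filter(!check) %>%
  select(-check)

unique(result$student)
```

Because the flag is computed inside group_by(), it is constant within each student, so filtering on it removes whole students at once.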

Answer 6

A data.table solution that generalizes to any number of columns:

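# Keep a student only if every column in .SD has at least one non-NA value.
# Reduce() is used because do.call("+", ...) fails with more than two columns.
dataHAVE[, 
         .SD[Reduce(`+`, lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)], 
         by = student]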

#    student time score
# 1:       1    1     7
# 2:       1    2     9
# 3:       1    3     5
# 4:       3    1    NA
# 5:       3    2     3
# 6:       3    3     9
# 7:       5   NA     7
# 8:       5    2    NA
# 9:       5    3     5
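To illustrate the "any number of columns" case, here is a sketch with a third column. The grade column and its values are hypothetical, added only to show the pattern, and the approach (a per-student flag table plus %in%) is one of several ways to write it.

```r
library(data.table)

dt <- data.table(
  student = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  time    = c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
  score   = c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5),
  grade   = c(1,NA,2,3,3,3,NA,NA,NA,1,1,1,2,NA,NA)  # hypothetical extra column
)

# Columns to check; extend this vector to cover more columns
cols <- c("time", "score", "grade")

# One row per student: TRUE if every checked column has a non-NA value
flags <- dt[, .(ok = all(sapply(.SD, function(x) any(!is.na(x))))),
            by = student, .SDcols = cols]

keep_ids <- flags[ok == TRUE, student]
result <- dt[student %in% keep_ids]

unique(result$student)
```

In this made-up data student 3 has no grade values, so it is dropped along with students 2 and 4.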

6 answers

Answer 1: score 2, accepted, 2020-02-21 12:03:16
Answer 2: score 2, 2020-02-21 12:06:26
Answer 3: score 1, 2020-02-21 12:02:06
Answer 4: score 1, 2020-02-21 22:30:46
Answer 5: score 0, 2020-02-21 12:06:02
Answer 6: score 0, 2020-02-21 13:48:24
