Why does dplyr's filter drop NA values from a factor variable?

Question

When I use filter from the dplyr package to drop a level of a factor variable, filter also drops the NA values. Here's an example:

library(dplyr)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
#    var1
# 1  <NA>
# 2     3
# 3     3
# 4     1
# 5     1
# 6  <NA>
# 7     2
# 8     2
# 9  <NA>
# 10    1

filter(dat, var1 != 1)
#   var1
# 1    3
# 2    3
# 3    2
# 4    2

This does not seem ideal; I only wanted to drop rows where var1 == 1.

It looks like this is occurring because any comparison with NA returns NA, and filter drops rows where the condition is NA as well as rows where it is FALSE. So, for example, filter(dat, !(var1 %in% 1)) produces the correct results, because %in% never returns NA. But is there a way to tell filter not to drop the NA values?
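To make the difference between the two conditions concrete, here is a minimal base R sketch. The small factor below is made up for illustration and is not the data from the example above.

```r
# Illustrative factor (not the question's data): one NA and the levels 1, 2, 3
var1 <- factor(c(NA, 3, 1, 2))

# A comparison against NA yields NA, so the first element is NA, not TRUE
cmp <- var1 != 1

# %in% is built on match(), which returns FALSE rather than NA for missing values
via_in <- !(var1 %in% 1)

print(cmp)
print(via_in)
```

filter keeps only the rows where the condition is TRUE, so the NA in the first vector is what causes the row to disappear.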

Answer 1

You could use this:

filter(dat, var1 != 1 | is.na(var1))
#   var1
# 1 <NA>
# 2    3
# 3    3
# 4 <NA>
# 5    2
# 6    2
# 7 <NA>

With the explicit is.na(var1) condition, the NA rows are kept.

Also, just for completeness, dropping NAs is the intended behavior of filter, as you can see from the following:

test_that("filter discards NA", {
  temp <- data.frame(
    i = 1:5,
    x = c(NA, 1L, 1L, 0L, 0L)
  )
  res <- filter(temp, x == 1)
  expect_equal(nrow(res), 2L)
})

The test above was taken from the tests for filter on GitHub.

Answer 2

The answers previously given are good, but when your filter statement involves a function of many fields, the workaround might not be so great. Also, who wants to use mapply with the non-vectorized identical? Here is another, somewhat simpler solution using coalesce:

filter(dat, coalesce(var1 != 1, TRUE))
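As a rough sketch of why this works: coalesce replaces each NA in the logical condition with TRUE, so those rows pass the filter. The data frame below is typed in by hand to mirror the question's sample rather than generated with set.seed, so it does not depend on the R version.

```r
library(dplyr, warn.conflicts = FALSE)

# Hand-typed copy of the question's sample data (an assumption, to avoid
# depending on the random number generator)
dat <- data.frame(var1 = factor(c(NA, 3, 3, 1, 1, NA, 2, 2, NA, 1)))

# The raw condition is NA wherever var1 is NA
cond <- dat$var1 != 1

# coalesce() fills those NAs with TRUE, turning "unknown" into "keep"
keep <- coalesce(cond, TRUE)

res <- filter(dat, coalesce(var1 != 1, TRUE))
print(res)
```

Passing FALSE instead of TRUE as the fallback would reproduce filter's default behaviour of dropping the NA rows.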

Answer 3

I often map identical with mapply...

(Note: I believe that because of changes in R 3.6.0, set.seed and sample end up producing different test data.)

library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
#>    var1
#> 1     3
#> 2     1
#> 3  <NA>
#> 4     3
#> 5     1
#> 6     3
#> 7     2
#> 8     3
#> 9     2
#> 10    1

filter(dat, var1 != 1)
#>   var1
#> 1    3
#> 2    3
#> 3    3
#> 4    2
#> 5    3
#> 6    2

filter(dat, !mapply(identical, as.numeric(var1), 1))
#>   var1
#> 1    3
#> 2 <NA>
#> 3    3
#> 4    3
#> 5    2
#> 6    3
#> 7    2
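One caveat worth checking before reusing the as.numeric(var1) pattern: as.numeric on a factor returns the internal level codes, not the labels. In the example above the two coincide because the levels are "1", "2", "3". The sketch below uses a made-up factor where they differ, and compares on the character labels instead.

```r
# Illustrative factor whose labels differ from its internal codes
f <- factor(c(10, 20, NA, 10))

codes  <- as.numeric(f)                # level codes: 1 2 NA 1
labels <- as.numeric(as.character(f))  # original values: 10 20 NA 10

# Comparing on the labels avoids the mismatch; USE.NAMES = FALSE keeps the
# result an unnamed logical vector
keep <- !mapply(identical, as.character(f), "10", USE.NAMES = FALSE)

print(codes)
print(labels)
print(keep)
```

Here identical(NA_character_, "10") is FALSE, so the NA element is kept, which is the behaviour this answer relies on.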

It works for numerics and strings as well (probably the more common use case)...

library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = sample(c(1:3, NA), size = 10, replace = T),
                   var2 = letters[sample(c(1:3, NA), size = 10, replace = T)],
                   stringsAsFactors = FALSE))
#>    var1 var2
#> 1     3 <NA>
#> 2     1    a
#> 3    NA    a
#> 4     3    b
#> 5     1    b
#> 6     3 <NA>
#> 7     2    a
#> 8     3    c
#> 9     2 <NA>
#> 10    1    b

filter(dat, !mapply(identical, var1, 1L))
#>   var1 var2
#> 1    3 <NA>
#> 2   NA    a
#> 3    3    b
#> 4    3 <NA>
#> 5    2    a
#> 6    3    c
#> 7    2 <NA>

filter(dat, !mapply(identical, var2, 'a'))
#>   var1 var2
#> 1    3 <NA>
#> 2    3    b
#> 3    1    b
#> 4    3 <NA>
#> 5    3    c
#> 6    2 <NA>
#> 7    1    b
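Note that identical is strict about type, which is presumably why the integer column is compared against 1L rather than 1. A small sketch with a made-up integer vector:

```r
# Illustrative integer vector with one missing value
x <- c(3L, 1L, NA, 2L)

# Integer compared with integer: only the exact match is TRUE, NA gives FALSE
strict_int <- mapply(identical, x, 1L)

# Integer compared with double: identical(1L, 1) is FALSE, so nothing matches
strict_dbl <- mapply(identical, x, 1)

print(strict_int)
print(strict_dbl)
```

If the literal's type does not match the column's type, the negated condition keeps every row, so it is worth confirming the column type with class() first.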
