
How to omit rows with NA in only two columns in R?

I want to omit rows where NA appears in both of two columns.

I'm familiar with na.omit, is.na, and complete.cases, but can't figure out how to use these to get what I want. For example, I have the following dataframe:

(df <- structure(list(x = c(1L, 2L, NA, 3L, NA),
                     y = c(4L, 5L, NA, 6L, 7L),
                     z = c(8L, 9L, 10L, 11L, NA)),
                .Names = c("x", "y", "z"),
                class = "data.frame",
                row.names = c(NA, -5L)))
x   y   z
1   4   8
2   5   9
NA  NA  10
3   6   11
NA  7   NA

and I want to remove only those rows where NA appears in both the x and y columns (excluding anything in z), to give

x   y   z
1   4   8
2   5   9
3   6   11
NA  7   NA

Does anyone know an easy way to do this? Using na.omit, is.na, or complete.cases directly is not working.
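For example, applying them directly is too aggressive: na.omit drops every row containing an NA anywhere, and complete.cases restricted to x and y still requires both columns to be non-NA, so both also remove the last row that I want to keep:

na.omit(df)
#   x y  z
# 1 1 4  8
# 2 2 5  9
# 4 3 6 11

df[complete.cases(df[c("x", "y")]), ]
#   x y  z
# 1 1 4  8
# 2 2 5  9
# 4 3 6 11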

df[!with(df, is.na(x) & is.na(y)), ]
#    x y  z
# 1  1 4  8
# 2  2 5  9
# 4  3 6 11
# 5 NA 7 NA
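The condition keeps every row where x and y are not both NA. If you prefer subset, the same logic can be written as follows (just an equivalent base-R spelling of the line above):

subset(df, !(is.na(x) & is.na(y)))
#    x y  z
# 1  1 4  8
# 2  2 5  9
# 4  3 6 11
# 5 NA 7 NA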

I benchmarked on a slightly larger dataset. Here are the results:

set.seed(237)
df <- data.frame(x = sample(c(NA, 1:20), 1e6, replace = TRUE),
                 y = sample(c(NA, 1:10), 1e6, replace = TRUE),
                 z = sample(c(NA, 5:15), 1e6, replace = TRUE))

f1 <- function() df[!with(df,is.na(x)& is.na(y)),]
f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
f3 <- function()  df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] 

library(microbenchmark)

microbenchmark(f1(), f2(), f3(), unit="relative")
# Unit: relative
#  expr       min        lq    median        uq       max neval
#  f1()  1.000000  1.000000  1.000000  1.000000  1.000000   100
#  f2()  1.044812  1.068189  1.138323  1.129611  0.856396   100
#  f3() 26.205272 25.848441 24.357665 21.799930 22.881378   100

Use rowSums with is.na, like this:

> df[rowSums(is.na(df[c("x", "y")])) != 2, ]
   x y  z
1  1 4  8
2  2 5  9
4  3 6 11
5 NA 7 NA
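To see why the != 2 works: is.na(df[c("x", "y")]) is a logical matrix, and rowSums counts the NAs per row across those two columns, so a value of 2 marks exactly the rows where both are missing:

rowSums(is.na(df[c("x", "y")]))
# 1 2 3 4 5 
# 0 0 2 0 1

Keeping the rows where that count is not 2 generalizes directly: for k columns, compare the count against k.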

Jumping on the benchmarking bandwagon, and to demonstrate what I meant about this being a fairly easy-to-generalize solution, consider the following:

## Sample data with 10 columns and 1 million rows
set.seed(123)
df <- data.frame(replicate(10, sample(c(NA, 1:20), 
                                      1e6, replace = TRUE)))

First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.

f1 <- function() {
  df[!with(df, is.na(X1) & is.na(X2)), ]
} 
f2 <- function() {
  df[rowSums(is.na(df[1:2])) != 2, ]
} 

library(microbenchmark)
microbenchmark(f1(), f2(), times = 20)
# Unit: milliseconds
#  expr      min       lq   median       uq      max neval
#  f1() 745.8378 1100.764 1128.047 1199.607 1310.236    20
#  f2() 784.2132 1101.695 1125.380 1163.675 1303.161    20

Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.

f1_5 <- function() {
  df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
             is.na(X4) & is.na(X5)), ]
} 
f2_5 <- function() {
  df[rowSums(is.na(df[1:5])) != 5, ]
} 

microbenchmark(f1_5(), f2_5(), times = 20)
# Unit: seconds
#    expr      min       lq   median       uq      max neval
#  f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772    20
#  f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915    20
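If the column set varies, the rowSums idea also wraps neatly into a small helper (a sketch only; drop_all_na is just an illustrative name, not part of the benchmarked code):

# keep rows where at least one of the chosen columns is non-NA
drop_all_na <- function(data, cols) {
  data[rowSums(is.na(data[cols])) < length(cols), ]
}

drop_all_na(df, c("X1", "X2"))     # same rows as f2()
drop_all_na(df, paste0("X", 1:5))  # same rows as f2_5()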

You can use apply to slice up the rows:

sel <- apply( df, 1, function(x) sum(is.na(x))>1 )

Since sel flags the rows where more than one value is NA, negate it to drop those rows:

df[ !sel, ]

To ignore the z column, just omit it from the apply:

sel <- apply( df[,c("x","y")], 1, function(x) sum(is.na(x))>1 )

If they all have to be TRUE, just change the function up a little:

sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
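Putting that together for the original question, negate the selection to drop the row where both x and y are NA:

sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
df[ !sel, ]
#    x y  z
# 1  1 4  8
# 2  2 5  9
# 4  3 6 11
# 5 NA 7 NA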

The other solutions here are more specific to this particular problem, but apply is worth learning as it solves many other problems. The cost is speed (usual caveats about small datasets and speed testing apply):

> microbenchmark( df[!with(df,is.na(x)& is.na(y)),], df[rowSums(is.na(df[c("x", "y")])) != 2, ], df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] )
Unit: microseconds
                                              expr     min       lq   median       uq      max neval
              df[!with(df, is.na(x) & is.na(y)), ]  67.148  71.5150  76.0340  86.0155 1049.576   100
        df[rowSums(is.na(df[c("x", "y")])) != 2, ] 132.064 139.8760 145.5605 166.6945  498.934   100
 df[apply(df, 1, function(x) sum(is.na(x)) > 1), ] 175.372 184.4305 201.6360 218.7150  321.583   100

dplyr solution

require("dplyr")
df %>% filter_at(.vars = vars(x, y), .vars_predicate = any_vars(!is.na(.)))

This can be modified to take any number of columns using the .vars argument; see the sketch below.
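For instance, a sketch where the columns to check are held in a character vector (cols is just a placeholder name; note that filter_at is superseded in current dplyr, see the update below):

cols <- c("x", "y")
df %>% filter_at(.vars = vars(all_of(cols)), .vars_predicate = any_vars(!is.na(.)))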


Update: dplyr 1.0.4

df %>%
  filter(!if_all(c(x, y), is.na))

See similar answer: https://stackoverflow.com/a/66136167/6105259
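An equivalent spelling uses if_any, the counterpart of if_all (just an alternative formulation, not taken from the linked answer): keep rows where at least one of the two columns is non-NA.

df %>%
  filter(if_any(c(x, y), ~ !is.na(.x)))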

This is also a very basic dplyr solution:

library(dplyr)

df %>%
  filter(!(is.na(x) & is.na(y)))

   x y  z
1  1 4  8
2  2 5  9
3  3 6 11
4 NA 7 NA
