Translating filter_all(any_vars()) to base R

I have a data frame of numbers. I want to keep the rows in which at least one column contains a given value, checking every column.

With dplyr, this can be written as:

library(dplyr)

set.seed(1)

# 50 values fill the 10 x 5 matrix
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))

# keep rows where any column matches 0.5
dfn <- df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .)))

Does anyone know what the base R version of this code would be? Any help is very much appreciated.

1) Apply grepl to each column with sapply, then keep the rows whose row sum is positive:

df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ]

2) A variation is to use lapply instead of sapply and do.call/pmax instead of rowSums:

df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ]

3) A third way uses max.col to pick, for each row, a column holding the row maximum, which is TRUE whenever any column matched:

s <- sapply(df, grepl, pattern = 0.5)
df[s[cbind(1:nrow(s), max.col(s))], ]

4) Reduce with | can also be used:

df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ]
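To make the intermediate values visible, here is a minimal sketch that builds the row selector of each approach separately and checks that they agree. The names `s` and `k1` to `k4` are only for illustration, and the data frame is built as in the question but with 50 draws, which is all a 10 x 5 matrix can hold.

```r
# Same construction as in the question, drawing only the 50 values
# that a 10 x 5 matrix can hold
set.seed(1)
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))

# Logical matrix: one row per row of df, one column per variable
s <- sapply(df, grepl, pattern = 0.5)

# Row selectors built by each of the four approaches
k1 <- rowSums(s) > 0
k2 <- do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0
k3 <- s[cbind(1:nrow(s), max.col(s))]
k4 <- Reduce(`|`, lapply(df, grepl, pattern = 0.5))

# All four selectors should flag the same rows
stopifnot(all(k1 == k2), all(k1 == k3), all(k1 == k4))

df[k1, ]
```

Note that (1) and (3) rely on sapply returning a matrix. For a data frame with a single row, sapply simplifies to a vector instead, so the lapply-based variants (2) and (4) are likely the safer choice there.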

Benchmark

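Below we compare the speeds of the various solutions, including the apply-based answer as p5. P0 is the dplyr solution from the question and is by far the slowest, with a median of roughly 150 ms against about 0.5 to 1 ms for the others. The base R solutions are not significantly different from one another according to the cld column, although (4) gave the lowest runtimes in the run shown, followed by (2) on the median.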

library(microbenchmark)

microbenchmark(
P0 = df |> dplyr::filter_all (dplyr::any_vars (grepl (0.5,.))),
p1 = df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ],
p2 = df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ],
p3 = { s <- sapply(df, grepl, pattern = 0.5)
       df[s[cbind(1:nrow(s), max.col(s))], ]},
p4 = df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ],
p5 = { has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
        df[has_0.5, ]}
)

giving

Unit: microseconds
 expr      min       lq       mean   median        uq      max neval cld
   P0 140597.8 142671.0 173710.712 151614.6 173295.00 487564.7   100   b
   p1    544.4    572.3   1838.821    593.8    623.15 117795.9   100  a 
   p2    485.3    502.2    946.143    514.8    567.15  34891.1   100  a 
   p3    607.9    631.6    766.101    655.6    719.10   3177.0   100  a 
   p4    454.6    473.8    592.819    486.0    538.30   1518.8   100  a 
   p5    945.9    980.4   1344.161   1013.2   1107.80  23137.1   100  a 

One possibility (p5 in the benchmark above) is to test each row with apply:

has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
df[has_0.5, ]
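One caveat applies to every grepl-based version above, including the dplyr original: grepl coerces the numeric pattern to the string "0.5" and treats it as a regular expression. The sketch below illustrates what that implies. The `toy` data frame is hypothetical, and the numeric comparison at the end is an alternative that assumes exact equality with 0.5 is what is wanted, which the question does not state.

```r
# grepl coerces a numeric pattern with as.character(), so 0.5 becomes "0.5"
# and the dot acts as a regular-expression wildcard
grepl(0.5, "0.5")   # literal match
grepl(0.5, "0x5")   # also matches: "." stands for any character
grepl(0.5, 0.55)    # matches too: "0.55" contains "0.5"

# fixed = TRUE treats the pattern as a literal string
grepl(0.5, "0x5", fixed = TRUE)

# Assumption: only exact equality with 0.5 is wanted, so a numeric
# comparison can replace the string matching altogether
toy <- data.frame(X1 = c(0.5, 0.55), X2 = c(0.1, 0.2))  # hypothetical data
toy[rowSums(toy == 0.5) > 0, ]
```

With values rounded to one digit, as in the question, the two approaches should select the same rows. They can differ once the data has more digits, as the 0.55 case shows.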
