I have a data frame of numbers. I want to subset it to the rows in which any column matches a given value.
One could use dplyr to write the following code:
library(dplyr)
set.seed(1)
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))
dfn <- df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .)))
Does anyone know what the base R version of this code would be? Any help is very much appreciated.
1) Apply grepl to each column with sapply, then keep the rows whose number of matches is positive:
df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ]
2) A variation is to use lapply instead of sapply and do.call/pmax instead of rowSums:
df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ]
3) A third way can be fashioned out of max.col, which gives the column index of each row's maximum (a TRUE cell whenever the row has one):
s <- sapply(df, grepl, pattern = 0.5)
df[s[cbind(1:nrow(s), max.col(s))], ]
4) Reduce with | can be used:
df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ]
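To see why the four variants select the same rows, it may help to look at the intermediate values on a tiny example. The sketch below uses a made-up data frame `toy` (not the data from the question); the names `s`, `cols` and `k1` to `k4` are only for illustration.

```r
# Toy data; the values are made up for illustration
toy <- data.frame(a = c(0.5, 0.1, 0.2),
                  b = c(0.3, 0.5, 0.4))

# Logical matrix with one column per column of toy:
# row 1: TRUE FALSE, row 2: FALSE TRUE, row 3: FALSE FALSE
s <- sapply(toy, grepl, pattern = 0.5)

# The same flags, kept as a list of logical vectors
cols <- lapply(toy, grepl, pattern = 0.5)

k1 <- rowSums(s) > 0                    # (1) count the TRUEs in each row
k2 <- do.call("pmax", cols) > 0         # (2) parallel maximum across columns
k3 <- s[cbind(1:nrow(s), max.col(s))]   # (3) look up each row's maximum cell
k4 <- Reduce(`|`, cols)                 # (4) elementwise OR across columns

# Each selector flags rows 1 and 2 only
stopifnot(all(k1 == k2), all(k1 == k3), all(k1 == k4))

toy[k1, ]
```

In (3), max.col breaks ties at random by default, but that only matters for rows that are all FALSE (or have several TRUEs), and the looked-up cell has the same value whichever column is chosen.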
Below we compare the speeds of the various solutions. P0 is the solution in the question and is by far the slowest; p5 is the apply-based solution from the other answer. The rest do not differ significantly from one another (they share the letter a in the cld column), although (4) gave the lowest runtimes on every metric in this run, with (2) second by median.
library(microbenchmark)
microbenchmark(
  P0 = df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .))),
  p1 = df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ],
  p2 = df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ],
  p3 = {
    s <- sapply(df, grepl, pattern = 0.5)
    df[s[cbind(1:nrow(s), max.col(s))], ]
  },
  p4 = df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ],
  p5 = {
    has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
    df[has_0.5, ]
  }
)
giving
Unit: microseconds
expr      min       lq       mean   median        uq      max neval cld
  P0 140597.8 142671.0 173710.712 151614.6 173295.00 487564.7   100   b
  p1    544.4    572.3   1838.821    593.8    623.15 117795.9   100   a
  p2    485.3    502.2    946.143    514.8    567.15  34891.1   100   a
  p3    607.9    631.6    766.101    655.6    719.10   3177.0   100   a
  p4    454.6    473.8    592.819    486.0    538.30   1518.8   100   a
  p5    945.9    980.4   1344.161   1013.2   1107.80  23137.1   100   a
One possibility is to test each row with apply:
has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
df[has_0.5, ]
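One caveat applies to every grepl-based solution in this thread: grepl converts both the pattern and the values to character, treats the pattern as a regular expression (so the dot matches any character), and matches substrings. That is harmless for the question's data, which only contains values rounded to one decimal between 0 and 1, but it can over-match on other data. If exact numeric equality is what is wanted (an assumption, since the question does not say), comparing with == is an alternative. The sketch below uses made-up values; `x`, `toy`, `exact` and `regex` are illustrative names only.

```r
# Made-up values, not from the question's data
x <- c(0.5, 0.55, 2.035)

grepl(0.5, x)   # all TRUE: "0.55" contains "0.5", and "035" matches the regex 0.5
x == 0.5        # TRUE only for the first element

toy <- data.frame(a = c(0.5, 0.55, 0.2),
                  b = c(0.1, 0.3, 2.035))

# Exact comparison: toy == 0.5 is a logical matrix, so rowSums counts matches
exact <- toy[rowSums(toy == 0.5) > 0, ]

# Regex matching, as in solution (4)
regex <- toy[Reduce(`|`, lapply(toy, grepl, pattern = 0.5)), ]

nrow(exact)   # 1
nrow(regex)   # 3
```

Note that == compares doubles exactly, so it is only reliable when the target value is stored exactly as written, as 0.5 is; for values that come out of arithmetic, a tolerance-based comparison may be the safer choice.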