While updating my own answer to another thread, I wasn't able to come up with a good solution to replace the last example (see below). The idea is to get all rows where any column contains a certain string, in my example "V".
library(tidyverse)
#get all rows where any column contains 'V'
diamonds %>%
filter_all(any_vars(grepl('V',.))) %>%
head
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# naturally, this does not give the desired output!
diamonds %>%
filter(across(everything(), ~ grepl('V', .))) %>%
head
#> # A tibble: 0 x 10
I found a thread where the poster ponders similar stuff, but applying similar logic to grepl does not work.
### don't run, this is ugly and does not work
diamonds %>%
rowwise %>%
filter(any(grepl("V", across(everything())))) %>%
head
This is difficult, because the example shows that you want to keep a row when any of its columns meets the condition (i.e. you want a union). That's done with filter_all() and any_vars().
By contrast, filter(across(everything(), ...)) keeps a row only when all of its columns meet the condition (i.e. this is an intersection, quite the opposite of the previous).
To convert it from intersection to the union (ie to get again rows where any of the columns meet the condition), you probably need to check the row sum for that:
diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
It will sum all the TRUEs that appear in the row, i.e. if there is at least one value meeting the condition, that row sum will be > 0 and the row will be kept.
I'm sorry that across() is not the very first child of filter(), but it's at least some idea of how to do that. :-)
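To make the row-sum idea easier to follow, here is a minimal sketch on a made-up three-row tibble. The names toy, hits, n_hits and kept are hypothetical and not part of the question; only the filter() call mirrors the one above.

```r
library(dplyr)

# Hypothetical toy data, not taken from the question
toy <- tibble(
  a = c("V1", "x", "y"),
  b = c("p",  "q", "Vz"),
  n = c(1, 2, 3)
)

# One logical column per original column: does the cell contain "V"?
hits <- toy %>% transmute(across(everything(), ~ grepl("V", .x)))

# Number of matching cells in each row (TRUE counts as 1, FALSE as 0)
n_hits <- unname(rowSums(hits))
n_hits  # 1 0 1

# Rows with at least one match survive the filter
kept <- toy %>% filter(rowSums(across(everything(), ~ grepl("V", .x))) > 0)
kept$n  # 1 3
```

The second row has no cell containing "V", so its row sum is 0 and it is dropped.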
Evaluation:
Using @TimTeaFan's method to check that:
identical(
{diamonds %>%
filter_all(any_vars(grepl('V',.)))
},
{diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
}
)
#> [1] TRUE
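For readers on dplyr 1.0.4 or later (newer than the version used in this thread), if_any() and if_all() express the union and the intersection directly. The following is only a sketch on hypothetical toy data (toy and the result names are made up for illustration), checking that if_any() selects the same rows as the row-sum approach:

```r
library(dplyr)

# Hypothetical toy data, not taken from the question
toy <- tibble(
  a = c("V1", "V2", "x"),
  b = c("Va", "y",  "z")
)

# Union: keep rows where at least one column contains "V"
via_if_any  <- toy %>% filter(if_any(everything(), ~ grepl("V", .x)))
via_rowsums <- toy %>% filter(rowSums(across(everything(), ~ grepl("V", .x))) > 0)

# Intersection: keep rows where every column contains "V"
via_if_all  <- toy %>% filter(if_all(everything(), ~ grepl("V", .x)))

via_if_any$a   # "V1" "V2"
via_rowsums$a  # "V1" "V2"
via_if_all$a   # "V1"
```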
Benchmark:
As per our discussion under TimTeaFan's answer, here is a comparison. Surprisingly, all solutions take a similar time:
library(tidyverse)
microbenchmark::microbenchmark(
filter_all = {diamonds %>%
filter_all(any_vars(grepl('V',.)))},
purrr_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))},
base_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% Reduce(`|`, .))},
rowsums = {diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)},
times = 100L,
check = "identical"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> filter_all 295.7235 302.1311 309.6455 305.0491 310.0335 449.3619 100
#> purrr_reduce 297.8220 302.4411 310.2829 306.2929 312.2278 461.0194 100
#> base_reduce 298.5033 303.6170 309.4147 306.1839 312.3518 409.5273 100
#> rowsums 295.3863 301.0281 307.8517 305.3142 309.4793 372.8867 100
Created on 2020-07-14 by the reprex package (v0.3.0)
This is the equivalent of the filter_all call you posted. However, @akrun is totally correct to point out that the columns should be converted to character first. Nevertheless, this also holds true for your filter_all statement.
The idea is to use across(everything(), ~ grepl('V', .)) to transform the whole data.frame into columns of TRUE and FALSE according to grepl('V', .). However, filter needs a vector or a data.frame with one column, so we transform it by using reduce(`|`). It combines the first two columns with |, then the result of this call with the third column and so on, until only a single column of TRUE and FALSE remains, which can then be used to filter the rows.
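The folding step can be sketched in base R without any data frame. The list cols below is a hypothetical stand-in for the logical columns that across() and grepl() would produce; base Reduce() is used here, and purrr::reduce() folds in the same left-to-right order.

```r
# Hypothetical logical columns, standing in for the output of across() + grepl()
cols <- list(
  a = c(TRUE,  FALSE, FALSE),
  b = c(FALSE, FALSE, TRUE),
  c = c(FALSE, FALSE, FALSE)
)

# What the reduction does, written out by hand
step1 <- cols$a | cols$b   # first two columns
step2 <- step1 | cols$c    # previous result combined with the third column

# Reduce() performs the same left-to-right folding
folded <- Reduce(`|`, cols)

identical(step2, folded)  # TRUE
folded                    # TRUE FALSE TRUE
```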
library(ggplot2)
library(dplyr)
diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`)) %>%
head
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
identical({diamonds %>%
filter_all(any_vars(grepl('V',.)))},
{diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))
})
#> [1] TRUE
Created on 2020-07-14 by the reprex package (v0.3.0)
Some of the columns are ordered factors, and that affects c_across. Instead, if we convert them to character class and then apply grepl, it should work:
library(dplyr)
library(ggplot2)
diamonds %>%
head %>%
mutate(across(where(is.factor), as.character)) %>%
rowwise %>%
filter(any(grepl("V", c_across(where(is.character)))))
# A tibble: 3 x 10
# Rowwise:
# carat cut color clarity depth table price x y z
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
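The same conversion can be tried on a smaller, made-up tibble. This is only a sketch: toy and its values are hypothetical, and the trailing ungroup() is an addition that drops the rowwise grouping once the filter is done.

```r
library(dplyr)

# Hypothetical toy data with a plain factor, an ordered factor and a number
toy <- tibble(
  cut     = factor(c("Good", "Very Good", "Fair")),
  clarity = factor(c("SI1", "SI2", "VS1"), ordered = TRUE),
  price   = c(326, 327, 334)
)

kept <- toy %>%
  # turn both kinds of factor into plain character columns
  mutate(across(where(is.factor), as.character)) %>%
  rowwise() %>%
  # per row: does any character cell contain "V"?
  filter(any(grepl("V", c_across(where(is.character))))) %>%
  ungroup()

kept$price  # 327 334
```

The first row ("Good", "SI1") contains no "V" and is dropped; the other two match through "Very Good" and "VS1" respectively.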