Filter_all() with a negate str_detect approach

Question

Hello dear colleagues,

First things first: this is my first time asking a question here, so I hope I'll be clear. I'm currently facing some challenges when dealing with many data frames of variable dimensions and irregular column names. The challenge is to remove unwanted rows (here, rows for samples sequenced by whole genome shotgun sequencing) that match any of several keywords; a single keyword would be too easy... For that purpose I'm using filter_all(any_vars(str_detect(., "WGS"))). However, trying to negate the code with negate=T or !str_detect() returns the whole data frame, and nothing seems to work. Using all_vars() removes every row in the df. I came up with a solution, but I find it quite heavy and I'm pretty sure there is a better way to do this:

> tmp <- metadata[["PRJNA237362"]]  
> no <- tmp %>% filter_all(any_vars(str_detect(., "WGS")))  
> final <- tmp[tmp$Run %notin% no$Run,]

I'm not very familiar with the tidyverse and still have a lot to learn, so I might have missed something here. I don't understand why filter returns the whole df when the expression is negated.

Thanks to whoever answers. Have a great day. Rémy

A reproducible example of what I'm dealing with:

> data(msleep)  
> msleep%>% filter_all(any_vars(str_detect(., "omni"))) %>% glimpse()  
> msleep%>% filter_all(any_vars(str_detect(., "omni", negate=T))) %>% glimpse()   
> no <- msleep %>% filter_all(any_vars(str_detect(., "omni"))) %>% glimpse()  
> yes <- msleep[msleep$vore %notin% no$vore,] %>% glimpse()

Here is part of the df I'm actually working on:

> df = structure(list(Run = c("ERR2804817", "ERR2804818", "ERR2804819", 
"ERR2804820", "ERR2804821", "ERR2834367", "ERR2834371", "ERR2834373", 
"ERR2834374", "ERR2834375", "ERR2834376", "ERR2834377", "ERR2834379", 
"ERR2828323", "ERR2828326", "ERR2828327", "ERR2828328", "ERR2828330"
), LibraryLayout = c("PAIRED", "PAIRED", "PAIRED", "PAIRED", 
"PAIRED", "PAIRED", "PAIRED", "PAIRED", "PAIRED", "PAIRED", "PAIRED", 
"PAIRED", "PAIRED", "SINGLE", "SINGLE", "SINGLE", "SINGLE", "SINGLE"
), Library.Name = c("Bangladeshi_2yr", "Bangladeshi_2yr", "Bangladeshi_2yr", 
"Bangladeshi_2yr", "Bangladeshi_2yr", "table S7A,B; WGS", "table S7A,B; WGS", 
"table S7A,B; WGS", "table S7A,B; WGS", "table S7A,B; WGS", "table S7A,B; WGS", 
"table S7A,B; WGS", "table S7A,B; WGS", "table S12", "table S12", 
"table S12", "table S12", "table S12"), LibrarySource = c("METAGENOMIC", 
"METAGENOMIC", "METAGENOMIC", "METAGENOMIC", "METAGENOMIC", "GENOMIC", 
"GENOMIC", "GENOMIC", "GENOMIC", "GENOMIC", "GENOMIC", "GENOMIC", 
"GENOMIC", "METATRANSCRIPTOMIC", "METATRANSCRIPTOMIC", "METATRANSCRIPTOMIC", 
"METATRANSCRIPTOMIC", "METATRANSCRIPTOMIC"), Instrument = c("Illumina MiSeq", 
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", 
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", 
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", 
"NextSeq 500", "NextSeq 500", "NextSeq 500", "NextSeq 500", "NextSeq 500"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 73L, 74L, 75L, 76L, 77L, 
78L, 79L, 80L, 806L, 807L, 808L, 809L, 810L), class = "data.frame")

> #Here is what I have for now
> `%notin%` = Negate(`%in%`)
> tmp = meta %>% filter_all(any_vars(str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS")))
> meta = meta[meta$Run %notin% tmp$Run,]

Ultimately, I would like to do something like this:

> tmp = meta %>% filter_all(any_vars(!str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS")))
> #OR this version
> tmp = meta %>% filter_all(any_vars(str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS", negate=T)))

The catch is that I can't predict the column names or the dimensions of my df, so I wrote for() loops with conditions to detect the patterns, remove the matching rows and write a new file with the cleaned df.
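
The loop described above could be sketched roughly as follows, assuming the data frames live in a named list (as metadata[["PRJNA237362"]] suggests). The helper name drop_matching_rows() and the toy list metadata_demo are hypothetical, and base R grepl() is used in place of str_detect() so that NA cells simply count as non-matches:

```r
# Hypothetical helper: drop every row in which any column matches `pattern`.
# grepl() returns FALSE for NA, so a row containing NA is kept unless another
# cell in the same row matches.
drop_matching_rows <- function(df, pattern) {
  hits_by_column <- lapply(df, function(col) grepl(pattern, as.character(col)))
  row_has_hit <- Reduce(`|`, hits_by_column, rep(FALSE, nrow(df)))
  df[!row_has_hit, , drop = FALSE]
}

# Toy stand-in for the real list of per-project data frames (made-up values).
metadata_demo <- list(
  projA = data.frame(Run = c("r1", "r2", "r3"),
                     Library.Name = c("table S7A,B; WGS", NA, "table S12"),
                     stringsAsFactors = FALSE),
  projB = data.frame(id = c("x1", "x2"),
                     strategy = c("AMPLICON", "Whole genome shotgun"),
                     platform = c("MiSeq", "MiSeq"),
                     stringsAsFactors = FALSE)
)

pattern <- "shotgun|WGS|whole genome|WXS"

# Works regardless of column names or dimensions of each data frame.
cleaned <- lapply(metadata_demo, drop_matching_rows, pattern = pattern)
```

Each element of cleaned could then be written to its own file, for example with write.csv().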

For now my code is working but I'm sure there is a better way to do it.

Thanks a lot.

> packageVersion("tidyverse")
[1] ‘1.3.0’
> packageVersion("dplyr")
[1] ‘1.0.5’
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rmdformats_1.0.1 ggpubr_0.4.0     forcats_0.5.1    stringr_1.4.0   
 [5] dplyr_1.0.5      purrr_0.3.4      readr_1.4.0      tidyr_1.1.3     
 [9] tibble_3.1.0     tidyverse_1.3.0  ade4_1.7-16      factoextra_1.0.7
[13] ggplot2_3.3.3    FactoMineR_2.4  

loaded via a namespace (and not attached):
 [1] httr_1.4.2           jsonlite_1.7.2       prettydoc_0.4.1     
 [4] carData_3.0-4        modelr_0.1.8         assertthat_0.2.1    
 [7] cellranger_1.1.0     yaml_2.2.1           progress_1.2.2      
[10] ggrepel_0.9.1        pillar_1.5.1         backports_1.2.1     
[13] lattice_0.20-41      glue_1.4.2           digest_0.6.27       
[16] ggsignif_0.6.1       rvest_0.3.6          colorspace_2.0-0    
[19] cowplot_1.1.1        htmltools_0.5.1.1    pkgconfig_2.0.3     
[22] broom_0.7.5          haven_2.3.1          bookdown_0.21       
[25] scales_1.1.1         openxlsx_4.2.3       rio_0.5.26          
[28] farver_2.1.0         generics_0.1.0       car_3.0-10          
[31] ellipsis_0.3.1       DT_0.17              withr_2.4.1         
[34] cli_2.3.1            magrittr_2.0.1       crayon_1.4.1        
[37] readxl_1.3.1         evaluate_0.14        fs_1.5.0            
[40] fansi_0.4.2          MASS_7.3-53.1        rstatix_0.7.0       
[43] xml2_1.3.2           foreign_0.8-81       tools_4.0.4         
[46] data.table_1.14.0    prettyunits_1.1.1    hms_1.0.0           
[49] lifecycle_1.0.0      munsell_0.5.0        reprex_1.0.0        
[52] zip_2.1.1            cluster_2.1.1        flashClust_1.01-2   
[55] compiler_4.0.4       rlang_0.4.10         grid_4.0.4          
[58] rstudioapi_0.13      htmlwidgets_1.5.3    leaps_3.1           
[61] labeling_0.4.2       rmarkdown_2.7        gtable_0.3.0        
[64] abind_1.4-5          DBI_1.1.1            curl_4.3            
[67] R6_2.5.0             lubridate_1.7.10     knitr_1.31          
[70] utf8_1.2.1           stringi_1.5.3        Rcpp_1.0.6          
[73] vctrs_0.3.6          scatterplot3d_0.3-41 dbplyr_2.1.0        
[76] tidyselect_1.1.0     xfun_0.22

Answer 1

Since you mentioned multiple keywords, you can pass multiple keywords to str_detect() with the regex | (or) operator.

The following lines filter out (via negate = TRUE) all rows where at least one variable matches at least one of the given patterns ui|Br|lis|Ch|omni.

library(tidyverse)

keywords_to_remove <- c("ui", "Br", "lis", "Ch", "omni")
keywords_regex <- paste0(keywords_to_remove, collapse = "|")
  
msleep %>% 
  filter(if_all(
    .cols = everything(),
    .fns = ~ stringr::str_detect(.x, keywords_regex, negate = TRUE))
  )
#> # A tibble: 9 x 11
#>   name   genus  vore  order conservation sleep_total sleep_rem sleep_cycle awake
#>   <chr>  <chr>  <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
#> 1 Cow    Bos    herbi Arti… domesticated         4         0.7       0.667  20  
#> 2 Dog    Canis  carni Carn… domesticated        10.1       2.9       0.333  13.9
#> 3 Long-… Dasyp… carni Cing… lc                  17.4       3.1       0.383   6.6
#> 4 Horse  Equus  herbi Peri… domesticated         2.9       0.6       1      21.1
#> 5 Golde… Mesoc… herbi Rode… en                  14.3       3.1       0.2     9.7
#> 6 House… Mus    herbi Rode… nt                  12.5       1.4       0.183  11.5
#> 7 Rabbit Oryct… herbi Lago… domesticated         8.4       0.9       0.417  15.6
#> 8 Labor… Rattus herbi Rode… lc                  13         2.4       0.183  11  
#> 9 Easte… Scalo… inse… Sori… lc                   8.4       2.1       0.167  15.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
packageVersion("dplyr")
#> [1] '1.0.5'

Created on 2021-03-23 by the reprex package (v1.0.0)

Answer 2

Second edit based on updated information:

Another way to approach this is to do a rowwise operation and add a matching column based on your chosen regex matches:

If you want to keep the NA values in your final filter then this should work:

regex_match <- "omni"
msleep %>% 
  rowwise() %>%
  # na.rm = TRUE ignores NA cells, so rows with NA and no match are kept
  mutate(regex_match = any(str_detect(c_across(where(is.character)), 
                                      regex(regex_match)), na.rm = TRUE)) %>% 
  filter(!regex_match)

If you want to exclude the NAs, then add a replace_na() step:

msleep %>% 
  rowwise() %>%
  mutate(regex_match = any(str_detect(c_across(where(is.character)), regex("omni")), na.rm = FALSE),
         regex_match = replace_na(regex_match, TRUE)) %>% 
  filter(!regex_match)

So the first version with your metadata:

regex_match <-  "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS"
metadata %>% 
  rowwise() %>%
  mutate(regex_match = any(str_detect(c_across(where(is.character)), regex(regex_match)), na.rm = TRUE)) %>%
  filter(!regex_match)

Edit 1

I think the problem lies in combining a negation with any_vars(). any_vars() keeps a row as soon as one column satisfies the condition, so the negated version returns the whole data frame: every row has at least one column whose value does not contain "omni" or "WGS".
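
To make the logic concrete, here is a small base R sketch for a single row (the vector is a made-up example, with grepl() standing in for str_detect()). Negating inside any_vars() asks "does at least one column not match?", whereas the intended test is "does no column match?":

```r
# One made-up row, checked column by column.
row <- c(name = "Pig", vore = "omni", order = "Artiodactyla")
hit <- grepl("omni", row)

any(!hit)  # TRUE: some column lacks "omni", so the row is (wrongly) kept
all(!hit)  # FALSE: not every column lacks "omni", so the row is dropped
!any(hit)  # FALSE: equivalent to all(!hit)
```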

With the latest version of dplyr syntax, you could try the following:

msleep %>% filter(if_all(starts_with("vore"), ~!str_detect(.x, "omni")))

This focuses on just the one column, or

msleep %>% filter(if_all(everything(), ~!str_detect(.x, "omni")))

for the entire dataframe.

Does that get what you need?

Answer 3

The suggestions from @Marcelo Avila and @awaji98 work for my problem. However, I would like to point out a subtlety in this code: it seems that rows containing NA are removed by the suggestions above:

msleep %>% filter_all(all_vars(str_detect(., "omni", negate=T)))

msleep %>% filter(if_all(everything(), ~!str_detect(.x, "omni")))

msleep %>% 
  filter(if_all(
    .cols = everything(),
    .fns = ~ stringr::str_detect(.x, "omni", negate = TRUE))
  )

no <- msleep %>% filter(if_any(everything(), ~str_detect(., "omni")))

no
# A tibble: 20 x 11
   name     genus  vore  order  conservation sleep_total sleep_rem sleep_cycle
   <chr>    <chr>  <chr> <chr>  <chr>              <dbl>     <dbl>       <dbl>
 1 Owl mon… Aotus  omni  Prima… NA                  17         1.8      NA    
 2 Greater… Blari… omni  Soric… lc                  14.9       2.3       0.133
 3 Grivet   Cerco… omni  Prima… lc                  10         0.7      NA    
 4 Star-no… Condy… omni  Soric… lc                  10.3       2.2      NA    
 5 African… Crice… omni  Roden… NA                   8.3       2        NA    
 6 Lesser … Crypt… omni  Soric… lc                   9.1       1.4       0.15 
 7 North A… Didel… omni  Didel… lc                  18         4.9       0.333
 8 Europea… Erina… omni  Erina… lc                  10.1       3.5       0.283
 9 Patas m… Eryth… omni  Prima… lc                  10.9       1.1      NA    
10 Galago   Galago omni  Prima… NA                   9.8       1.1       0.55 
11 Human    Homo   omni  Prima… NA                   8         1.9       1.5  
12 Macaque  Macaca omni  Prima… NA                  10.1       1.2       0.75 
13 Chimpan… Pan    omni  Prima… NA                   9.7       1.4       1.42 
14 Baboon   Papio  omni  Prima… NA                   9.4       1         0.667
15 Potto    Perod… omni  Prima… lc                  11        NA        NA    
16 African… Rhabd… omni  Roden… NA                   8.7      NA        NA    
17 Squirre… Saimi… omni  Prima… NA                   9.6       1.4      NA    
18 Pig      Sus    omni  Artio… domesticated         9.1       2.4       0.5  
19 Tenrec   Tenrec omni  Afros… NA                  15.6       2.3      NA    
20 Tree sh… Tupaia omni  Scand… NA                   8.9       2.6       0.233
# … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>

We find 20 rows that contain the pattern "omni".

no <- msleep %>% filter(if_any(everything(), ~str_detect(., "omni")))
msleep[msleep$vore %notin% no$vore,]

# A tibble: 63 x 11
   name     genus  vore  order  conservation sleep_total sleep_rem sleep_cycle
   <chr>    <chr>  <chr> <chr>  <chr>              <dbl>     <dbl>       <dbl>
 1 Cheetah  Acino… carni Carni… lc                  12.1      NA        NA    
 2 Mountai… Aplod… herbi Roden… nt                  14.4       2.4      NA    
 3 Cow      Bos    herbi Artio… domesticated         4         0.7       0.667
 4 Three-t… Brady… herbi Pilosa NA                  14.4       2.2       0.767
 5 Norther… Callo… carni Carni… vu                   8.7       1.4       0.383
 6 Vesper … Calom… NA    Roden… NA                   7        NA        NA    
 7 Dog      Canis  carni Carni… domesticated        10.1       2.9       0.333
 8 Roe deer Capre… herbi Artio… lc                   3        NA        NA    
 9 Goat     Capri  herbi Artio… lc                   5.3       0.6      NA    
10 Guinea … Cavis  herbi Roden… domesticated         9.4       0.8       0.217
# … with 53 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

This correctly removes the 20 rows and returns a 63-row df. However, because of the NAs, it seems that the following code (and the other versions above) returns the wrong df.

library(tidyverse)
msleep %>% 
  filter(
    if_all(
      everything(), 
      ~stringr::str_detect(., "omni", negate = T)
    )
  )
# A tibble: 15 x 11
   name     genus  vore  order  conservation sleep_total sleep_rem sleep_cycle
   <chr>    <chr>  <chr> <chr>  <chr>              <dbl>     <dbl>       <dbl>
 1 Cow      Bos    herbi Artio… domesticated         4         0.7       0.667
 2 Dog      Canis  carni Carni… domesticated        10.1       2.9       0.333
 3 Guinea … Cavis  herbi Roden… domesticated         9.4       0.8       0.217
 4 Chinchi… Chinc… herbi Roden… domesticated        12.5       1.5       0.117
 5 Long-no… Dasyp… carni Cingu… lc                  17.4       3.1       0.383
 6 Big bro… Eptes… inse… Chiro… lc                  19.7       3.9       0.117
 7 Horse    Equus  herbi Peris… domesticated         2.9       0.6       1    
 8 Domesti… Felis  carni Carni… domesticated        12.5       3.2       0.417
 9 Golden … Mesoc… herbi Roden… en                  14.3       3.1       0.2  
10 House m… Mus    herbi Roden… nt                  12.5       1.4       0.183
11 Rabbit   Oryct… herbi Lagom… domesticated         8.4       0.9       0.417
12 Laborat… Rattus herbi Roden… lc                  13         2.4       0.183
13 Eastern… Scalo… inse… Soric… lc                   8.4       2.1       0.167
14 Thirtee… Sperm… herbi Roden… lc                  13.8       3.4       0.217
15 Brazili… Tapir… herbi Peris… vu                   4.4       1         0.9  
# … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>

Something odd happens when negating str_detect().
If anyone has any insight into this, it would be tremendous, as I fear I will have trouble sleeping tonight.

Thanks a lot, Cheers from Paris.
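
A likely explanation, offered as a sketch rather than a definitive diagnosis: str_detect() returns NA for an NA input, if_all() then evaluates to NA for any row that has an NA and no FALSE, and filter() drops rows whose condition is NA. Accepting NA cells explicitly keeps those rows. The tibble toy below is made up for illustration:

```r
library(dplyr)
library(stringr)

# Made-up data: rows "b" and "c" contain an NA but no "omni".
toy <- tibble(
  name = c("a", "b", "c", "d"),
  vore = c("omni", NA, "herbi", "carni"),
  conservation = c("lc", "lc", NA, "nt")
)

# NA propagates through the test: only the complete, non-matching row "d" survives.
strict <- toy %>%
  filter(if_all(everything(), ~ !str_detect(.x, "omni")))

# Accept NA cells explicitly: rows "b", "c" and "d" survive.
na_safe <- toy %>%
  filter(if_all(everything(), ~ is.na(.x) | !str_detect(.x, "omni")))
```

The same is.na(.x) | ... guard can be combined with a multi-keyword pattern such as the one used for the metadata.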

3 answers

solution1
0 2021-03-22 16:38:12

solution2
0 ACCEPTED 2021-03-22 16:46:20

solution3
0 2021-03-23 20:13:28
