
How to filter rows for every column independently using dplyr

I have the following tibble:


library(tidyverse)
df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a",   1,  2,
  "b",   2,  3,
  "c",   3,  4,
  "d",   1,  1
)

df
#> # A tibble: 4 x 3
#>    gene  colB  colC
#>   <chr> <dbl> <dbl>
#> 1     a     1     2
#> 2     b     2     3
#> 3     c     3     4
#> 4     d     1     1

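For every column after the gene column, I want to keep only values greater than or equal to 2 (>= 2): values below 2 should become NA, and rows in which no column reaches 2 should be dropped. The result should look like this: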

gene  colB  colC
a     NA    2
b     2     3
c     3     4

How can I achieve that?

The real data has more than two columns after gene.

The forthcoming dplyr 0.6 (install from GitHub now, if you like) has filter_at, which can be used to keep only the rows where any of the selected columns has a value greater than or equal to 2. The values below 2 can then be set to NA with replace inside mutate_at (na_if is not the right tool for a threshold, since it only blanks values that are equal to its second argument), so

df %>%
    filter_at(vars(-gene), any_vars(. >= 2)) %>%
    mutate_at(vars(-gene), funs(replace(., . < 2, NA)))
#> # A tibble: 3 x 3
#>    gene  colB  colC
#>   <chr> <dbl> <dbl>
#> 1     a    NA     2
#> 2     b     2     3
#> 3     c     3     4

or similarly, set the values to NA first and then drop the rows that are NA in every column:

df %>%
    mutate_at(vars(-gene), funs(replace(., . < 2, NA))) %>%
    filter_at(vars(-gene), any_vars(!is.na(.)))

which can be translated for use with dplyr 0.5, which has no filter_at:

df %>%
    mutate_at(vars(-gene), funs(replace(., . < 2, NA))) %>%
    filter(rowSums(is.na(.)) < (ncol(.) - 1))

All return the same thing.
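
For readers on a recent dplyr, where the scoped verbs (filter_at, mutate_at) are superseded and funs() is deprecated, the same idea can be sketched with if_any() and across(). This is a minimal sketch, assuming dplyr 1.0.4 or later; the threshold variable is an illustrative name that does not appear in the original answers.

```r
library(dplyr)

df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a",   1,  2,
  "b",   2,  3,
  "c",   3,  4,
  "d",   1,  1
)

# Illustrative name (an assumption, not part of the original answers)
threshold <- 2

result <- df %>%
  # keep rows where at least one non-gene column reaches the threshold
  filter(if_any(-gene, ~ .x >= threshold)) %>%
  # set the remaining values below the threshold to NA
  mutate(across(-gene, ~ replace(.x, .x < threshold, NA)))

result
```

Because the columns are selected with -gene, this should work unchanged however many value columns follow gene.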

One solution: convert from wide to long format, so you can filter on just one column, then convert back to wide at the end if required. Note that this will drop genes where no values meet the condition.

library(tidyverse)
df %>%
  gather(name, value, -gene) %>%
  filter(value >= 2) %>%
  spread(name, value)

#> # A tibble: 3 x 3
#>    gene  colB  colC
#> * <chr> <dbl> <dbl>
#> 1     a    NA     2
#> 2     b     2     3
#> 3     c     3     4
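
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(), so the same wide-to-long-to-wide round trip might be written as below. This is a sketch under that version assumption. One difference to watch for: pivot_wider() orders the new columns by first appearance after filtering rather than alphabetically, so the final select() is added here to restore the original column order.

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a",   1,  2,
  "b",   2,  3,
  "c",   3,  4,
  "d",   1,  1
)

wide <- df %>%
  # one row per gene/column pair
  pivot_longer(-gene, names_to = "name", values_to = "value") %>%
  # drop every value below 2; genes with no remaining values disappear
  filter(value >= 2) %>%
  # back to one row per gene; missing pairs are filled with NA
  pivot_wider(names_from = name, values_from = value) %>%
  # restore the original column order (any_of tolerates columns that were dropped entirely)
  select(any_of(names(df)))

wide
```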

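We can use data.table: first keep the rows where any of the columns is at least 2, then replace the values below 2 with NA using :=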

library(data.table)
setDT(df)[df[, Reduce(`|`, lapply(.SD, `>=`, 2)), .SDcols = colB:colC]
   ][, (2:3) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = colB:colC][]
#   gene colB colC
#1:    a   NA    2
#2:    b    2    3
#3:    c    3    4
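
Since the question mentions that the real data has more than two value columns, the same data.table approach can be sketched without hardcoding colB:colC or the positions 2:3. The names value_cols, keep and res are illustrative assumptions, and the data is rebuilt as a data.table so the example is self-contained.

```r
library(data.table)

dt <- data.table(
  gene = c("a", "b", "c", "d"),
  colB = c(1, 2, 3, 1),
  colC = c(2, 3, 4, 1)
)

# every column except gene is treated as a value column
value_cols <- setdiff(names(dt), "gene")

# TRUE for rows where at least one value column is >= 2
keep <- dt[, Reduce(`|`, lapply(.SD, function(x) x >= 2)), .SDcols = value_cols]

# subsetting returns a new data.table, so dt itself is left untouched
res <- dt[keep]

# replace values below 2 with NA in all value columns
res[, (value_cols) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = value_cols]

res
```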

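Or with melt/dcast: reshape to long format, keep the values that are at least 2, and reshape back to wide

dcast(melt(setDT(df), id.vars = 'gene')[value >= 2], gene ~ variable, value.var = 'value')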
#   gene colB colC
#1:    a   NA    2
#2:    b    2    3
#3:    c    3    4
