R dplyr: filter data with values greater than +N or less than -N: abs() function?

Question

I am using the dplyr package in R to filter my gene expression data. I have calculated fold changes and would like to keep the genes (rows) in which at least one sample (column) has a value greater than +0.584963 or less than -0.584963. Example data:

       X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_3      0.0170      0.1560      0.1120      0.0080     -0.1230
GENE_4     -0.0944     -0.1372     -0.1800     -0.2228     -0.2656
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_6      0.2091      0.1106      0.0121     -0.0864     -0.1849
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

I am currently using the following script:

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("`",det.cols,"`",">abs(0.584963)", sep = "")))
my.datasub<- my.data %>% filter_(filt)

But this returns only the genes with values greater than +0.584963, not the negative ones. For the example, I want a subset containing Genes 1, 2, 5 and 7, but instead I get only Genes 1 and 7. How can I change this?

I am expecting the answer to be in this format:

       X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

Thanks.

Answer 1

Using filter_at from dplyr might be an even more flexible approach...

library(dplyr)

# set up sample data with 50000 rows [as proposed by Arthur Yip in another answer]
mydata <- tibble(X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns [as proposed by Arthur Yip in another answer]
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

mydata2 %>%
  filter_at(vars(contains("fc")), .vars_predicate =  any_vars(abs(.) > 0.584963))

Inside vars() you define the variables to which the filter is applied. With .vars_predicate you define the filter criterion (any_vars combines the per-column conditions with |, all_vars combines them with &).
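In current dplyr releases filter_at() is superseded, and the same "at least one column" logic can be written with if_any() inside filter(). The following is a minimal sketch, assuming dplyr 1.0.4 or later; the names my.data, threshold and kept are illustrative, and the data are the seven example genes from the question.

```r
library(dplyr)

# The seven example genes from the question
my.data <- tibble(
  X = paste0("GENE_", 1:7),
  SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598),
  SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768),
  SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997),
  SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467),
  SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776)
)

threshold <- 0.584963  # cutoff taken from the question

# Keep rows where at least one FC column exceeds the cutoff in absolute value.
# contains() ignores case by default, so "fc" matches the "_FC" suffix.
kept <- my.data %>%
  filter(if_any(contains("fc"), ~ abs(.x) > threshold))

kept$X
```

Swapping if_any() for if_all() would instead require every selected column to pass the test, mirroring the any_vars / all_vars distinction.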

Answer 2

Long story short, you had abs() in the wrong place in your code.

I fixed it here:

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("abs(`",det.cols,"`)",">0.584963", sep = "")))
my.datasub<- my.data %>% filter_(filt)

For further flexibility, @ha_pu provided a great filter_at solution building on my previous solution (written before I identified the error in your code).
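To see why the placement of abs() matters, it may help to compare the two conditions on a plain numeric vector. This is only an illustration in base R; the vector fc and the variable names are made up for the example.

```r
threshold <- 0.584963
fc <- c(0.678, -0.667, 0.156, -0.808)  # illustrative fold changes

# abs() around the constant changes nothing, since abs(0.584963) is 0.584963,
# so only positive values above the cutoff pass
wrong <- fc > abs(threshold)

# abs() around the data folds negative values onto the positive side,
# so large negative fold changes pass as well
right <- abs(fc) > threshold

# The same test spelled out without abs(), for comparison
explicit <- fc > threshold | fc < -threshold

wrong  # TRUE FALSE FALSE FALSE
right  # TRUE  TRUE FALSE  TRUE
```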

Answer 3

Here's a solution that is flexible with respect to the number of samples and data rows. It transforms the data into long format and then filters on the fold-change value of each gene and sample. I tested it on 50k genes and 35 samples, and it ran in under 1 second.

library(tidyverse)

# set up sample data with 50000 rows
mydata <- data.frame(stringsAsFactors=FALSE,
                     X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

(mydata3 <- mydata2 %>% gather(key = "sample_num", value = "fc", 2:length(mydata)) %>%
  filter(fc > 0.584963 | fc < -0.584963) %>%
  select(X) %>%
  arrange(desc(X)) %>%
  unique() %>%
  head())
#>         X
#> 1  GENE_7
#> 5  GENE_5
#> 7  GENE_2
#> 8  GENE_1
#> 10   9999
#> 14   9998

Created on 2019-03-01 by the reprex package (v0.2.1)
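The long-format idea can also be sketched with tidyr's pivot_longer(), which supersedes gather() in newer tidyr versions. This variant is not part of the original answer: it assumes tidyr 1.0.0 or later, uses only the seven example genes from the question, and adds a semi_join() step (my own addition) to get back the full wide rows the question asked for.

```r
library(dplyr)
library(tidyr)

# The seven example genes from the question
my.data <- tibble(
  X = paste0("GENE_", 1:7),
  SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598),
  SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768),
  SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997),
  SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467),
  SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776)
)

threshold <- 0.584963  # cutoff taken from the question

# One row per gene/sample pair, then keep the genes with any value past the cutoff
hits <- my.data %>%
  pivot_longer(cols = -X, names_to = "sample_num", values_to = "fc") %>%
  filter(abs(fc) > threshold) %>%
  distinct(X)

# Recover the full wide rows for the selected genes, in their original order
my.datasub <- my.data %>%
  semi_join(hits, by = "X")

my.datasub$X
```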

3 answers

Answer 1
1 2019-02-28 09:43:53

Answer 2
1 accepted 2019-03-01 09:09:01

Answer 3
0 2019-03-01 05:37:44
