R dplyr: filtering data with values greater than +N and less than -N: abs() function?
I am using the dplyr package in R to filter my gene expression data. I have calculated fold changes and would like to keep the genes (rows) in which at least one sample (column) has a value greater than +0.584963 OR less than -0.584963. Example data:
X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1 0.6780 0.4050 0.8870 0.3300 0.2230
GENE_2 0.2340 -0.6670 0.0020 0.1240 0.3560
GENE_3 0.0170 0.1560 0.1120 0.0080 -0.1230
GENE_4 -0.0944 -0.1372 -0.1800 -0.2228 -0.2656
GENE_5 -0.8080 -0.7800 -0.5560 0.0340 0.4450
GENE_6 0.2091 0.1106 0.0121 -0.0864 -0.1849
GENE_7 0.5980 0.7680 0.9970 0.4670 -0.7760
I am currently using the following script:
det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("`",det.cols,"`",">abs(0.584963)", sep = "")))
my.datasub<- my.data %>% filter_(filt)
but this returns only the genes with values greater than +0.584963 and not the ones with values below -0.584963. For the example above, I want a subset containing Genes 1, 2, 5 and 7, but it gives me only Genes 1 and 7. How can I change this?
I am expecting the answer to be in this format:
X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1 0.6780 0.4050 0.8870 0.3300 0.2230
GENE_2 0.2340 -0.6670 0.0020 0.1240 0.3560
GENE_5 -0.8080 -0.7800 -0.5560 0.0340 0.4450
GENE_7 0.5980 0.7680 0.9970 0.4670 -0.7760
Thanks.
Using filter_at from dplyr might be an even more flexible approach:
library(dplyr)

# set up sample data with 50000 rows [as proposed by Arthur Yip in another answer]
mydata <- tibble(X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                 SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                 SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                 SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                 SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                 SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))
# duplicate 30 more columns [as proposed by Arthur Yip in another answer]
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])
# keep rows where any column whose name contains "fc" (matched case-insensitively) exceeds the threshold in absolute value
mydata2 %>%
  filter_at(vars(contains("fc")), .vars_predicate = any_vars(abs(.) > 0.584963))
In vars() you define the list of variables to which the filter is applied. With .vars_predicate you define the filter criterion (any_vars combines the per-column conditions with |, all_vars combines them with &).
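For readers on a recent dplyr, the same "any column passes" filter can be sketched with if_any(), which as far as I know is available from dplyr 1.0.4 onward, where the scoped verbs such as filter_at() are marked as superseded. The names genes, threshold and kept below are illustrative and not part of the original answer; the data are the seven example genes from the question.

```r
library(dplyr)

# The seven example genes from the question (illustrative object name)
genes <- tibble(
  X = paste0("GENE_", 1:7),
  SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598),
  SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768),
  SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997),
  SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467),
  SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776)
)

threshold <- 0.584963

# Keep a row if ANY fold-change column exceeds the threshold in absolute value
kept <- genes %>%
  filter(if_any(contains("FC"), ~ abs(.x) > threshold))

kept$X
```

Swapping if_any() for if_all() would require every selected column to pass, mirroring the any_vars / all_vars distinction.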
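Long story short, you had the abs() in the wrong place in your code. abs(0.584963) only takes the absolute value of the (already positive) threshold, so each column was still being compared with a plain "greater than +0.584963" test. The abs() has to wrap the column instead. I fixed it here: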
det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("abs(`",det.cols,"`)",">0.584963", sep = "")))
my.datasub<- my.data %>% filter_(filt)
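To see the difference in isolation, here is a small sketch using the GENE_2 row from the question. The variable names gene_2 and threshold are only for illustration.

```r
# Fold changes for GENE_2 from the example data
gene_2 <- c(0.2340, -0.6670, 0.0020, 0.1240, 0.3560)
threshold <- 0.584963

# abs() around the threshold changes nothing: still a one-sided test
one_sided <- any(gene_2 > abs(threshold))

# abs() around the values makes the test two-sided
two_sided <- any(abs(gene_2) > threshold)

one_sided  # FALSE: the -0.667 value is missed
two_sided  # TRUE:  |-0.667| is above the threshold
```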
For further flexibility, @ha_pu provided a great filter_at solution building off my previous solution (which I wrote before I identified the error in your code).
Here's a solution that is flexible to the number of samples and data rows. It involves transforming the data into long format and then filtering for the gene and sample combinations that pass the threshold. I tested it on 50k genes and 35 samples, and it ran in < 1 second.
library(tidyverse)
# set up sample data with 50000 rows
mydata <- data.frame(stringsAsFactors=FALSE,
                     X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))
# duplicate 30 more columns
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])
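# reshape to long format, keep values beyond the threshold, then list the matching genes
# note: 2:length(mydata) selects columns 2 to 6, i.e. only the first five sample columns of mydata2
(mydata3 <- mydata2 %>% gather(key = "sample_num", value = "fc", 2:length(mydata)) %>%
  filter(fc > 0.584963 | fc < -0.584963) %>%
  select(X) %>%
  arrange(desc(X)) %>%
  unique() %>%
  head())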
#> X
#> 1 GENE_7
#> 5 GENE_5
#> 7 GENE_2
#> 8 GENE_1
#> 10 9999
#> 14 9998
Created on 2019-03-01 by the reprex package (v0.2.1)
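As a variation on the long-format idea, here is a minimal sketch that uses pivot_longer() (available in tidyr 1.0.0 and later, where gather() is documented as superseded) and then joins back to the wide table so the result has the shape requested in the question. The names genes, hits and kept are illustrative and not from the original answer; only the seven example genes are used.

```r
library(dplyr)
library(tidyr)

# The seven example genes from the question (illustrative object name)
genes <- tibble(
  X = paste0("GENE_", 1:7),
  SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598),
  SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768),
  SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997),
  SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467),
  SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776)
)

# Long format: one row per gene/sample pair, then keep the pairs beyond the threshold
hits <- genes %>%
  pivot_longer(cols = -X, names_to = "sample", values_to = "fc") %>%
  filter(abs(fc) > 0.584963) %>%
  distinct(X)

# Return the original wide rows for genes with at least one hit
kept <- genes %>% semi_join(hits, by = "X")

kept$X
```

The semi_join() step keeps the rows of genes in their original order, so the output lists Genes 1, 2, 5 and 7 with all five sample columns.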