
R dplyr filtering data with values greater than +N and lesser than -N : abs() function?

I am using the dplyr package in R to filter my gene expression data. I have calculated fold changes and would like to keep the genes (rows) in which at least one sample (column) has a value greater than +0.584963 OR less than -0.584963. Example data:

       X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_3      0.0170      0.1560      0.1120      0.0080     -0.1230
GENE_4     -0.0944     -0.1372     -0.1800     -0.2228     -0.2656
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_6      0.2091      0.1106      0.0121     -0.0864     -0.1849
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

I am currently using the following script:

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("`",det.cols,"`",">abs(0.584963)", sep = "")))
my.datasub<- my.data %>% filter_(filt)

but this returns only the genes greater than +0.584963 and not the negative ones. In the case of the example, what I want is a subsetted list with Genes 1, 2, 5 and 7. But instead it gives me only Genes 1 and 7. How can I change this?

I am expecting the answer to be in this format:

       X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

Thanks.

Using filter_at from dplyr might be an even more flexible approach...

library(tidyverse)

# set up sample data with 50000 rows [as proposed in Arthur Yip's answer]
mydata <- tibble(X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns [as proposed in Arthur Yip's answer]
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

mydata2 %>%
  filter_at(vars(contains("fc")), .vars_predicate =  any_vars(abs(.) > 0.584963))

Inside vars() you list the variables to which the filter should apply. The .vars_predicate argument takes the filter criterion: any_vars() combines the per-column tests with | , and all_vars() combines them with & .
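In current dplyr the scoped verbs such as filter_at() are superseded, and the same "any column" test can be written with if_any(). The following is a minimal sketch on the seven example genes, assuming dplyr 1.0.4 or later; the name `threshold` is introduced here for illustration and does not appear in the original code.

```r
library(dplyr)

# The seven example genes from the question
my.data <- tibble(
  X = paste0("GENE_", 1:7),
  SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598),
  SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768),
  SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997),
  SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467),
  SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776)
)

# Illustrative name for the cutoff (an assumption, not from the original)
threshold <- 0.584963

# Keep a row if ANY fold-change column exceeds the cutoff in absolute value
my.datasub <- my.data %>%
  filter(if_any(ends_with("_FC"), ~ abs(.x) > threshold))

my.datasub$X
```

Swapping if_any() for if_all() would keep only the rows where every selected column passes, which mirrors the any_vars() / all_vars() distinction.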

Long story short, you had the abs() in the wrong place in your code.

I fixed it here:

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("abs(`",det.cols,"`)",">0.584963", sep = "")))
my.datasub<- my.data %>% filter_(filt)
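To see why the position of abs() matters, here is a small base R sketch using the first three fold changes of GENE_2 from the example. The names `fc` and `n` are illustrative only. Because abs(0.584963) is simply 0.584963, wrapping the constant leaves the condition one-sided.

```r
# First three fold changes of GENE_2 from the example table
fc <- c(0.234, -0.667, 0.002)
n  <- 0.584963

# Original condition: abs() wraps the constant, so this is just fc > n
original <- fc > abs(n)

# Corrected condition: abs() wraps the data, so both tails are tested
corrected <- abs(fc) > n

# Explicit two-sided form, equivalent to the corrected condition
explicit <- fc > n | fc < -n

original
corrected
```

Only the corrected form flags the -0.667 value, which is why GENE_2 and GENE_5 were missing from the original result.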

For further flexibility, @ha_pu provided a great filter_at solution building off my previous solution (before I identified the error in your code).

Here's a solution that is flexible to the number of samples and data rows. It involves transforming the data into long format and then filtering for the gene and specific sample. I tested it on 50k genes and 35 samples, and it ran in < 1 second.

library(tidyverse)

# set up sample data with 50000 rows
mydata <- data.frame(stringsAsFactors=FALSE,
                     X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

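# gather to long format; note that 2:length(mydata) covers only columns 2 to 6 (the first five samples)
(mydata3 <- mydata2 %>% gather(key = "sample_num", value = "fc", 2:length(mydata)) %>%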
  filter(fc > 0.584963 | fc < -0.584963) %>%
  select(X) %>%
  arrange(desc(X)) %>%
  unique() %>%
  head())
#>         X
#> 1  GENE_7
#> 5  GENE_5
#> 7  GENE_2
#> 8  GENE_1
#> 10   9999
#> 14   9998

Created on 2019-03-01 by the reprex package (v0.2.1)
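The long-format idea can also be sketched with pivot_longer(), which tidyr now recommends over gather(), followed by a join back to the wide table so the result has the layout the question asked for. This is a sketch on a four-gene, two-sample subset of the example data. The names `small`, `hits` and `result` are illustrative, and the semi_join() step is an addition that is not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Subset of the example data: four genes, two samples
small <- tibble(
  X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4"),
  SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944),
  SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372)
)

# Long format: one row per gene/sample pair, then keep genes with any hit
hits <- small %>%
  pivot_longer(-X, names_to = "sample_num", values_to = "fc") %>%
  filter(abs(fc) > 0.584963) %>%
  distinct(X)

# Recover the original wide rows for the genes that passed
result <- semi_join(small, hits, by = "X")

result
```

Because semi_join() only filters the left table, the rows come back in their original order with all sample columns intact.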
