Calculate pairwise Spearman's rank correlation from data present in all files in a directory

I'm trying to calculate Spearman's rank correlation, where the data (TSV with name and rank) for each experiment is stored in a separate file in a directory.

The input files have the following format:

#header not present
#geneName   value
ENSMUSG00000026179.14   14.5648627685587
ENSMUSG00000026179.14   0.652158034413075
ENSMUSG00000026179.14   0.652158034413075
ENSMUSG00000026179.14   1.852158034413075
ENSMUSG00000026176.13   4.13033421794948
ENSMUSG00000026176.13   4.13033421794948
ENSMUSG00000026176.13   15.4344068144428
ENSMUSG00000026176.13   15.4344068144428
ENSMUSG00000026176.13   6.9563523670728
...

My problem is that the keys (gene names) are repeated, and each experiment file contains a different but overlapping set of gene names. What I need, for each pair of files, is the intersection of their gene names with duplicates removed before performing the correlation, probably something like this pseudocode:

# Find the correlation for all possible pairs of inputs (i.e. files in the directory)
files = list_Of_files("directory")
for (i in files) {
    for (k in files) {
        CommonGenes <- intersect(i, k)
        # Keep only the common genes; for a repeated gene, keep the entry with the highest value.
        tempi <- removeRepetitive(i, CommonGenes)
        tempk <- removeRepetitive(k, CommonGenes)
        correlationArray[] <- spearman(tempi, tempk)  # Correlate only the common genes
    }
}

Ultimately, I want to plot the correlation matrix using corrplot or qtlcharts.

First, read all the data into a list of data frames (see this post for more info); here we are just creating dummy data.

library(dplyr)

# dummy data
set.seed(1)
myDfs <- list(
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15)),
  data.frame(geneName = sample(LETTERS[1:4], 15, replace = TRUE), value = runif(15))
)
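For real data, the dummy list can be replaced by the files themselves. The following is a minimal sketch, assuming the files are tab-separated, have no header line, and end in .tsv. The function name read_gene_files, the file pattern, and the column names geneName and value are assumptions chosen to match the dummy data above.

```r
# Hypothetical helper: read every headerless TSV file in a directory
# into a named list of data frames with columns geneName and value.
read_gene_files <- function(data_dir, pattern = "\\.tsv$") {
  paths <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  dfs <- lapply(paths, read.delim, header = FALSE,
                col.names = c("geneName", "value"),
                stringsAsFactors = FALSE)
  names(dfs) <- basename(paths)  # keep file names as experiment labels
  dfs
}

# Usage (the directory name is a placeholder):
# myDfs <- read_gene_files("directory")
```

Naming the list elements after the files makes it easier to label the correlation matrix later on.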

Then, just like your two nested for loops, what we have here is two nested apply functions. Within the loops we aggregate each data frame and compute the correlation on the matched (merged) gene names.

res <- sapply(myDfs, function(i){
  # group by gene, get max value
  imax <- i %>% group_by(geneName) %>% summarise(i_Max = max(value))
  sapply(myDfs, function(j){
    # group by gene, get max value
    jmax <- j %>% group_by(geneName) %>% summarise(j_Max = max(value))
    # get overlapping genes
    ij <- merge(imax, jmax, by = "geneName")
    # return correlation
    cor(ij$i_Max, ij$j_Max, method = "spearman")
  })
})

res now holds the correlation matrix.

res

#      [,1] [,2] [,3] [,4] [,5]
# [1,]  1.0 -0.2  1.0  0.4 -0.4
# [2,] -0.2  1.0 -0.2  0.8  0.0
# [3,]  1.0 -0.2  1.0  0.4 -0.4
# [4,]  0.4  0.8  0.4  1.0 -0.4
# [5,] -0.4  0.0 -0.4 -0.4  1.0
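The matrix above has no row or column names because the list of data frames is unnamed. As a rough base R sketch of the same idea (not the code above), the duplicates can be collapsed once per data frame and the result labelled with the list names. The function names collapse_max and pairwise_spearman are made up for this illustration, and the columns are assumed to be called geneName and value.

```r
# Hypothetical helper: one value per gene, keeping the maximum
collapse_max <- function(df) {
  vapply(split(df$value, as.character(df$geneName)), max, numeric(1))
}

# Hypothetical helper: Spearman correlation for every pair of data frames,
# computed on the genes that both members of the pair share
pairwise_spearman <- function(dfs) {
  maxes <- lapply(dfs, collapse_max)
  n <- length(maxes)
  res <- matrix(NA_real_, n, n, dimnames = list(names(dfs), names(dfs)))
  for (a in seq_len(n)) {
    for (b in seq_len(n)) {
      common <- intersect(names(maxes[[a]]), names(maxes[[b]]))
      # With fewer than two shared genes the entry stays NA
      if (length(common) >= 2) {
        res[a, b] <- cor(maxes[[a]][common], maxes[[b]][common],
                         method = "spearman")
      }
    }
  }
  res
}

# Usage: res <- pairwise_spearman(myDfs)
```

If the list is named (for example with file names), those names become the row and column labels of the matrix.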

For the correlation plot there are many alternatives to choose from. Here, as an example, we use corrplot:

corrplot::corrplot(res)


Here's an alternative solution. Rather than having a nested loop, it uses expand.grid to create the combinations, and then uses a pipeline of dplyr verbs to calculate correlations on a subset of the master table.

This approach has both advantages and disadvantages. Foremost, it fits nicely into the “tidy data” approach, and some advocate working with tidy data as much as possible. The actual code is about as long as zx8754's.

library(dplyr)

genes = sprintf('ENSMUSG%011d', 1 : 50)
my_dfs = replicate(4, tibble(Gene = sample(genes, 20, replace = TRUE), Value = runif(20)),
                   simplify = FALSE)

First off, we want to make the gene names unique, because everything that follows requires unique genes per table:

my_dfs = lapply(my_dfs, function (x) summarize(group_by(x, Gene), Value = max(Value)))

Now we can create all pairwise combinations of this list (every ordered pair, including each table paired with itself):

combinations = bind_cols(expand.grid(i = seq_along(my_dfs), j = seq_along(my_dfs)),
                         expand.grid(x = my_dfs, y = my_dfs))

At this point, we have a table with the indices of all pairwise combinations i, j, as well as the combinations themselves as list columns:

# A tibble: 16 x 4
       i     j                 x                 y
   <int> <int>            <list>            <list>
 1     1     1 <tibble [17 x 2]> <tibble [17 x 2]>
 2     2     1 <tibble [18 x 2]> <tibble [17 x 2]>
 3     3     1 <tibble [19 x 2]> <tibble [17 x 2]>
…
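The grid above contains every ordered pair, so each correlation is computed twice and each table is also compared with itself. Because Spearman's correlation is symmetric, the unordered pairs would be enough. The following base R sketch shows one way to enumerate them with combn and to mirror the results into a full matrix. The helper name fill_symmetric is made up for this illustration, and setting the diagonal to 1 assumes that every table has at least two distinct values.

```r
n <- 4                      # number of tables, as in the example above
pairs <- t(combn(n, 2))     # one row per unordered pair with i < j
colnames(pairs) <- c("i", "j")

# Hypothetical helper: build a symmetric matrix from one value per pair
fill_symmetric <- function(n, pairs, values) {
  m <- diag(1, n)                            # self-correlation assumed to be 1
  m[pairs] <- values                         # upper triangle
  m[pairs[, 2:1, drop = FALSE]] <- values    # mirrored lower triangle
  m
}
```

For 4 tables this gives choose(4, 2) = 6 pairs instead of the 16 rows of the full grid.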

We now group by the indices and join the single list columns in each group by gene names: 我们现在按索引进行分组,并按基因名称连接每个组中的单个列表列:

correlations = combinations %>%
    group_by(i, j) %>%
    do(inner_join(.$x[[1]], .$y[[1]], by = 'Gene')) %>%
    print() %>%
    summarize(Cor = cor(Value.x, Value.y, method = 'spearman'))

Intermission: at the print() line we are left with a fully expanded table of all pairwise combinations of all gene tables (the Value columns of the two original tables have been renamed to Value.x and Value.y, respectively):

# A tibble: 182 x 5
# Groups:   i, j [16]
       i     j               Gene    Value.x    Value.y
   <int> <int>              <chr>      <dbl>      <dbl>
 1     1     1 ENSMUSG00000000014 0.93470523 0.93470523
 2     1     1 ENSMUSG00000000019 0.21214252 0.21214252
 3     1     1 ENSMUSG00000000028 0.65167377 0.65167377
 4     1     1 ENSMUSG00000000043 0.12555510 0.12555510
 5     1     1 ENSMUSG00000000010 0.26722067 0.26722067
 6     1     1 ENSMUSG00000000041 0.38611409 0.38611409
 7     1     1 ENSMUSG00000000042 0.01339033 0.01339033
…

The next line trivially calculates pairwise correlations from these tables, using the same groups. Since the whole table is in long format, it can be conveniently plotted with ggplot2:

library(ggplot2)

ggplot(correlations) +
    aes(i, j, fill = Cor) +  # geom_tile colours the tile interior via `fill`
    geom_tile() +
    scale_fill_gradient2()


… but if you need this as a square correlation matrix instead, nothing is easier:

corr_mat = with(correlations, matrix(Cor, nrow = max(i)))
corr_mat

      [,1]  [,2]  [,3]  [,4]
[1,]  1.00  1.00 -0.20 -0.26
[2,]  1.00  1.00 -0.43 -0.50
[3,] -0.20 -0.43  1.00 -0.90
[4,] -0.26 -0.50 -0.90  1.00
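The matrix() call fills values column by column, so it depends on the row order of correlations and gives the right picture here because the correlation matrix is symmetric. If the rows are ever reordered or filtered, a version that places each value by its indices may be safer. This is a small base R sketch; the helper name cor_matrix_from_long is made up, and the input is assumed to have the columns i, j and Cor shown above.

```r
# Hypothetical helper: put each Cor value at position [i, j] explicitly
cor_matrix_from_long <- function(correlations) {
  n <- max(correlations$i, correlations$j)
  m <- matrix(NA_real_, nrow = n, ncol = n)
  # A two-column index matrix addresses one cell per row of the table
  m[cbind(correlations$i, correlations$j)] <- correlations$Cor
  m
}

# Usage: corr_mat <- cor_matrix_from_long(correlations)
```

Pairs that are missing from the table stay NA instead of silently shifting the remaining values.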
