Calculating a correlation matrix with pspearman package with apply() function

I am trying to calculate the Spearman correlation and p-value for a data frame. For a better p-value approximation, I must stick to the pspearman package. I am expecting a result similar to that of the rcorr() function, but I have a problem when performing pspearman::spearman.test() row by row.

My data frame contains 5000 rows (genes) and 200 columns (spots). I want to get a correlation matrix and a p-value matrix for these 5000*5000 gene-gene pairs. The correlation is only calculated when both genes are non-NA in more than two spots.

I can achieve this with loops, but it is too slow for my big dataset. I have problems when I try to use apply(), sapply(), mapply() to improve the speed.

This is what I've tried:

data = data.frame(matrix(rbinom(10*100000, 50, .5), ncol=200))
dim(data) #5000, 200
rownames(data) <- paste("gene", 1:5000, sep="") 
colnames(data) <- paste("spot",1:200,sep='')

library(pspearman)
spearFunc = function(x,y=data) {
  df = rbind(x,y)
  # Check the number of complete spots.There are no NAs in this set.
  complete = sum(!(is.na(x)) & !(is.na(y)))
  if (complete >=2 ) {
    pspearman::spearman.test(as.numeric(x),as.numeric(y))
    # This function returns a list containing 8 values, like pvalue,correlation
    }}

pair.all1 = mapply(spearFunc,data,data)
dim(pair.all1)
# 8 200, 200 is the number of columns 
pair.all2 = apply(data,1,spearFunc) 

Which results in the error:

Error in pspearman::spearman.test(as.numeric(x), as.numeric(y)) : (list) object cannot be coerced to type 'double'

I hope to use apply() to run spearman.test on every gene pair, that is, to do:

spearman.test(data[gene1],data[gene1]) 
spearman.test(data[gene1],data[gene2])
....
spearman.test(data[gene1],data[gene5000])
...
spearman.test(data[gene5000],data[gene5000])

It should return a data frame of 8 rows and 25,000,000 columns (5000*5000 gene pairs).

Is it possible to use apply() inside apply() to achieve my purpose?

Thx!

Consider creating pair-wise combinations of genes from row.names with combn and then iterating through the list of pairs with a defined function. Be sure to return an NA structure from the if logic to avoid NULL in the matrix output.

However, be forewarned that the number of pair-wise combinations of 5,000 genes ( choose(5000, 2) ) is very high: 12,497,500 elements! Hence, sapply (a loop itself) may not differ much in performance from for. Look into parallelizing the iteration.

gene_combns <- combn(row.names(data), 2, simplify = FALSE)

spear_func <- function(x) {
  # EXTRACT ROWS BY ROW NAMES  
  row1 <- as.numeric(data[x[1],])
  row2 <- as.numeric(data[x[2],]) 

  # Count the spots where both genes are non-NA. There are no NAs in this set.
  complete <- sum(!is.na(row1) & !is.na(row2))

  if (complete >=2 ) {
    pspearman::spearman.test(row1, row2)        
  } else {
    c(statistic=NA, parameter=NA, p.value=NA, estimate=NA, 
      null.value=NA, alternative=NA,   method=NA, data.name=NA)
  }
}

pair.all2 <- sapply(gene_combns, spear_func)
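Since the stated goal is a correlation matrix and a p-value matrix, each pair's result can be reduced to two numbers and written into both triangles of a symmetric matrix. The following is only a minimal sketch under assumptions: it uses a small simulated matrix, and base R's cor.test stands in for pspearman::spearman.test so that it runs without the package. The names toy, pair_stats, rho_mat and p_mat are illustrative and not from the original post.

```r
set.seed(1)
# Hypothetical toy data: 6 genes (rows) x 20 spots (columns)
toy <- matrix(rnorm(6 * 20), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("spot", 1:20)))

pairs <- combn(rownames(toy), 2, simplify = FALSE)

# Return only rho and the p-value for one pair of gene names;
# NA for pairs with fewer than two complete spots
pair_stats <- function(p) {
  x <- toy[p[1], ]
  y <- toy[p[2], ]
  if (sum(!is.na(x) & !is.na(y)) < 2) {
    return(c(rho = NA_real_, p.value = NA_real_))
  }
  # cor.test is a stand-in (assumption); swap in pspearman::spearman.test if installed
  res <- cor.test(x, y, method = "spearman")
  c(rho = unname(res$estimate), p.value = res$p.value)
}

# 2 x choose(6, 2) numeric matrix, one column per pair
pair_res <- vapply(pairs, pair_stats, c(rho = 0, p.value = 0))

# Two-column character matrix of gene names, used to index by dimnames
idx <- do.call(rbind, pairs)

genes <- rownames(toy)
rho_mat <- matrix(NA_real_, length(genes), length(genes),
                  dimnames = list(genes, genes))
p_mat <- rho_mat

# Fill both triangles so the matrices are symmetric
rho_mat[idx] <- pair_res["rho", ]
rho_mat[idx[, 2:1]] <- pair_res["rho", ]
p_mat[idx] <- pair_res["p.value", ]
p_mat[idx[, 2:1]] <- pair_res["p.value", ]
diag(rho_mat) <- 1   # a gene correlates perfectly with itself; diag(p_mat) stays NA
```

Keeping only two numbers per pair also avoids storing the full 8-element test object for each of the 12,497,500 pairs.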

Testing

The above has been tested with cor.test (which takes exactly the same input args and returns the same output list as spearman.test, the latter giving the more accurate p-value) using a small sample of the dataset (50 obs, 20 vars):

set.seed(82418)
data <- data.frame(matrix(rbinom(10*100000, 50, .5), ncol=200))[1:50, 1:20]
rownames(data) <- paste0("gene", 1:50) 
colnames(data) <- paste0("spot", 1:20)

gene_combns <- combn(row.names(data), 2, simplify = FALSE)
# [[1]]
# [1] "gene1" "gene2"    
# [[2]]
# [1] "gene1" "gene3"    
# [[3]]
# [1] "gene1" "gene4"    
# [[4]]
# [1] "gene1" "gene5"    
# [[5]]
# [1] "gene1" "gene6"    
# [[6]]
# [1] "gene1" "gene7"

test <- sapply(gene_combns, spear_func)  # SAME FUNC BUT WITH cor.test
test[,1:5]

#             [,1]                              [,2]                             
# statistic   885.1386                          1659.598                         
# parameter   NULL                              NULL                             
# p.value     0.1494607                         0.2921304                        
# estimate    0.3344823                         -0.2478179                       
# null.value  0                                 0                                
# alternative "two.sided"                       "two.sided"                      
# method      "Spearman's rank correlation rho" "Spearman's rank correlation rho"
# data.name   "row1 and row2"                   "row1 and row2"                  
#             [,3]                              [,4]                             
# statistic   1554.533                          1212.988                         
# parameter   NULL                              NULL                             
# p.value     0.4767667                         0.7122505                        
# estimate    -0.1688217                        0.08797877                       
# null.value  0                                 0                                
# alternative "two.sided"                       "two.sided"                      
# method      "Spearman's rank correlation rho" "Spearman's rank correlation rho"
# data.name   "row1 and row2"                   "row1 and row2"                  
#             [,5]                             
# statistic   1421.707                         
# parameter   NULL                             
# p.value     0.7726922                        
# estimate    -0.06895299                      
# null.value  0                                
# alternative "two.sided"                      
# method      "Spearman's rank correlation rho"
# data.name   "row1 and row2"    
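The suggestion to parallelize the iteration could look roughly like the sketch below, which uses the parallel package that ships with R. It rests on several assumptions: two local worker processes can be started, the data is a small simulated matrix, and cor.test again stands in for pspearman::spearman.test (on a real run, pspearman would need to be installed where the workers can load it). The names toy and pair_stats are illustrative.

```r
library(parallel)  # ships with R

set.seed(82418)
# Hypothetical toy data: 8 genes (rows) x 20 spots (columns), no ties or NAs
toy <- matrix(rnorm(8 * 20), nrow = 8,
              dimnames = list(paste0("gene", 1:8), paste0("spot", 1:20)))

pairs <- combn(rownames(toy), 2, simplify = FALSE)

# The data is passed as an argument so each worker receives it explicitly
pair_stats <- function(p, m) {
  # cor.test is a stand-in (assumption) for pspearman::spearman.test
  res <- stats::cor.test(m[p[1], ], m[p[2], ], method = "spearman")
  c(rho = unname(res$estimate), p.value = res$p.value)
}

cl <- makeCluster(2)                        # two local workers (assumed available)
par_res <- parLapply(cl, pairs, pair_stats, m = toy)
stopCluster(cl)

par_mat <- do.call(cbind, par_res)          # 2 x choose(8, 2) matrix

# Sequential version of the same computation, for comparison
seq_mat <- vapply(pairs, pair_stats, c(rho = 0, p.value = 0), m = toy)
```

Both versions should produce the same numbers. Whether the parallel one is actually faster depends on the number of pairs and the cost of starting workers and shipping the data to them, so it is worth timing on a subset first.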
