How can I make my loop run faster in R?

Question

I'm using a function to get p-values from multiple HWE chi-square tests. To do this, I loop through a large matrix called geno.data (313 rows x 355232 columns), processing two columns of the matrix at a time, row by row. It runs very slowly. How can I make it faster? Thanks.

library(genetics)
geno.data<-matrix(c("a","c"), nrow=313,ncol=355232)
Num_of_SNPs<-ncol(geno.data) /2
alleles<- vector(length = nrow(geno.data))
HWE_pvalues<-vector(length = Num_of_SNPs)
j<- 1

for (count in 1:Num_of_SNPs){
    for (i in 1:nrow(geno.data)){
        alleles[i]<- levels(genotype(paste(geno.data[i,c(2*j -1, 2*j)], collapse = "/")))
    }
    g2 <- genotype(alleles)
    HWE_pvalues[count]<-HWE.chisq(g2)[3]
    j = j + 2
}

Answer 1

First, note that the posted code will result in an index-out-of-bounds error: j increases by 2 on every pass, so by the final iteration of the main loop it reaches ncol(geno.data)-1, yet you're accessing columns 2*j-1 and 2*j. The subscript therefore runs past the last column well before the loop ends. I'm assuming you instead want columns 2*count-1 and 2*count, in which case j can be removed.

Vectorization is extremely important for writing fast R code. In your code you're calling the paste function 313 times per SNP, each time on the two alleles of a single row. It's much faster in R to call paste once, passing vectors of length 313. Here are the original and vectorized interiors of the main for loop:

# Original
get.pval1 <- function(count) {
  for (i in 1:nrow(geno.data)){
    alleles[i]<- levels(genotype(paste(geno.data[i,c(2*count -1, 2*count)], collapse = "/")))
  }
  g2 <- genotype(alleles)
  HWE.chisq(g2)[3]
}

# Vectorized
get.pval2 <- function(count) {
  g2 <- genotype(paste0(geno.data[,2*count-1], "/", geno.data[,2*count]))
  HWE.chisq(g2)[3]
}
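
To see the same idea in isolation from the genetics package, here is a minimal sketch that builds the "allele/allele" strings both ways and checks that they match. The matrix toy and its values are made up for illustration, and the sketch covers only the string-building step, not the genotype() or HWE.chisq() calls.

```r
# Toy stand-in for geno.data: 5 individuals, one SNP stored in two allele columns.
# (Hypothetical values for illustration only; not taken from the original data.)
toy <- matrix(c("a", "c", "a", "c", "c",
                "c", "c", "a", "a", "c"), nrow = 5, ncol = 2)

# Looped version: one paste() call per row, each on that row's two alleles.
looped <- character(nrow(toy))
for (i in seq_len(nrow(toy))) {
  looped[i] <- paste(toy[i, c(1, 2)], collapse = "/")
}

# Vectorized version: a single paste0() call over the two whole columns.
vectorized <- paste0(toy[, 1], "/", toy[, 2])

# Both approaches produce the same character vector.
identical(looped, vectorized)
# [1] TRUE
```

The loop pays R's function-call overhead once per row, while the vectorized call pays it once per SNP, which is where most of the saving comes from.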

We get about a 20x speedup from the vectorization:

library(microbenchmark)
all.equal(get.pval1(1), get.pval2(1))
# [1] TRUE
microbenchmark(get.pval1(1), get.pval2(1))
# Unit: milliseconds
#          expr       min        lq      mean    median        uq       max neval
#  get.pval1(1) 299.24079 304.37386 323.28321 307.78947 313.97311 482.32384   100
#  get.pval2(1)  14.23288  14.64717  15.80856  15.11013  16.38012  36.04724   100

With the vectorized code, your code should finish in about 177616*.01580856 = 2807.853 seconds, or about 45 minutes (compared to 16 hours for the original code). If this is still not fast enough for you, then I would encourage you to look at the parallel package in R. The mcmapply function should give a good speedup for you, since each iteration of the outer for loop is independent.
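
As a rough sketch of how that parallel step could look, the example below runs a per-SNP function over all SNP indices with mcmapply and compares the result with a sequential sapply. The matrix toy, the function per_snp, and the choice of two cores are assumptions made for illustration; per_snp is a simple stand-in that counts heterozygous individuals, and in the real analysis you would pass get.pval2 instead. Forking is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Toy stand-in for geno.data: 20 individuals, 4 SNPs (2 allele columns each).
# (Hypothetical data for illustration only.)
set.seed(1)
toy <- matrix(sample(c("a", "c"), 20 * 8, replace = TRUE), nrow = 20, ncol = 8)
n_snps <- ncol(toy) / 2

# Hypothetical per-SNP function standing in for get.pval2:
# counts individuals whose two alleles differ at SNP number `count`.
per_snp <- function(count) {
  sum(toy[, 2 * count - 1] != toy[, 2 * count])
}

# mcmapply relies on forking, which Windows does not support,
# so use a single core there (an assumption of two cores elsewhere).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each SNP is independent, so the indices can be split across cores.
res_par <- mcmapply(per_snp, seq_len(n_snps), mc.cores = n_cores)

# Sequential reference result for comparison.
res_seq <- sapply(seq_len(n_snps), per_snp)

all.equal(res_par, res_seq)
```

If you swap in get.pval2, consider returning HWE.chisq(g2)$p.value rather than HWE.chisq(g2)[3], so that each call yields a plain number and the combined result simplifies to a numeric vector.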
