
Making nested for loops in R more efficient

I am working on a research project where I want to determine the equivalence of two distributions. I am currently using the Mann-Whitney test for equivalence, and the code I am running (below) was provided with the book Testing Statistical Hypotheses of Equivalence and Noninferiority by Stefan Wellek (2010). Before running my data, I am testing this code with random normal distributions that have the same mean and standard deviation. My problem is that there are three nested for loops, and with larger distribution sizes (as in the example below) the code takes forever to run. If I only had to run it once that would not be such a problem, but I am doing a simulation test and creating power curves, so I need to run many iterations of this code (around 10,000). At the moment, depending on how I alter the distribution sizes, it takes days to run 10,000 iterations.

Any help improving the performance of this code would be greatly appreciated.

x <- rnorm(n=125, mean=3, sd=1)
y <- rnorm(n=500, mean=3, sd=1)

alpha <- 0.05
m <- length(x)
n <- length(y)
eps1_ <- 0.2 #0.1382 default
eps2_ <- 0.2 #0.2602 default

eqctr <- 0.5 + (eps2_-eps1_)/2 
eqleng <- eps1_ + eps2_

wxy <- 0
pihxxy <- 0
pihxyy <- 0

for (i in 1:m)
 for (j in 1:n)
  wxy <- wxy + trunc(0.5*(sign(x[i] - y[j]) + 1))

for (i in 1:m)
 for (j1 in 1:(n-1))
  for (j2 in (j1+1):n)
    pihxyy <- pihxyy + trunc(0.5*(sign(x[i] - max(y[j1],y[j2])) + 1))

for (i1 in 1:(m-1))
 for (i2 in (i1+1):m)
  for (j in 1:n)
    pihxxy <- pihxxy + trunc(0.5*(sign(min(x[i1],x[i2]) - y[j]) + 1))

wxy <- wxy / (m*n)
pihxxy <- pihxxy*2 / (m*(m-1)*n)
pihxyy <- pihxyy*2 / (n*(n-1)*m)
sigmah <- sqrt((wxy-(m+n-1)*wxy**2+(m-1)*pihxxy+(n-1)*pihxyy)/(m*n))

crit <- sqrt(qchisq(alpha,1,(eqleng/2/sigmah)**2))

if (abs((wxy-eqctr)/sigmah) >= crit) rej <- 1
if (abs((wxy-eqctr)/sigmah) < crit)  rej <- 0

if (is.na(sigmah) || is.na(crit)) rej <- 1

MW_Decision <- rej

cat(" ALPHA =",alpha,"  M =",m,"  N =",n,"  EPS1_ =",eps1_,"  EPS2_ =",eps2_,
  "\n","WXY =",wxy,"  SIGMAH =",sigmah,"  CRIT =",crit,"  REJ=",MW_Decision)

You can use outer instead of the first double loop:

set.seed(42)

f1 <- function(x,y) {
 wxy <- 0
 # loop over the arguments themselves rather than the global m and n
 for (i in seq_along(x))
  for (j in seq_along(y))
   wxy <- wxy + trunc(0.5*(sign(x[i] - y[j]) + 1))
 wxy
}

f2 <- function(x,y) sum(outer(x,y, function(x,y) trunc(0.5*(sign(x-y)+1))))

f1(x,y)
[1] 32041
f2(x,y)
[1] 32041

You get a roughly 50x speedup:

library(microbenchmark)
microbenchmark(f1(x,y),f2(x,y))
Unit: milliseconds
     expr        min         lq     median         uq      max neval
 f1(x, y) 138.223841 142.586559 143.642650 145.754241 183.0024   100
 f2(x, y)   1.846927   2.194879   2.677827   3.141236  21.1463   100

The other loops are trickier.
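One possible way to remove the two triple loops as well, offered as a sketch rather than as part of the original answer: the indicator trunc(0.5*(sign(d)+1)) is 1 only when d > 0, so for a fixed x[i] the inner double loop counts the pairs of y values that both lie strictly below x[i]. If c of the y values lie below x[i], that is choose(c, 2) pairs, and the same argument applies to the other loop with the roles of x and y swapped. The helper name pair_counts is made up for illustration.

```r
# Sketch: replace all three loop nests with counts taken from one comparison matrix.
# pair_counts is a hypothetical helper name, not part of the original code.
pair_counts <- function(x, y) {
  gt <- outer(x, y, ">")          # gt[i, j] is TRUE when x[i] > y[j] (strictly)
  below_x <- rowSums(gt)          # for each x[i]: number of y values below it
  above_y <- colSums(gt)          # for each y[j]: number of x values above it
  list(
    wxy    = sum(gt),                  # pairs (i, j) with x[i] > y[j]
    pihxyy = sum(choose(below_x, 2)),  # triples (i, j1 < j2) with x[i] > max(y[j1], y[j2])
    pihxxy = sum(choose(above_y, 2))   # triples (i1 < i2, j) with min(x[i1], x[i2]) > y[j]
  )
}

set.seed(1)
x <- rnorm(125, mean = 3, sd = 1)
y <- rnorm(500, mean = 3, sd = 1)
str(pair_counts(x, y))
```

These are the raw counts, so they would still need the same normalisation by m*n, m*(m-1)*n and n*(n-1)*m that the question's code applies after its loops.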

See the edit below for an even better suggestion.

One simple way to get a bit of a speed boost is to byte compile your code.

For example, I wrapped your code into a function starting from the alpha <- 0.05 line and ran it on my laptop. Simply byte compiling your current code makes it run about twice as fast.

set.seed(1234)
x <- rnorm(n=125, mean=3, sd=1)
y <- rnorm(n=500, mean=3, sd=1)

# f1 <- function(x,y){ ...your code...}

system.time(f1(x, y))
#   user  system elapsed 
# 33.249   0.008  33.278 

library(compiler)
f2 <- cmpfun(f1)

system.time(f2(x, y))

#   user  system elapsed 
# 17.162   0.002  17.170 
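For reference, here is a self-contained version of this experiment. It is only a sketch: it uses the question's first double loop as a stand-in for the full function, and the names count_loop and count_loop_cmp are hypothetical. Note that R 3.4.0 and later byte-compile functions just-in-time by default, so on a current installation the explicit cmpfun call may show little or no additional gain.

```r
library(compiler)

# Stand-in for the full function: the first double loop from the question.
count_loop <- function(x, y) {
  wxy <- 0
  for (i in seq_along(x))
    for (j in seq_along(y))
      wxy <- wxy + trunc(0.5 * (sign(x[i] - y[j]) + 1))
  wxy
}

# Byte-compiled copy; it computes exactly the same result.
count_loop_cmp <- cmpfun(count_loop)

set.seed(1234)
x <- rnorm(125, mean = 3, sd = 1)
y <- rnorm(500, mean = 3, sd = 1)

# Timings depend on the machine and R version, so none are quoted here.
system.time(count_loop(x, y))
system.time(count_loop_cmp(x, y))
```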

EDIT

I should add, this is the type of thing that a different language would do much better than R. Have you looked at the Rcpp and inline packages?

I've been curious to learn how to use them, so I figured this was a good chance.

Here's a tweak of your code using the inline package and Fortran (since I'm more comfortable with that than C). It wasn't hard at all (provided you know Fortran or C); I just followed the examples listed in cfunction.

First, let's rewrite your loops and compile them:

library(inline)

# Fortran code for first loop
loop1code <- "
   integer i,  j1,  j2
   real*8 tmp
   do i = 1, m
      do j1 = 1, n-1
         do j2 = j1+1, n
            tmp = x(i) - max(y(j1),y(j2))
            if (tmp > 0.) pihxyy = pihxyy + 1
         end do
      end do
   end do
"    
# Compile the code and turn loop into a function
loop1fun <- cfunction(sig = signature(x="numeric", y="numeric", pihxyy="integer", m="integer", n="integer"), dim=c("(m)", "(n)", "", "", ""), loop1code, language="F95")

# Fortran code for second loop
loop2code <- "
   integer i1, i2,  j
   real*8 tmp
   do i1 = 1, m-1
      do i2 = i1+1, m
         do j = 1, n
            tmp = min(x(i1), x(i2)) - y(j)
            if (tmp > 0.) pihxxy = pihxxy + 1
         end do
      end do
   end do
"    
# Compile the code and turn loop into a function
loop2fun <- cfunction(sig = signature(x="numeric", y="numeric", pihxxy="integer", m="integer", n="integer"), dim=c("(m)", "(n)", "", "", ""), loop2code, language="F95")

Now let's create a new function that uses these. To keep it short, I'll just sketch the key parts I modified from your code:

f3 <- function(x, y){

  # ... code ...

# Remove old loop
## for (i in 1:m)
##  for (j1 in 1:(n-1))
##   for (j2 in (j1+1):n)
##     pihxyy <- pihxyy + trunc(0.5*(sign(x[i] - max(y[j1],y[j2])) + 1))

# Call new function from compiled code instead
pihxyy <- loop1fun(x, y, pihxyy, m, n)$pihxyy

# Remove second loop
## for (i1 in 1:(m-1))
##  for (i2 in (i1+1):m)
##   for (j in 1:n)
##     pihxxy <- pihxxy + trunc(0.5*(sign(min(x[i1],x[i2]) - y[j]) + 1))

# Call new compiled function for second loop
pihxxy <- loop2fun(x, y, pihxxy, m, n)$pihxxy

# ... code ...
}

And now we run it and voilà, we get a huge speed boost! :)

system.time(f3(x, y))
#   user  system elapsed 
#   0.12    0.00    0.12 

I did check that it gives the same results as your code, but you probably want to run some additional tests just in case.
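A sketch of what such additional tests could look like: compare any fast implementation against the original loops on many small random inputs, including inputs with ties, since the indicator counts only strict inequalities. The names reference_counts, check_counts and brute_counts are hypothetical. brute_counts is only a pure-R stand-in so the example runs without a compiler; in practice the candidate would be a wrapper around loop1fun and loop2fun.

```r
# Reference: the question's two triple loops, unchanged in logic.
reference_counts <- function(x, y) {
  m <- length(x); n <- length(y)
  pihxyy <- 0; pihxxy <- 0
  for (i in 1:m)
    for (j1 in 1:(n - 1))
      for (j2 in (j1 + 1):n)
        pihxyy <- pihxyy + trunc(0.5 * (sign(x[i] - max(y[j1], y[j2])) + 1))
  for (i1 in 1:(m - 1))
    for (i2 in (i1 + 1):m)
      for (j in 1:n)
        pihxxy <- pihxxy + trunc(0.5 * (sign(min(x[i1], x[i2]) - y[j]) + 1))
  c(pihxyy = pihxyy, pihxxy = pihxxy)
}

# Run `candidate` on small random integer inputs (so ties occur) and
# stop at the first disagreement with the reference.
check_counts <- function(candidate, trials = 200) {
  for (k in seq_len(trials)) {
    x <- sample(1:5, sample(2:8, 1), replace = TRUE)
    y <- sample(1:5, sample(2:8, 1), replace = TRUE)
    stopifnot(isTRUE(all.equal(unname(candidate(x, y)),
                               unname(reference_counts(x, y)))))
  }
  TRUE
}

# Hypothetical stand-in candidate: enumerate all pairs with combn().
brute_counts <- function(x, y) {
  c(pihxyy = sum(outer(x, combn(y, 2, max), ">")),
    pihxxy = sum(outer(combn(x, 2, min), y, ">")))
}

set.seed(1)
check_counts(brute_counts)
```

Integer-valued inputs are used on purpose: with continuous draws from rnorm, ties essentially never occur, so a candidate that used >= instead of > would pass unnoticed.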
