Repeating replicate() in R without a loop

This question is mostly for my learning of good R programming practice. I'd like to call replicate repeatedly, varying a single input to the expression inside it. I can easily do this with a for loop, but I've heard repeatedly that if I'm using for loops in R, I'm doing it wrong. Is there a way to repeat a call to replicate using different inputs without a loop? My working loop is below, followed by my best attempt so far.

Working Code with Loop:

set.seed(1564) #Birth of Galileo!
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)

cor(x, y)

cor.fxn <- function(N, x, y) {
  samp.row <- sample(1:1000, N)
  cor(x[samp.row], y[samp.row])
}

N.list <- seq(3,20)
cor.list <- rep(NA_real_, length(N.list))
for (N in N.list){
  cor.resamp <- replicate(1000, cor.fxn(N, x, y))
  cor.list[N-2] <- median(cor.resamp)
}
plot(N.list, cor.list)

Nonfunctional best attempt without loop:

set.seed(1564) #Birth of Galileo!
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
X <- list(3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
eggs <- lapply(X, replicate, n=1000, expr=cor.fxn, x=x, y=y)

Which will error out:

Error in FUN(X[[i]], ...) : 
  unused arguments (x = c(9.17486389116665, 13.6573453081421, 12.2166561575586, 11.3619489970582, 17.9998611075272, 11.1171958860255, 20.4489048239365, 16.8825343591062, 12.9990097472942, 12.5617129892976, 10.9833420846924, 13.7732692244654, 16.9641205588413, 11.1309409503371, 11.7859737745279,...

Thank you for any assistance.

Loops can be slow in R, but the part you probably didn't hear is that the real goal is to vectorize your operations. The *apply family of functions is not inherently faster than a for loop. Let's look at some benchmarks:

# Boiler plate code used for both functions

cor.fxn <- function(N, x, y) {
  samp.row <- sample(1:1000, N)
  cor(x[samp.row], y[samp.row])
}

set.seed(1564) #Birth of Galileo!
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
N.list <- seq(3,20)

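# Using a for loop
foo_a = function(...) {
  cor.list <- rep(NA_real_, length(N.list))
  for (i in seq_along(N.list)) {
    # index by position instead of relying on N - 2
    cor.resamp <- replicate(1000, cor.fxn(N.list[i], x, y))
    cor.list[i] <- median(cor.resamp)
  }
  cor.list  # return the medians (a bare for loop returns NULL)
}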

# Using sapply
foo_b = function(...) sapply(3:20, function(n) median(replicate(1000, cor.fxn(n, x, y))))

library(microbenchmark)  # install.packages("microbenchmark") if it is not installed
microbenchmark(foo_a(), foo_b(), times = 100L)

The timings show no meaningful difference between the two methods, consistent with the point above:

Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 foo_a() 939.7068 1041.964 1140.159 1146.065 1243.540 1367.411   100
 foo_b() 936.5962 1045.023 1138.337 1133.074 1239.099 1334.430   100
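
The sapply version also shows why the lapply attempt in the question fails: replicate() only accepts n, expr and simplify, so the extra x = and y = arguments have nowhere to go and are reported as unused. Wrapping the call in an anonymous function avoids that. The sketch below is one way to check that the loop and the apply-style version compute the same medians when started from the same seed. The names n.rep, loop.res and apply.res, the reduced replicate count, and the second seed (42) are my own choices for illustration, not part of the question.

```r
# Same setup as in the question
set.seed(1564)
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
N.list <- seq(3, 20)

cor.fxn <- function(N, x, y) {
  samp.row <- sample(1:1000, N)
  cor(x[samp.row], y[samp.row])
}

n.rep <- 200  # assumption: fewer replicates than the original 1000, just to run quickly

# Loop version, indexed by position
set.seed(42)  # arbitrary seed so both versions see the same random draws
loop.res <- rep(NA_real_, length(N.list))
for (i in seq_along(N.list)) {
  loop.res[i] <- median(replicate(n.rep, cor.fxn(N.list[i], x, y)))
}

# lapply version: the anonymous function receives each N and calls replicate itself
set.seed(42)
apply.res <- unlist(lapply(N.list, function(n) {
  median(replicate(n.rep, cor.fxn(n, x, y)))
}))

# Both versions consume random numbers in the same order, so the results should match
identical(loop.res, apply.res)
```

If a numeric vector is wanted directly, vapply(N.list, function(n) ..., numeric(1)) is a stricter alternative to unlist(lapply(...)), since it checks that every result is a single number.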

This specific test case doesn't lend itself to vectorization, since each value is the median of 1000 runs of a random resampling process. The whole point of this post is that for loops are not inherently worse than *apply family functions in R. However, you should always seek a vectorized solution over a looping/apply solution when possible.
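
As a minimal illustration of what "vectorized" means here, the sketch below computes the same result element by element in a loop and in a single whole-vector expression. It is not taken from the question; the function names scale_loop and scale_vec and the vector x.big are made up for the example.

```r
set.seed(1564)
x.big <- rnorm(1e5, 15, 3)

# Loop version: touches one element per iteration
scale_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) {
    out[i] <- 2 * v[i] + 1
  }
  out
}

# Vectorized version: the arithmetic is applied to the whole vector at once
scale_vec <- function(v) 2 * v + 1

# Both give the same values; only the way the work is expressed differs
all.equal(scale_loop(x.big), scale_vec(x.big))
```

Wrapping each call in system.time() or microbenchmark() is a simple way to compare the two on your own machine; the vectorized form is typically the faster one because the iteration happens in compiled code rather than in the R interpreter.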
