简体   繁体   中英

Make this loop faster in R

How can I speed up the following (noob) code:

#"mymatrix" is the matrix of word counts (docs X terms) 
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)

#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)

Problem is that the matrices I am working on are fairly large (~40kX100k), and this code is very slow.

The reason I am not using "apply" (instead of using a for loop and sapply) is that apply is going to give me the transpose of the matrix I want - I want num_of_docs X num_of_words, but apply will give me the transpose. I will then have to spend more time computing the transpose and re-allocating it.

Any thoughts on making this faster?

Thanks much.

Edit : I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code from?

Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0 every element of s.vec is being set to 0. Just to re-iterate my original matrix is a sparse matrix containing integers. This is due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0 things work as expected. Thoughts?

As per my comment,

#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1

justin <- function(){
    s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
    tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

joran <- function(){
    s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
    s.vec[!is.finite(s.vec)] <- 0
    tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

require(rbenchmark)    
benchmark(justin(),joran(),replications = 1000)

  test replications elapsed relative user.self sys.self user.child sys.child
2  joran()         1000   0.940  1.00000     0.842    0.105          0         0
1 justin()         1000   2.786  2.96383     2.617    0.187          0         0

So it's around 3x faster or so.

not sure what ndocs is, but ifelse is already vectorized, so you should be able to use the ifelse statement without walking through the matrix row by row and sapply along the row. The same can be said for the final calc.

However, you haven't given a complete example to replicate...

mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1

s.vec <- ifelse(mymatrix==0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))

for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] <- s
}

all.equal(s.vec, tfmat)

so the only piece missing is the rowSums in your final calc.

tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))

for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] = s/sqrt(sum(s^2))
}

all.equal(tfmat, tfmat.vec)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM