Making this loop faster in R

Question

How can I speed up the following (noob) code:

# "mymatrix" is the matrix of word counts (docs x terms)
# "tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=TRUE)

# Apply a transformation to each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  # Weight each count; zero counts stay zero
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  # Scale the row to unit Euclidean length
  tfidfmatrix[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)

The problem is that the matrices I am working with are fairly large (~40k x 100k), and this code is very slow.

The reason I am not using "apply" (instead of a for loop and sapply) is that apply would give me the transpose of the matrix I want: I want num_of_docs X num_of_words, but apply returns num_of_words X num_of_docs. I would then have to spend more time computing the transpose and re-allocating it.
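
To illustrate the transpose behaviour described above, here is a tiny sketch on a made-up 2 x 3 matrix (the matrix and the function are placeholders, not the real data): apply over rows stores each result as a column, so the output has to be flipped back with t().

```r
# Made-up 2 x 3 matrix standing in for a docs x terms matrix
m <- matrix(1:6, nrow = 2)

# apply() over rows (MARGIN = 1) returns each row's result as a COLUMN
by_row <- apply(m, 1, function(r) r * 10)
dim(by_row)          # 3 x 2, i.e. the transpose of the shape we want

# Transposing restores the docs x terms layout
fixed <- t(by_row)
dim(fixed)           # 2 x 3
```

A fully vectorized expression such as m * 10 keeps the original shape and avoids both the loop and the transpose.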

Any thoughts on making this faster?

Thanks much.

Edit: I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code?

Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0, every element of s.vec is set to 0. Just to reiterate, my original matrix is a sparse matrix containing integers. This is due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0, things work as expected. Thoughts?
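
One possible way to sidestep the -Inf values altogether on a sparse matrix is to transform only the stored non-zero entries, so log(0) is never evaluated. This is a minimal sketch, not a diagnosis of the quirk above: it assumes the counts are held in a dgCMatrix from the Matrix package and that ndocs is the number of rows, and the small counts matrix is made up for illustration.

```r
library(Matrix)

# Made-up example: 3 documents x 4 terms, stored as a sparse matrix
counts <- Matrix(c(2, 0, 0, 1,
                   0, 2, 0, 0,
                   1, 0, 0, 5),
                 nrow = 3, byrow = TRUE, sparse = TRUE)
ndocs <- nrow(counts)   # assumption: ndocs is the number of documents

# Drop any explicitly stored zeros, then transform only the non-zero values
# held in the @x slot; implicit zeros stay zero
s <- drop0(counts)
s@x <- (1 + log(s@x)) * log((1 + ndocs) / (1 + s@x))

# Scale each row to unit Euclidean length
row_norms <- sqrt(rowSums(s^2))
row_norms[row_norms == 0] <- 1   # guard: leave all-zero rows unchanged
tfidf <- Diagonal(x = 1 / row_norms) %*% s
```

The result stays sparse throughout, which matters at ~40k x 100k, where a dense intermediate would be very large.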

Answer 1

As per my comment,

#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1

justin <- function(){
    s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
    tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

joran <- function(){
    s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
    s.vec[!is.finite(s.vec)] <- 0
    tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

library(rbenchmark)
benchmark(justin(),joran(),replications = 1000)

      test replications elapsed relative user.self sys.self user.child sys.child
2  joran()         1000   0.940  1.00000     0.842    0.105          0         0
1 justin()         1000   2.786  2.96383     2.617    0.187          0         0

So dropping ifelse and zeroing the non-finite values afterwards is around 3x faster.

Answer 2

I'm not sure what ndocs is, but ifelse is already vectorized, so you should be able to use the ifelse statement without walking through the matrix row by row and calling sapply along each row. The same can be said for the final calc.

However, you haven't given a complete example to replicate...

mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1

s.vec <- ifelse(mymatrix==0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))

for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] <- s
}

all.equal(s.vec, tfmat)

So the only piece missing is the rowSums in your final calc.

tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))

for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] = s/sqrt(sum(s^2))
}

all.equal(tfmat, tfmat.vec)
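
In case it is not obvious why dividing a matrix by the rowSums vector normalizes each row, here is a small sketch on a made-up 2 x 2 matrix. R stores matrices column by column and recycles the shorter vector, so element [i, j] ends up divided by the i-th norm. The sweep() call is just an equivalent, more explicit spelling.

```r
# Made-up matrix whose row norms are easy to check by hand (5 and 10)
m <- matrix(c(3, 4,
              6, 8), nrow = 2, byrow = TRUE)

norms <- sqrt(rowSums(m^2))

# The length-nrow vector is recycled down each column,
# so row i is divided by norms[i]
scaled <- m / norms

# Same operation, stated explicitly: divide along margin 1 (rows)
scaled_sweep <- sweep(m, 1, norms, "/")
```

This only works because the vector length equals the number of rows; dividing by a colSums-style vector the same way would silently give the wrong answer.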
