How to speed up or vectorize a for loop?

Question

I would like to speed up my for loop through vectorization, data.table, or some other approach. I have to run the code on 1,000,000 rows, and it is really slow.

The code is fairly self-explanatory, but I have included an explanation below just in case, along with the input and the output. Hopefully you can help me make it faster.

My goal is to bin the vector "Volume" so that each bin holds 100 shares. The vector "Volume" contains the number of shares traded. Here is what it looks like:

head(Volume, n = 60)
[1]  5  3  1  5  3  1  1  1  1  1  1  1 18  1  1 18  2  7 13  2  7 13  3  2  1  1  3  2  1  1  1
[32]  1  6  6  1  1  1  1  1  1  1  1 18  2  1  1  2  1 14 18  2  1  1  2  1 14  1  1  9  5

The vector "binIdexVector" has the same length as "Volume" and contains the bin number: every element in the first 100 shares gets the number 1, every element in the next 100 shares gets the number 2, every element in the next 100 shares gets the number 3, and so on. Here is what that vector looks like:

 head(binIdexVector, n = 60)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[48] 2 2 3 3 3 3 3 3 3 3 3 3 3

Here is my code:

#input as a vector
Volume<-c(5L, 3L, 1L, 5L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 1L, 1L, 
                   18L, 2L, 7L, 13L, 2L, 7L, 13L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 1L, 
                   1L, 1L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 2L, 1L, 
                   1L, 2L, 1L, 14L, 18L, 2L, 1L, 1L, 2L, 1L, 14L, 1L, 1L, 9L, 5L, 
                   2L, 1L, 1L, 1L, 1L, 9L, 5L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 
                   1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 2L, 9L, 9L, 3L, 3L, 1L, 1L, 
                   1L, 1L, 5L, 5L, 8L, 8L, 2L, 1L, 2L, 1L, 10L, 10L, 10L, 10L, 10L, 
                   10L, 10L, 10L, 9L, 9L, 1L, 1L, 8L, 1L, 8L, 1L, 8L, 8L, 2L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                   1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 
                   1L, 2L, 7L, 1L, 2L, 7L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
                   1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 
                   10L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 3L, 1L, 1L, 1L, 4L, 3L, 1L, 
                   1L, 1L, 4L, 25L, 1L, 1L, 25L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L)

binIdexVector <- numeric(length(Volume))

# initialize 
binIdex <-1
totalVolume <-0

for(i in seq_len(length(Volume))){

  totalVolume <- totalVolume + Volume[i]  

  if (totalVolume <= 100) {

    binIdexVector[i] <- binIdex

  } else {

    binIdex <- binIdex + 1
    binIdexVector[i] <- binIdex
    totalVolume <- Volume[i]
  }
}

# output:
> dput(binIdexVector)
c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
  2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
  4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 
  10, 10, 10, 10, 10, 10, 10, 10, 10, 10)

Thanks a lot for your help!

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.2

Answer 1

You can use Rcpp when vectorization is difficult.

library(Rcpp)
cppFunction('
  IntegerVector bin(NumericVector Volume, int n) {
    IntegerVector binIdexVector(Volume.size());
    int binIdex = 1;
    double totalVolume =0;

    for(int i=0; i<Volume.size(); i++){
      totalVolume = totalVolume + Volume[i];
      if (totalVolume <= n) {
        binIdexVector[i] = binIdex;
      } else {
        binIdex++;
        binIdexVector[i] = binIdex;
        totalVolume = Volume[i];
      }
    }
    return binIdexVector;
  }')

all.equal(bin(Volume, 100), binIdexVector)
#[1] TRUE

It is faster than findInterval(cumsum(Volume), seq(0, sum(Volume), by=100)), which in any case gives an inexact answer: the loop restarts the running total at the current element whenever a bin overflows, whereas findInterval cuts the cumulative sum at fixed multiples of 100.
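To illustrate why the findInterval approach is inexact, here is a minimal sketch on a made-up four-element vector. The name bin_loop is a hypothetical wrapper around the loop from the question, introduced only for this comparison, and the vector v is invented for illustration.

```r
# Hypothetical helper: the loop from the question wrapped in a function.
bin_loop <- function(volume, bin_size = 100) {
  bins <- integer(length(volume))
  bin <- 1L
  total <- 0
  for (i in seq_along(volume)) {
    total <- total + volume[i]
    if (total > bin_size) {
      # Overflow: open a new bin and restart the total at the current element.
      bin <- bin + 1L
      total <- volume[i]
    }
    bins[i] <- bin
  }
  bins
}

v <- c(60, 60, 60, 60)  # made-up data, not from the question

loop_bins <- bin_loop(v)
fi_bins <- findInterval(cumsum(v), seq(0, sum(v), by = 100))

print(loop_bins)  # 1 2 3 4
print(fi_bins)    # 1 2 2 3
```

The loop opens a new bin at every element here because each pair of 60s overflows 100, while findInterval only looks at where the cumulative sum (60, 120, 180, 240) falls relative to the breaks 0, 100, 200.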

Answer 2

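Volume <- sample(1:5, 500, replace = TRUE)
# Pad with a leading FALSE so binLabels has the same length as Volume
# (diff() alone returns one element fewer).
binLabels <- cumsum(c(FALSE, diff(cumsum(Volume) %% 100) < 0)) + 1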

This produces the vector binLabels, which indicates the bin each data point belongs to. Each bin holds as many data points as it takes for their sum to reach about 100: a new bin starts wherever the cumulative sum crosses a multiple of 100.
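As a small worked example of the idea in this answer, the sketch below traces each intermediate step on a made-up five-element vector (v and the intermediate names are hypothetical, not from the question). The leading FALSE is an assumption added here so that the result has one label per element.

```r
v <- c(40, 40, 40, 40, 40)  # made-up data for illustration

cs <- cumsum(v)             # 40 80 120 160 200
remainder <- cs %% 100      # 40 80  20  60   0

# A drop in the remainder means the cumulative sum crossed a multiple of 100.
# diff() returns length(v) - 1 values, so pad the front with FALSE.
wrapped <- c(FALSE, diff(remainder) < 0)

labels <- cumsum(wrapped) + 1
print(labels)  # 1 1 2 2 3
```

Note that the last element, which brings the cumulative sum to exactly 200, starts bin 3 because 200 %% 100 is 0 and therefore counts as a drop.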

Answer 1: 12 (accepted), 2015-03-14 23:24:17

Answer 2: 0, 2015-03-14 21:41:04