
Vectorize function / increase calculation speed in data.table

Currently I have the following data.table:

item     city    dummyvar
  A     Austin        1
  A     Austin        1
  A     Austin      100
  B     Austin        2
  B     Austin        2
  B     Austin      200
  A     NY            1
  A     NY            1
  A     NY          100
  B     NY            2
  B     NY            2
  B     NY          200

and I have a user-defined function called ImbalancePoints, which is applied to dummyvar and returns the rows where it detects an abrupt change in dummyvar. The way I am doing this is as follows:

my.data.table[,
  .(imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = TRUE)))),
  by = .(city, item)  # city and item are added to the result by the by clause
]

For the NY case, let's say I get a data.table object like the following:

 item   city   imb.points
   A     NY     3,449

where imb.points is a list column whose elements are nested lists. In this example the numbers 3 and 449 denote the rows where there is an abrupt change for the case of city = NY and item = A. However, the problem I am facing is that I have approx. 3000 different items across 12 different cities, and it is taking a long time to calculate this. I was wondering if you could give me an idea of how to vectorize/speed up this calculation, since the last time I attempted it, the run took almost 2 hours and didn't finish.

I don't know if it's of any help, but I am also attaching the ImbalancePoints function:

library(pracma)

ImbalancePr <- function(eval.column) {
  n <- length(eval.column)
  imbalance <- rep(0, n)
  b_t <- rep(0, n)  # assigned but never used
  elem_diff <- diff(eval.column)
  for (i in 2:n) {
    imbalance[i] <- sign(elem_diff[i - 1]) * (elem_diff[i - 1] != 0)
    # note: the next line is parsed as a separate expression, so its value
    # is discarded; it does not extend the assignment above
    + imbalance[i - 1] * (elem_diff[i - 1] == 0)
  }
  return(imbalance)
}

ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
  bv_t <- ImbalancePr(eval.column)
  w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
  Tstar <- w0
  E0t <- Tstar
  repeat{
    Tlast <- sum(Tstar)
    nbt <- min(bkw_b, Tlast-1)
    P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
    P <- tail(P,1)
    bv_t_expected <- E0t * abs(P)
    bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
    if(max(bv_t_cumsum) < bv_t_expected){break}else{
      Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
    }
    Tlast <- Tlast + Tnew
    if(Tlast > length(eval.column)[1]){break}else{
      Tstar <- c(Tstar,Tnew)
      if(length(Tstar) <= 2){
        E0t <- mean(Tstar)
      }else{
        nt <- min(bkw_T,length(Tstar)-1)
        E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
        E0t <- tail(E0t,1)
      }
    }
  }
  return(sort(unique(Tstar)))
}

EDIT: Thanks to Paul's insight, my problem is now just to vectorize the repeat loop inside the ImbalancePoints function. However, I am not very proficient at coding and I don't see a straightforward solution. If someone could give me a suggestion, or if you know of an auxiliary function/library, I would appreciate it.

This posting consists of several sections addressing different issues:

  • Vectorizing ImbalancePr()
  • Profiling ImbalancePoints()
  • Speeding-up movavg() with Rcpp by a factor of 4

Vectorizing ImbalancePr()

I believe ImbalancePr() can be replaced by

fImbalancePr <- function(x) c(0, sign(diff(x)))

At least, it returns the same result when benchmarked (with a check of the results):

library(bench)
library(ggplot2)
bm <- press(
  n = c(10, 100, 1000, 10000),
  {
    x <- rep(0, n)
    set.seed(123)
    x[sample(n, n/5)] <- 100
    print(table(x))
    mark(
      ImbalancePr(x),
      fImbalancePr(x)
    )
  }
  
)
Running with:
      n
1    10
x
  0 100
  8   2
2   100
x
  0 100
 80  20
3  1000
x
   0 100
 800 200
4 10000
x
    0  100
 8000 2000
autoplot(bm)

[Plot: autoplot(bm) timings of ImbalancePr() vs. fImbalancePr() across vector lengths]

fImbalancePr() is always faster than the OP's original version. The speed advantage increases with vector length.

Profiling ImbalancePoints()

However, this improvement does not have much impact on the overall performance of ImbalancePoints():

library(bench)
library(ggplot2)
bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      }
    )
  }
  
)

fImbalancePoints() is a variant of ImbalancePoints() where ImbalancePr() has been replaced by fImbalancePr(), as sketched below.
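
For reference, this intermediate variant can be constructed without copying the whole function body (a minimal sketch of mine, not code from the original answer):

# clone ImbalancePoints() and swap only its first statement;
# the rest of the body stays identical to the original
fImbalancePoints <- ImbalancePoints
body(fImbalancePoints)[[2]] <- quote(bv_t <- fImbalancePr(eval.column))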

autoplot(bm)

[Plot: autoplot(bm) timings of the original vs. modified ImbalancePoints()]

There is a minor improvement but this does not help to cut down the reported execution time of 2 hours significantly.

We can use profvis to identify where the time is spent within ImbalancePoints():

library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
  for (i in 1:100) fImbalancePoints(x)
})

Timings are collected by sampling; therefore, a sufficient number of repetitions is required to get good coverage.

The results from one run are shown in this screenshot from RStudio:

[Screenshot: profvis profile of fImbalancePoints() in RStudio]

  • movavg() consumes 25% of the time spent in ImbalancePoints().
  • According to the profiling, another 20% is required for the double colon operator in pracma::movavg(). It might be worthwhile to test whether there is a speedup from loading the pracma package beforehand using library(pracma).
  • 10% is spent in calls to tail(). tail(x, 1) can be replaced by x[length(x)], which is more than an order of magnitude faster (see the quick benchmark after this list).
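
As a quick check of the tail() claim, here is a micro-benchmark sketch (my own snippet; it assumes the bench package loaded above):

x <- rnorm(1e4)
bench::mark(
  tail(x, 1),      # generic dispatch plus index arithmetic
  x[length(x)]     # direct single-element indexing
)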

If we look at the code of movavg() by typing pracma::movavg (without parentheses), we see that there is an iterative loop which cannot be vectorized:

...
else if (type == "e") {
    a <- 2/(n + 1)
    y[1] <- x[1]
    for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...

In addition, only the last value of the time series created by the call to movavg() is used. So, there might be two options for performance improvements here:

  • Choose a different weighted means function which uses only data points within a limited window (a sketch follows this list).
  • Re-implement movavg() in C++ using Rcpp.
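
To illustrate the first option, here is a minimal sketch of approximating the last EMA value from only the most recent points (my own illustration; the name ema_last_window and the window = 5 * n heuristic are assumptions, not part of the original post):

# The EMA weight of a point j steps back is a * (1 - a)^j, so points older
# than `window` carry a total weight of (1 - a)^window and can be dropped;
# renormalizing the kept weights yields the approximation.
ema_last_window <- function(x, n, window = 5L * n) {
  a <- 2 / (n + 1)
  m <- min(length(x), window)
  xs <- x[seq.int(length(x) - m + 1L, length(x))]  # last m observations
  w <- a * (1 - a)^((m - 1L):0)                    # geometric weights, oldest first
  sum(w * xs) / sum(w)                             # renormalize truncated weights
}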

Speeding-up movavg() with Rcpp

Replacing the call to pracma::movavg() and the subsequent call to tail() by an Rcpp function, we can gain a speed-up of up to a factor of 4 for ImbalancePoints() overall.

EMA_last_cpp(x, n) replaces tail(pracma::movavg(x, n, type = "e"), 1):

library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
  int nx = x.size();
  double a = 2.0 / (n + 1.0);
  double b = 1.0 - a;
  double y;

  // same recurrence as movavg(..., type = 'e'), but only the
  // last value is kept instead of materializing the whole series
  y = x[0];
  for (int k = 1; k < nx; k++) {
    y = a * x[k] + b * y;
  }

  return y;
}
")

Now, we can modify ImbalancePoints() accordingly. In addition, the call to ImbalancePr() is replaced and the code is modified in two other places (see comments):

fImbalancePoints <-
  function(eval.column, 
           w0 = 100,
           bkw_T = 10,
           bkw_b = 10) {
    # bv_t <- ImbalancePr(eval.column)
    bv_t <- c(0, sign(diff(eval.column)))
    # w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
    w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
    Tstar <- w0
    E0t <- Tstar
    repeat {
      Tlast <- sum(Tstar)
      # remove warning: 
      # In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
      if (Tlast >= length(bv_t)) break
      nbt <- min(bkw_b, Tlast - 1)
      # P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
      # P <- tail(P, 1)
      P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
      bv_t_expected <- E0t * abs(P)
      bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
      if (max(bv_t_cumsum) < bv_t_expected) {
        break
      } else{
        Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
      }
      Tlast <- Tlast + Tnew
      if (Tlast > length(eval.column)[1]) {
        break
      } else{
        Tstar <- c(Tstar, Tnew)
        if (length(Tstar) <= 2) {
          E0t <- mean(Tstar)
        } else{
          nt <- min(bkw_T, length(Tstar) - 1)
          # E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
          # E0t <- tail(E0t, 1)
          E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
        }
      }
    }
    return(sort(unique(Tstar)))
  }
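
To tie this back to the question, the faster variant drops into the grouped data.table call unchanged (a sketch; my.data.table stands for the OP's table from the question):

library(data.table)
result <- my.data.table[
  ,
  .(imb.points = list(unique(try(fImbalancePoints(dummyvar), silent = TRUE)))),
  by = .(city, item)
]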

The benchmark

library(bench)
library(ggplot2)
bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      },
      min_time = 1
    )
  }
)
bm
  expression     n      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <int> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 original      10  315.1us    369us     2318.    2.66KB     4.16  2231     4   962.49ms
2 modified      10    120us    136us     6092.  195.11KB     5.31  5733     5   940.99ms
3 original     100  583.4us    671us     1283.   55.09KB     4.16  1234     4   961.78ms
4 modified     100  145.4us    167us     5146.   47.68KB     4.19  4916     4   955.29ms
5 original    1000  438.1ms    469ms      2.17  157.37MB     4.33     3     6      1.38s
6 modified    1000   97.1ms    103ms      9.53  152.09MB    17.1     10    18      1.05s

shows that the modified version is about 3 to 5 times faster than the original. This may help the OP reduce the compute time for the production dataset from more than 2 hours by a significant factor.
