
Vectorize function / increase calculation speed in data.table

Currently I have the following data.table:

item     city    dummyvar
  A     Austin        1
  A     Austin        1
  A     Austin      100
  B     Austin        2
  B     Austin        2
  B     Austin      200
  A     NY            1
  A     NY            1
  A     NY          100
  B     NY            2
  B     NY            2
  B     NY          200

and I have a user-defined function called ImbalancePoints, which is applied to dummyvar and returns the rows where it detects an abrupt change in dummyvar. The way I am doing this is as follows:

my.data.table[,
  .(imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = TRUE)))),
  by = .(city, item)  # city and item are added to the result by the by clause
]

For the NY case, let's say I get a data.table object like the following:

 item   city   imb.points
   A     NY     3,449

where imb.points is a list column whose elements are nested lists. In this example the numbers 3 and 449 denote the rows where there is an abrupt change for the case of city = NY and item = A. However, the problem I am facing is that I have approx. 3000 different items across 12 different cities, and it is taking a long time to calculate this. I was wondering if you could give me an idea of how to vectorize/speed up this calculation, since the last time I attempted it, the run took almost 2 hours and didn't finish.

I don't know if it's of any help, but I am also attaching the ImbalancePoints function:

library(pracma)

ImbalancePr <- function(eval.column) {
  n <- length(eval.column)
  imbalance <- rep(0, n)
  b_t <- rep(0, n)  # assigned but never used
  elem_diff <- diff(eval.column)
  for (i in 2:n) {
    imbalance[i] <- sign(elem_diff[i - 1]) * (elem_diff[i - 1] != 0)
    # note: the next line is parsed as a separate expression, so its value
    # is discarded; it does not extend the assignment above
    + imbalance[i - 1] * (elem_diff[i - 1] == 0)
  }
  return(imbalance)
}

ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
  bv_t <- ImbalancePr(eval.column)
  w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
  Tstar <- w0
  E0t <- Tstar
  repeat{
    Tlast <- sum(Tstar)
    nbt <- min(bkw_b, Tlast-1)
    P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
    P <- tail(P,1)
    bv_t_expected <- E0t * abs(P)
    bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
    if(max(bv_t_cumsum) < bv_t_expected){break}else{
      Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
    }
    Tlast <- Tlast + Tnew
    if(Tlast > length(eval.column)[1]){break}else{
      Tstar <- c(Tstar,Tnew)
      if(length(Tstar) <= 2){
        E0t <- mean(Tstar)
      }else{
        nt <- min(bkw_T,length(Tstar)-1)
        E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
        E0t <- tail(E0t,1)
      }
    }
  }
  return(sort(unique(Tstar)))
}

EDIT: Thanks to Paul's insight, my problem is now just to vectorize the repeat loop inside the ImbalancePoints function. However, I am not very proficient at coding and I don't see a straightforward solution. If someone could give me a suggestion, or if you know of an auxiliary function/library, I would appreciate it.

This posting consists of several sections addressing different issues:

  • Vectorizing ImbalancePr()
  • Profiling ImbalancePoints()
  • Speeding-up movavg() with Rcpp by a factor of 4

Vectorizing ImbalancePr()

I believe ImbalancePr() can be replaced by

fImbalancePr <- function(x) c(0, sign(diff(x)))

At least, it returns the same result when benchmarked (with a check of the results):

library(bench)
library(ggplot2)
bm <- press(
  n = c(10, 100, 1000, 10000),
  {
    x <- rep(0, n)
    set.seed(123)
    x[sample(n, n/5)] <- 100
    print(table(x))
    mark(
      ImbalancePr(x),
      fImbalancePr(x)
    )
  }
  
)
Running with:
      n
1    10
x
  0 100
  8   2
2   100
x
  0 100
 80  20
3  1000
x
   0 100
 800 200
4 10000
x
    0  100
 8000 2000
autoplot(bm)

[Plot: autoplot(bm) timings of ImbalancePr() vs. fImbalancePr() across vector lengths]

fImbalancePr() is always faster than the OP's original version. The speed advantage increases with vector length.

Profiling ImbalancePoints()

However, this improvement does not have much impact on the overall performance of ImbalancePoints():

library(bench)
library(ggplot2)
bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      }
    )
  }
  
)

fImbalancePoints() is a variant of ImbalancePoints() where ImbalancePr() has been replaced by fImbalancePr(), as sketched below.
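
For reference, this intermediate variant can be constructed without copying the whole function body (a minimal sketch of mine, not code from the original answer):

# clone ImbalancePoints() and swap only its first statement;
# the rest of the body stays identical to the original
fImbalancePoints <- ImbalancePoints
body(fImbalancePoints)[[2]] <- quote(bv_t <- fImbalancePr(eval.column))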

autoplot(bm)

[Plot: autoplot(bm) timings of the original vs. modified ImbalancePoints()]

There is a minor improvement but this does not help to cut down the reported execution time of 2 hours significantly.

We can use profvis to identify where the time is spent within ImbalancePoints():

library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
  for (i in 1:100) fImbalancePoints(x)
})

Timings are collected by sampling; therefore, a sufficient number of repetitions is required to get good coverage.

The results from one run are shown in this screenshot from RStudio:

[Screenshot: profvis profile of fImbalancePoints() in RStudio]

  • movavg() consumes 25% of the time spent in ImbalancePoints().
  • According to the profiling, another 20% is required for the double colon operator in pracma::movavg(). It might be worthwhile to test whether there is a speedup from loading the pracma package beforehand using library(pracma).
  • 10% is spent in calls to tail(). tail(x, 1) can be replaced by x[length(x)], which is more than an order of magnitude faster (see the quick benchmark after this list).
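
As a quick check of the tail() claim, here is a micro-benchmark sketch (my own snippet; it assumes the bench package loaded above):

x <- rnorm(1e4)
bench::mark(
  tail(x, 1),      # generic dispatch plus index arithmetic
  x[length(x)]     # direct single-element indexing
)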

If we look at the code of movavg() by typing pracma::movavg (without parentheses), we see that there is an iterative loop which cannot be vectorized:

...
else if (type == "e") {
    a <- 2/(n + 1)
    y[1] <- x[1]
    for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...

In addition, only the last value of the time series created by the call to movavg() is used. So, there might be two options for performance improvements here:

  • Choose a different weighted means function which uses only data points within a limited window (a sketch follows this list).
  • Re-implement movavg() in C++ using Rcpp.
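
To illustrate the first option, here is a minimal sketch of approximating the last EMA value from only the most recent points (my own illustration; the name ema_last_window and the window = 5 * n heuristic are assumptions, not part of the original post):

# The EMA weight of a point j steps back is a * (1 - a)^j, so points older
# than `window` carry a total weight of (1 - a)^window and can be dropped;
# renormalizing the kept weights yields the approximation.
ema_last_window <- function(x, n, window = 5L * n) {
  a <- 2 / (n + 1)
  m <- min(length(x), window)
  xs <- x[seq.int(length(x) - m + 1L, length(x))]  # last m observations
  w <- a * (1 - a)^((m - 1L):0)                    # geometric weights, oldest first
  sum(w * xs) / sum(w)                             # renormalize truncated weights
}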

Speeding-up movavg() with Rcpp

Replacing the call to pracma::movavg() and the subsequent call to tail() by an Rcpp function, we can gain a speed-up of up to a factor of 4 for ImbalancePoints() overall.

EMA_last_cpp(x, n) replaces tail(pracma::movavg(x, n, type = "e"), 1):

library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
  int nx = x.size();
  double a = 2.0 / (n + 1.0);
  double b = 1.0 - a;
  double y;

  // same recurrence as movavg(..., type = 'e'), but only the
  // last value is kept instead of materializing the whole series
  y = x[0];
  for (int k = 1; k < nx; k++) {
    y = a * x[k] + b * y;
  }

  return y;
}
")

Now, we can modify ImbalancePoints() accordingly. In addition, the call to ImbalancePr() is replaced and the code is modified in two other places (see comments):

fImbalancePoints <-
  function(eval.column, 
           w0 = 100,
           bkw_T = 10,
           bkw_b = 10) {
    # bv_t <- ImbalancePr(eval.column)
    bv_t <- c(0, sign(diff(eval.column)))
    # w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
    w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
    Tstar <- w0
    E0t <- Tstar
    repeat {
      Tlast <- sum(Tstar)
      # remove warning: 
      # In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
      if (Tlast >= length(bv_t)) break
      nbt <- min(bkw_b, Tlast - 1)
      # P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
      # P <- tail(P, 1)
      P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
      bv_t_expected <- E0t * abs(P)
      bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
      if (max(bv_t_cumsum) < bv_t_expected) {
        break
      } else{
        Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
      }
      Tlast <- Tlast + Tnew
      if (Tlast > length(eval.column)[1]) {
        break
      } else{
        Tstar <- c(Tstar, Tnew)
        if (length(Tstar) <= 2) {
          E0t <- mean(Tstar)
        } else{
          nt <- min(bkw_T, length(Tstar) - 1)
          # E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
          # E0t <- tail(E0t, 1)
          E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
        }
      }
    }
    return(sort(unique(Tstar)))
  }
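
To tie this back to the question, the faster variant drops into the grouped data.table call unchanged (a sketch; my.data.table stands for the OP's table from the question):

library(data.table)
result <- my.data.table[
  ,
  .(imb.points = list(unique(try(fImbalancePoints(dummyvar), silent = TRUE)))),
  by = .(city, item)
]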

The benchmark

library(bench)
library(ggplot2)
bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      },
      min_time = 1
    )
  }
)
bm
  expression     n      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <int> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 original      10  315.1us    369us     2318.    2.66KB     4.16  2231     4   962.49ms
2 modified      10    120us    136us     6092.  195.11KB     5.31  5733     5   940.99ms
3 original     100  583.4us    671us     1283.   55.09KB     4.16  1234     4   961.78ms
4 modified     100  145.4us    167us     5146.   47.68KB     4.19  4916     4   955.29ms
5 original    1000  438.1ms    469ms      2.17  157.37MB     4.33     3     6      1.38s
6 modified    1000   97.1ms    103ms      9.53  152.09MB    17.1     10    18      1.05s

shows that the modified version is about 3 to 5 times faster than the original. This may help the OP reduce the compute time for the production dataset from more than 2 hours by a significant factor.
