Currently I have the following data.table:
item city dummyvar
A Austin 1
A Austin 1
A Austin 100
B Austin 2
B Austin 2
B Austin 200
A NY 1
A NY 1
A NY 100
B NY 2
B NY 2
B NY 200
and I have a user-defined function called ImbalancePoints
, which is applied to dummyvar
and it returns the rows where it detects an abrupt change in dummyvar
. The way I am doing this is as follows:
my.data.table[,
.(item, city , imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = T))) ),
by = .(city, item)
]
And for the NY
case lets say that I get a data.table
object like the following:
item city imb.points
A NY 3,449
where the column imb.points
is a column with nested lists as its elements, and for this example the numbers 3 and 449 denote the rows where there is an abrupt change for the case of city = NY
and item = A
. However the problem that I am facing is that I have approx. 3000 different items for 12 different cities, and it is taking a long time to calculate this. I was wondering if you could give me an idea of how to vectorize/speed up this calculation since the last time that I attempted this it took almost 2 hours and it didn't finish.
I don't know if its of any help but I am also attaching the ImbalancePoints
function:
library(pracma)
ImbalancePr <- function(eval.column) {
n <- length(eval.column)
imbalance <- rep(0, n)
b_t = rep(0,n)
elem_diff <- diff(eval.column)
for(i in 2:n)
{
imbalance[i] <- sign(elem_diff[i-1]) * (elem_diff[i-1] != 0)
+ imbalance[i-1]*(elem_diff[i-1] == 0)
}
return(imbalance)
}
ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
bv_t <- ImbalancePr(eval.column)
w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
Tstar <- w0
E0t <- Tstar
repeat{
Tlast <- sum(Tstar)
nbt <- min(bkw_b, Tlast-1)
P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
P <- tail(P,1)
bv_t_expected <- E0t * abs(P)
bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
if(max(bv_t_cumsum) < bv_t_expected){break}else{
Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
}
Tlast <- Tlast + Tnew
if(Tlast > length(eval.column)[1]){break}else{
Tstar <- c(Tstar,Tnew)
if(length(Tstar) <= 2){
E0t <- mean(Tstar)
}else{
nt <- min(bkw_T,length(Tstar)-1)
E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
E0t <- tail(E0t,1)
}
}
}
return(sort(unique(Tstar)))
}
EDIT : Thanks to Paul insight then my problem is just to vectorize the repeat loop inside the ImbalancePoints
function. However I am not very proficient coding and I don't see a straightforward solution to it. If someone could give me a suggestion or if you know about an auxiliary function/library I will appreciate it.
This posting consist of several sections addressing different issues:
ImbalancePr()
ImbalancePoints()
movavg()
with Rcpp
by a factor of 4ImbalancePr()
I believe ImbalancePr()
can be replaced by
fImbalancePr <- function(x) c(0, sign(diff(x)))
At least, it returns the same result, wenn benchmarked (with check of results):
library(bench)
library(ggplot2)
bm <- press(
n = c(10, 100, 1000, 10000),
{
x <- rep(0, n)
set.seed(123)
x[sample(n, n/5)] <- 100
print(table(x))
mark(
ImbalancePr(x),
fImbalancePr(x)
)
}
)
Running with: n 1 10 x 0 100 8 2 2 100 x 0 100 80 20 3 1000 x 0 100 800 200 4 10000 x 0 100 8000 2000
autoplot(bm)
fImbalancePr()
is always faster than OP's original version. The speed advantage increases with vector length.
ImbalancePoints()
However, this improvement does not have much impact on the overall performance of ImbalancePoints()
:
library(bench)
library(ggplot2)
bm <- press(
n = c(10L, 100L, 1000L),
{
x <- replace(rep(0, n), n, 100)
y <- c(rep(2, n), rep(-3, n), rep(5, n))
mark(
original = {
list(
ImbalancePoints(x),
ImbalancePoints(y)
)
},
modified = {
list(
fImbalancePoints(x),
fImbalancePoints(y)
)
}
)
}
)
fImbalancePoint()
is a variant of ImbalancePoint()
where ImbalancePr()
has been replaced by fImbalancePr()
.
autoplot(bm)
There is a minor improvement but this does not help to cut down the reported execution time of 2 hours significantly.
We can use profvis
to identify where the time is spent within ImbalancePoints()
:
library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
for (i in 1:100) ffImbalancePoints(x)
})
Timings are collected by sampling, therefore a sufficient number of repetitions is required to get a good coverage.
The results from one run are shown in this screenshot from RStudio:
movavg()
consumes 25% of the time spent in ImbalancePoints()
. pracma::movavg()
. It might be worthwhile to test if there is a speedup from loading the pracma
paackage beforehand using library(pracma)
.tail()
. tail(x, 1)
can be replaced by x[length(x)]
which is more than a magnitude faster. If we look at code of movavg()
by typing pracma::movavg
(without parentheses) we see that there is a iterative loop which cannot be vectorized:
...
else if (type == "e") {
a <- 2/(n + 1)
y[1] <- x[1]
for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...
In addition, only the last value of the time series created by the call to movavg()
is used. So, there might be two options for performance improvements here:
movavg()
in C++ using Rcpp
.movavg()
with Rcpp
Replacing the call to pracma::movavg()
and the subsequent call to tail()
by on Rcpp
function we can gain a speed-up up to a factor of 4 for ImbalancePoints()
overall.
EMA_last_cpp(x, n)
replaces tail(pracma::movavg(x, n, type = "e"), 1)
library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
int nx = x.size();
double a = 2.0 / (n + 1.0);
double b = 1.0 - a;
double y;
y = x[0];
for(int k = 1; k < nx; k++){
y = a * x[k] + b * y;
}
return y;
}
")
Now, we can modify ImbalancePoints()
accordingly. In addition, the call to ImbalancePr()
is replaced and the code is modified in two other places (see comments):
fImbalancePoints <-
function(eval.column,
w0 = 100,
bkw_T = 10,
bkw_b = 10) {
# bv_t <- ImbalancePr(eval.column)
bv_t <- c(0, sign(diff(eval.column)))
# w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
Tstar <- w0
E0t <- Tstar
repeat {
Tlast <- sum(Tstar)
# remove warning:
# In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
if (Tlast >= length(bv_t)) break
nbt <- min(bkw_b, Tlast - 1)
# P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
# P <- tail(P, 1)
P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
bv_t_expected <- E0t * abs(P)
bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
if (max(bv_t_cumsum) < bv_t_expected) {
break
} else{
Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
}
Tlast <- Tlast + Tnew
if (Tlast > length(eval.column)[1]) {
break
} else{
Tstar <- c(Tstar, Tnew)
if (length(Tstar) <= 2) {
E0t <- mean(Tstar)
} else{
nt <- min(bkw_T, length(Tstar) - 1)
# E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
# E0t <- tail(E0t, 1)
E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
}
}
}
return(sort(unique(Tstar)))
}
The benchmark
library(bench)
library(ggplot2)
bm <- press(
n = c(10L, 100L, 1000L),
{
x <- replace(rep(0, n), n, 100)
y <- c(rep(2, n), rep(-3, n), rep(5, n))
mark(
original = {
list(
ImbalancePoints(x),
ImbalancePoints(y)
)
},
modified = {
list(
fImbalancePoints(x),
fImbalancePoints(y)
)
},
min_time = 1
)
}
)
bm
expression n min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time <bch:expr> <int> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> 1 original 10 315.1us 369us 2318. 2.66KB 4.16 2231 4 962.49ms 2 modified 10 120us 136us 6092. 195.11KB 5.31 5733 5 940.99ms 3 original 100 583.4us 671us 1283. 55.09KB 4.16 1234 4 961.78ms 4 modified 100 145.4us 167us 5146. 47.68KB 4.19 4916 4 955.29ms 5 original 1000 438.1ms 469ms 2.17 157.37MB 4.33 3 6 1.38s 6 modified 1000 97.1ms 103ms 9.53 152.09MB 17.1 10 18 1.05s
shows that the modified version is about 3 to 5 times faster than the original version. This may help the OP to reduce the compute time for his production dataset from 2+ hours by a significant factor.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.