R: faster alternative to period.apply

Question

I have prepared the following data:

Timestamp   Weighted_Value  SumVal  Group
1           1600            800     1
2           1000            1000    2
3           1000            1000    2
4           1000            1000    2
5           800             500     3
6           400             500     3
7           2000            800     4
8           1200            1000    4

I want to calculate for each group sum(Weighted_Value)/sum(SumVal), so for example for Group 3 the result would be 1.2.

I was using period.apply to do that:

period.apply(x4, intervalIndex, function(z) sum(z[,4])/sum(z[,2]))

But it's too slow for my application, so I wanted to ask whether someone knows a faster alternative. I already tried ave, but it seems to be even slower.

By the way, my goal is to calculate a time-weighted average in order to convert an irregular time series into a time series with equidistant time intervals.

Thanks!

Answer 1

library(data.table)
setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group]

However, I don't see the time series you are referring to. Check out library(zoo) for that.
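
For the time-series part of the question, here is a minimal sketch of how a Group column could be derived from irregular timestamps before taking the grouped ratio. The vectors ts and val, the interval width, and the rule that each value holds until the next timestamp are all assumptions made for illustration; none of them come from the question. The sketch also assumes that no observation straddles an interval boundary (otherwise it would have to be split first).

```r
# Hypothetical irregular series: val[i] is in force from ts[i] to ts[i + 1].
ts  <- c(0, 1, 4, 6, 8, 11, 12)      # observation times (assumed data)
val <- c(10, 20, 30, 40, 50, 60)     # one value per holding period

dur   <- diff(ts)                    # holding time of each value: 1 3 2 2 3 1
width <- 4                           # assumed width of the equidistant intervals
grp   <- ts[-length(ts)] %/% width   # interval index of each period: 0 0 1 1 2 2

# Time-weighted average per interval: sum(value * duration) / sum(duration)
twa <- rowsum(val * dur, grp) / rowsum(dur, grp)
twa
```

Here val * dur plays the role of Weighted_Value and dur the role of SumVal, so any of the grouped approaches in this thread could replace the rowsum call.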

Answer 2

Using rowsum seems to be faster (at least for this small example dataset) than the data.table approach:

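sgibb <- function(datframe) {
  # rowsum() returns its rows in sorted group order, so sort the labels to match
  data.frame(Group = sort(unique(datframe$Group)),
             Avg = rowsum(datframe$Weighted_Value, datframe$Group) /
                   rowsum(datframe$SumVal, datframe$Group))
}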

Adding the rowsum approach to @plafort's benchmark:

library(microbenchmark)
library(dplyr)
library(data.table)

microbenchmark(
  Nader   = df %>%
              group_by(Group) %>%
              summarise(res = sum(Weighted_Value) / sum(SumVal)),
  Henk    = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
  plafort = weight.avg(df),
  sgibb = sgibb(df)
)
# Unit: microseconds
#     expr      min       lq      mean    median        uq      max neval
#    Nader 2179.890 2280.462 2583.8798 2399.0885 2497.6000 6647.236   100
#     Henk  648.191  693.519  788.1421  726.0940  751.0810 2386.260   100
#  plafort 2638.967 2740.541 2935.4756 2785.7425 2909.4640 5000.652   100
#    sgibb  347.125  384.830  442.6447  409.2815  441.8935 2039.563   100
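
Since the timings above come from an 8-row data frame, it may be worth repeating the comparison on data closer to the real size. The following is only a sketch under assumed conditions: the simulated data frame big, its size, the number of groups, and the helper names by_rowsum and by_split are all made up for illustration, and system.time is used so that nothing beyond base R is required. No timings are shown because they depend on the machine.

```r
set.seed(1)
n   <- 1e5                                      # assumed number of rows
big <- data.frame(
  Weighted_Value = runif(n, 400, 2000),
  SumVal         = runif(n, 500, 1000),
  Group          = sample.int(1000, n, replace = TRUE)
)

# rowsum approach: two grouped sums, then an element-wise division
by_rowsum <- function(d) {
  as.vector(rowsum(d$Weighted_Value, d$Group) / rowsum(d$SumVal, d$Group))
}

# split/apply approach: one small data frame per group
by_split <- function(d) {
  unname(vapply(split(d, d$Group),
                function(x) sum(x$Weighted_Value) / sum(x$SumVal),
                numeric(1)))
}

# Both should agree before their timings are compared
stopifnot(isTRUE(all.equal(by_rowsum(big), by_split(big))))

system.time(by_rowsum(big))
system.time(by_split(big))
```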

Answer 3

Try using dplyr; it should be faster than base R:

library(dplyr)

df <- read.table(text = "Timestamp   Weighted_Value  SumVal  Group
1           1600            800     1
2           1000            1000    2
3           1000            1000    2
4           1000            1000    2
5           800             500     3
6           400             500     3
7           2000            800     4
8           1200            1000    4", header = TRUE)


df %>%
  group_by(Group) %>%
  summarise(res = sum(Weighted_Value) / sum(SumVal))

Answer 4

Here's a base R solution. It's not the fastest for larger (500k+) datasets, but it lets you see what may be happening "under the hood" in the other functions.

weight.avg <- function(datframe) {
  s <- split(datframe, datframe$Group)
  avg <- sapply(s, function(x) sum(x[ ,2]) / sum(x[ ,3]))
  data.frame(Group = names(avg), Avg = avg)
}

weight.avg(df)
  Group      Avg
1     1 2.000000
2     2 1.000000
3     3 1.200000
4     4 1.777778

The first line of the function splits the data frame by Group. The second applies the formula to each Group. The last creates a new data frame.
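
To make those three steps concrete, here is a small sketch that exposes the intermediate objects for Group 3. It rebuilds the example data inline so that it runs on its own, and it uses column names instead of positions; the variable names s, g3, ratio3, and avg are only illustrative.

```r
df <- data.frame(
  Weighted_Value = c(1600, 1000, 1000, 1000, 800, 400, 2000, 1200),
  SumVal         = c(800, 1000, 1000, 1000, 500, 500, 800, 1000),
  Group          = c(1, 2, 2, 2, 3, 3, 4, 4)
)

# Step 1: one data frame per group, stored in a list named "1" to "4"
s <- split(df, df$Group)

# Step 2 for a single group: the two rows of Group 3 give 1200 / 1000
g3     <- s[["3"]]
ratio3 <- sum(g3$Weighted_Value) / sum(g3$SumVal)

# Step 2 for all groups at once: a named numeric vector
avg <- sapply(s, function(x) sum(x$Weighted_Value) / sum(x$SumVal))

# Step 3: collect the result in a data frame
data.frame(Group = names(avg), Avg = avg)
```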

Data

df <- read.table(text = "Timestamp   Weighted_Value  SumVal  Group
                 1           1600            800     1
                 2           1000            1000    2
                 3           1000            1000    2
                 4           1000            1000    2
                 5           800             500     3
                 6           400             500     3
                 7           2000            800     4
                 8           1200            1000    4", header = TRUE)

Fastest Time

library(microbenchmark)
library(dplyr)
library(data.table)

microbenchmark(
  Nader   = df %>%
              group_by(Group) %>%
              summarise(res = sum(Weighted_Value) / sum(SumVal)),
  Henk    = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
  plafort = weight.avg(df)
)
Unit: microseconds
    expr      min        lq      mean   median       uq      max
   Nader 2619.174 2827.0100 3094.5570 2949.976 3107.481 7980.684
    Henk  783.186  833.7155  932.5883  888.783  944.640 3275.646
 plafort 3550.787 3772.4395 4085.2323 3853.561 3995.869 7595.801

4 answers

Answer 1    3              2015-05-20 16:34:41
Answer 2    3   accepted   2015-05-20 18:46:58
Answer 3    2              2015-05-20 16:30:26
Answer 4    2              2015-05-20 17:50:36