R: faster alternative of period.apply

I have the following data prepared:

Timestamp   Weighted_Value  SumVal  Group
1           1600            800     1
2           1000            1000    2
3           1000            1000    2
4           1000            1000    2
5           800             500     3
6           400             500     3
7           2000            800     4
8           1200            1000    4

I want to calculate sum(Weighted_Value) / sum(SumVal) for each group; for example, for Group 3 the result would be (800 + 400) / (500 + 500) = 1.2.

I was using period.apply to do that:

period.apply(x4, intervalIndex, function(z) sum(z[,4])/sum(z[,2]))

But it's too slow for my application, so I wanted to ask if someone knows a faster alternative. I already tried ave, but it seems to be even slower.

By the way, my goal is to calculate a time-weighted average, i.e. to transform an irregular time series into a time series with equidistant time intervals.
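For context, the period.apply pattern above usually looks something like the following minimal sketch. It assumes x4 is an xts object and that intervalIndex comes from xts::endpoints(); the timestamps, values, and column layout here are illustrative only, not the original data.

library(xts)

# hypothetical irregular series carrying the two columns needed for the ratio
idx <- as.POSIXct("2015-01-01 00:00:00") + c(10, 70, 130, 200, 310, 340, 400, 470)
x4  <- xts(cbind(SumVal         = c(800, 1000, 1000, 1000, 500, 500, 800, 1000),
                 Weighted_Value = c(1600, 1000, 1000, 1000, 800, 400, 2000, 1200)),
           order.by = idx)

# endpoints of an equi-distant grid, here one interval per minute
intervalIndex <- endpoints(x4, on = "minutes")

# one weighted average per interval: sum(Weighted_Value) / sum(SumVal)
period.apply(x4, intervalIndex, function(z) sum(z[, "Weighted_Value"]) / sum(z[, "SumVal"]))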

Thanks!

library(data.table)
setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group]

but I don't see the time series you are referring to. Check out library(zoo) for that.
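As a hedged sketch of that zoo route (the 5-minute grid and all names here are assumptions, not from the original post), an irregular series can be collapsed onto equi-distant buckets and the two sums divided afterwards:

library(zoo)

# hypothetical irregular series with the two value columns
idx <- as.POSIXct("2015-01-01 00:00:00") + c(10, 70, 130, 200, 310, 340, 400, 470)
z   <- zoo(cbind(Weighted_Value = c(1600, 1000, 1000, 1000, 800, 400, 2000, 1200),
                 SumVal         = c(800, 1000, 1000, 1000, 500, 500, 800, 1000)),
           order.by = idx)

# map every timestamp to the start of its 5-minute bucket
bucket <- function(t) as.POSIXct(cut(t, "5 min"))

# per-bucket sums, then their ratio = time-weighted average on a regular grid
aggregate(z[, "Weighted_Value"], bucket, sum) / aggregate(z[, "SumVal"], bucket, sum)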

Using rowsum seems to be faster (at least for this small example dataset) than the data.table approach:

sgibb <- function(datframe) {
  # per-group sums via rowsum(), then their ratio
  data.frame(Group = unique(datframe$Group),
             Avg = rowsum(datframe$Weighted_Value, datframe$Group) /
                   rowsum(datframe$SumVal, datframe$Group))
}
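For reference, calling it on the example data gives the same per-group averages as the other answers; the values follow directly from the table in the question:

sgibb(df)
#   Group      Avg
# 1     1 2.000000
# 2     2 1.000000
# 3     3 1.200000
# 4     4 1.777778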

Adding the rowsum approach to @plafort's benchmark:

library(microbenchmark)
library(dplyr)
library(data.table)

microbenchmark(
  Nader   = df %>%
              group_by(Group) %>%
              summarise(res = sum(Weighted_Value) / sum(SumVal)),
  Henk    = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
  plafort = weight.avg(df),
  sgibb = sgibb(df)
)
# Unit: microseconds
#     expr      min       lq      mean    median        uq      max neval
#    Nader 2179.890 2280.462 2583.8798 2399.0885 2497.6000 6647.236   100
#     Henk  648.191  693.519  788.1421  726.0940  751.0810 2386.260   100
#  plafort 2638.967 2740.541 2935.4756 2785.7425 2909.4640 5000.652   100
#    sgibb  347.125  384.830  442.6447  409.2815  441.8935 2039.563   100

Try using dplyr; it should be faster than base R:

library(dplyr)

df <- read.table(text = "Timestamp   Weighted_Value  SumVal  Group
1           1600            800     1
2           1000            1000    2
3           1000            1000    2
4           1000            1000    2
5           800             500     3
6           400             500     3
7           2000            800     4
8           1200            1000    4" , header = T)


df %>%
  group_by(Group) %>%
  summarise(res = sum(Weighted_Value) / sum(SumVal))

Here's a base R solution. It's not the fastest for larger (500k+) datasets, but it shows what may be happening "under the hood" in the other functions.

weight.avg <- function(datframe) {
  s <- split(datframe, datframe$Group)
  avg <- sapply(s, function(x) sum(x[ ,2]) / sum(x[ ,3]))
  data.frame(Group = names(avg), Avg = avg)
}

weight.avg(df)
  Group      Avg
1     1 2.000000
2     2 1.000000
3     3 1.200000
4     4 1.777778

The first line of the function splits the data frame by Group. The second applies the formula to each Group. The last creates a new data frame.

Data

df <- read.table(text = "Timestamp   Weighted_Value  SumVal  Group
                 1           1600            800     1
                 2           1000            1000    2
                 3           1000            1000    2
                 4           1000            1000    2
                 5           800             500     3
                 6           400             500     3
                 7           2000            800     4
                 8           1200            1000    4" , header = T)

Fastest Time

library(microbenchmark)
library(dplyr)
library(data.table)

microbenchmark(
  Nader   = df %>%
              group_by(Group) %>%
              summarise(res = sum(Weighted_Value) / sum(SumVal)),
  Henk    = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
  plafort = weight.avg(df)
)
Unit: microseconds
    expr      min        lq      mean   median       uq      max
   Nader 2619.174 2827.0100 3094.5570 2949.976 3107.481 7980.684
    Henk  783.186  833.7155  932.5883  888.783  944.640 3275.646
 plafort 3550.787 3772.4395 4085.2323 3853.561 3995.869 7595.801
