R: faster alternative to period.apply
I have the following data prepared:
Timestamp Weighted_Value SumVal Group
1 1600 800 1
2 1000 1000 2
3 1000 1000 2
4 1000 1000 2
5 800 500 3
6 400 500 3
7 2000 800 4
8 1200 1000 4
I want to calculate sum(Weighted_Value) / sum(SumVal) for each group; for example, for Group 3 the result would be (800 + 400) / (500 + 500) = 1.2.
I was using period.apply to do that:
period.apply(x4, intervalIndex, function(z) sum(z[,4])/sum(z[,2]))
But it's too slow for my application, so I wanted to ask if someone knows a faster alternative? I already tried ave, but it seems to be even slower.
By the way, my goal is to calculate a time-weighted average, in order to convert an irregular time series into a time series with equidistant time intervals.
Thanks!
library(data.table)
setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group]
but I don't see the time series you are referring to. Check out library(zoo) for that.
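Since the stated goal is a time-weighted average on an equidistant grid, here is a hedged sketch of that last step using the same data.table idiom, on the sample data from the question. The bucket width `width` is an assumption (the question does not say how wide the target intervals are):

```r
library(data.table)

# Sample data from the question
df <- data.frame(Timestamp      = 1:8,
                 Weighted_Value = c(1600, 1000, 1000, 1000, 800, 400, 2000, 1200),
                 SumVal         = c(800, 1000, 1000, 1000, 500, 500, 800, 1000))

width <- 2  # assumed length of the target equidistant intervals
setDT(df)[, bucket := (Timestamp - 1) %/% width]  # assign each row to a time bucket
res <- df[, .(twavg = sum(Weighted_Value) / sum(SumVal)), by = bucket]
res
```

With `width = 2`, timestamps 5 and 6 fall into one bucket and give (800 + 400) / (500 + 500) = 1.2, matching the Group 3 example above.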
Using rowsum seems to be faster (at least for this small example dataset) than the data.table approach:
sgibb <- function(datframe) {
  data.frame(Group = unique(datframe$Group),
             Avg = rowsum(datframe$Weighted_Value, datframe$Group) /
                   rowsum(datframe$SumVal, datframe$Group))
}
Adding the rowsum approach to @plafort's benchmark:
library(microbenchmark)
library(dplyr)
library(data.table)
microbenchmark(
Nader = df %>%
group_by(Group) %>%
summarise(res = sum(Weighted_Value) / sum(SumVal)),
Henk = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
plafort = weight.avg(df),
sgibb = sgibb(df)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Nader 2179.890 2280.462 2583.8798 2399.0885 2497.6000 6647.236 100
# Henk 648.191 693.519 788.1421 726.0940 751.0810 2386.260 100
# plafort 2638.967 2740.541 2935.4756 2785.7425 2909.4640 5000.652 100
# sgibb 347.125 384.830 442.6447 409.2815 441.8935 2039.563 100
Try using dplyr; it should be faster than base R:
library(dplyr)
df <- read.table(text = "Timestamp Weighted_Value SumVal Group
1 1600 800 1
2 1000 1000 2
3 1000 1000 2
4 1000 1000 2
5 800 500 3
6 400 500 3
7 2000 800 4
8 1200 1000 4" , header = T)
df %>%
group_by(Group) %>%
summarise(res = sum(Weighted_Value) / sum(SumVal))
Here's a base R solution. It's not the fastest for larger (500k+) datasets, but it lets you see what may be happening "under the hood" in the other functions.
weight.avg <- function(datframe) {
s <- split(datframe, datframe$Group)
avg <- sapply(s, function(x) sum(x[ ,2]) / sum(x[ ,3]))
data.frame(Group = names(avg), Avg = avg)
}
weight.avg(df)
Group Avg
1 1 2.000000
2 2 1.000000
3 3 1.200000
4 4 1.777778
The first line of the function splits the data frame by Group. The second applies the formula to each group. The last creates a new data frame.
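To see the intermediate pieces those steps produce, here is a hedged illustration on the sample data, run outside the function:

```r
# Sample data from the question
df <- data.frame(Timestamp      = 1:8,
                 Weighted_Value = c(1600, 1000, 1000, 1000, 800, 400, 2000, 1200),
                 SumVal         = c(800, 1000, 1000, 1000, 500, 500, 800, 1000),
                 Group          = c(1, 2, 2, 2, 3, 3, 4, 4))

s <- split(df, df$Group)   # a named list, one data frame per group
s[["3"]]                   # the two rows belonging to Group 3

# The sapply step collapses each piece to a single ratio
avg <- sapply(s, function(x) sum(x$Weighted_Value) / sum(x$SumVal))
avg
```

Using column names instead of positional indices (`x$Weighted_Value` rather than `x[ ,2]`) makes the step easier to read but is otherwise equivalent.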
df <- read.table(text = "Timestamp Weighted_Value SumVal Group
1 1600 800 1
2 1000 1000 2
3 1000 1000 2
4 1000 1000 2
5 800 500 3
6 400 500 3
7 2000 800 4
8 1200 1000 4" , header = T)
library(microbenchmark)
library(dplyr)
library(data.table)
microbenchmark(
Nader = df %>%
group_by(Group) %>%
summarise(res = sum(Weighted_Value) / sum(SumVal)),
Henk = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
plafort = weight.avg(df)
)
Unit: microseconds
expr min lq mean median uq max
Nader 2619.174 2827.0100 3094.5570 2949.976 3107.481 7980.684
Henk 783.186 833.7155 932.5883 888.783 944.640 3275.646
plafort 3550.787 3772.4395 4085.2323 3853.561 3995.869 7595.801