Fast rolling mean + summarize
In R, I am trying to do a very fast rolling mean of a large vector (up to 400k elements) using different window widths, and then, for each window width, summarize the data by the maximum of each year. The example below will hopefully make this clear. I have tried several approaches, and the fastest so far seems to be roll_mean from the package RcppRoll for the running mean, combined with aggregate for picking the maximum. Please note that memory requirement is a concern: the version below needs very little memory, since it does a single rolling mean and aggregation at a time; this is preferred.
library(RcppRoll)

#Example data frame of 100k measurements from 2001 to 2014
n <- 100000
df <- data.frame(rawdata = rnorm(n),
                 year = sort(sample(2001:2014, size = n, replace = TRUE)))

ww <- 1:120 #Vector of window widths
dfsumm <- as.data.frame(matrix(nrow = 14, ncol = 121))
dfsumm[, 1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))

system.time(for (i in seq_along(ww)) {
  #Do the rolling mean for this window width
  df$tmp <- roll_mean(df$rawdata, ww[i], na.rm = TRUE, fill = NA)
  #Aggregate the maxima for each year
  dfsumm[, i + 1] <- aggregate(tmp ~ year, data = df, max)[, 2]
}) #28s on my machine
dfsumm
This gives the desired output: a data.frame with 14 rows (the years 2001 to 2014) and 121 columns (the year plus the 120 window widths), containing the maximum for each window width and each year.
However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely dplyr and data.table, but I've been unable to find anything faster due to my lack of knowledge of those packages.
Which would be the fastest way to do this using a single core (the code is already parallelized elsewhere)?
Memory management, i.e. allocation and copies, is killing you with your approach. Here is a data.table approach, which assigns by reference:
library(data.table)
library(RcppRoll)

setDT(df)
alloc.col(df, 200) #over-allocate sufficient column slots (setalloccol() in newer data.table)

#Assign the rolling means in a loop, by reference
for (i in seq_along(ww))
  set(df, j = paste0("D", i),
      value = roll_mean(df[["rawdata"]], ww[i], na.rm = TRUE, fill = NA))

dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year] #aggregate
Using the new frollmean function (added in data.table v1.12.0) you can do the following:
th = setDTthreads(1L)
df[, paste0("D",ww) := frollmean(rawdata, ww, na.rm=TRUE)]
dfsumm <- df[, lapply(.SD, max, na.rm=TRUE), by=year]
setDTthreads(th)
You should consider shifting your parallelism down, as this use case is well parallelized inside frollmean. The grouping operation also uses parallel processing.
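Conversely, if the surrounding code were not parallelized, data.table's internal threading could be given all cores instead of being pinned to one. A minimal sketch (passing 0L to setDTthreads means "use all logical CPUs data.table detects"):

```r
library(data.table)

# Let data.table use every logical CPU it detects (0 = all);
# setDTthreads() returns the previous setting so it can be restored later.
old <- setDTthreads(0L)
nthreads <- getDTthreads()

setDTthreads(old)  # restore the previous thread count
```

getDTthreads() reports how many threads frollmean and grouped operations will use under the current setting.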
One performance issue you can create is dynamically growing an object using cbind. You should instead allocate the expected size beforehand and later populate it using dfsumm[x] <- y.
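To illustrate the point: each cbind() call copies every existing column into a new object, whereas filling a pre-allocated matrix writes in place. A minimal sketch with placeholder values (the runif() call stands in for the real per-year maxima):

```r
years <- 2001:2014
ww <- 1:120

# Allocate the full result up front: one row per year, one column per width
dfsumm <- as.data.frame(matrix(NA_real_, nrow = length(years),
                               ncol = length(ww) + 1))
colnames(dfsumm) <- c("year", paste0("D=", ww))
dfsumm[, 1] <- years

# Populate column by column; no growing, hence no repeated copies
for (i in seq_along(ww)) {
  dfsumm[, i + 1] <- runif(length(years))  # placeholder for the aggregated maxima
}
```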