
Fast rolling mean + summarize

In R, I am trying to compute a very fast rolling mean of a large vector (up to 400k elements) using different window widths, and then, for each window width, summarize the data by the maximum of each year. The example below will hopefully make this clear. I have tried several approaches, and the fastest so far seems to be using roll_mean from the package RcppRoll for the running mean, and aggregate for picking the maximum. Please note that memory requirement is a concern: the version below requires very little memory, since it does a single rolling mean and aggregation at a time; this is preferred.

#Example data frame of 100k measurements from 2001 to 2014
library(RcppRoll)
n <- 100000
df <- data.frame(rawdata=rnorm(n),
                 year=sort(sample(2001:2014, size=n, replace=TRUE))
                 ) 

ww <- 1:120 #Vector of window widths

dfsumm <- as.data.frame(matrix(nrow=14, ncol=121))
dfsumm[,1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))

system.time(for (i in 1:length(ww)) {
  #Do the rolling mean for this ww
  df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
  #Aggregate maxima for each year
  dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
}) #28s on my machine
dfsumm

This gives the desired output: a data.frame with 14 rows (the years 2001 to 2014) and 121 columns (the year plus the 120 window widths), containing the maximum for each ww and for each year.

However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely dplyr and data.table, but I've been unable to find anything faster, due to my lack of knowledge of those packages.

What would be the fastest way to do this, using a single core (the code is already parallelized elsewhere)?

Memory management, i.e. allocation and copies, is what is killing you with your approach.

Here is a data.table approach, which assigns by reference:

library(data.table)
library(RcppRoll)
setDT(df)
alloc.col(df, 200) #allocate sufficient columns

#assign rolling means in a loop
for (i in seq_along(ww)) 
  set(df, j = paste0("D", i),  value = roll_mean(df[["rawdata"]], 
                                        ww[i], na.rm=TRUE, fill=NA))

dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year] #aggregate
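Note that `.SD` in the aggregation above includes every non-grouping column, `rawdata` included. If `df` carries columns you don't want summarized, `.SDcols` can restrict the aggregation to the rolling-mean columns only. A minimal standalone sketch (the toy table `dt` and its columns are made up for illustration; `patterns()` in `.SDcols` needs data.table >= 1.12.0):

```r
library(data.table)

# Toy table: two rolling-mean-style columns D1/D2 plus a stray column
dt <- data.table(year  = rep(2001:2002, each = 3),
                 D1    = 1:6,
                 D2    = 6:1,
                 other = letters[1:6])

# Aggregate only the D* columns, leaving 'other' out of .SD
res <- dt[, lapply(.SD, max, na.rm = TRUE), by = year,
          .SDcols = patterns("^D")]
res
```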

Using the new frollmean function (added in data.table v1.12.0), you can do the following:

th = setDTthreads(1L)
df[, paste0("D",ww) := frollmean(rawdata, ww, na.rm=TRUE)]
dfsumm <- df[, lapply(.SD, max, na.rm=TRUE), by=year]
setDTthreads(th)
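Since the question stresses memory, here is a hedged variant of the frollmean approach that processes one window width at a time, so only a single temporary column exists at any moment (a sketch on a small made-up data set; variable names mirror the question's):

```r
library(data.table)

set.seed(1)
n  <- 1000
ww <- 1:5
df <- data.table(rawdata = rnorm(n),
                 year    = sort(sample(2001:2014, n, replace = TRUE)))

# Pre-allocate the result, then fill one window width at a time;
# only the single 'tmp' column is ever held in addition to df
dfsumm <- data.table(year = sort(unique(df$year)))
for (w in ww) {
  df[, tmp := frollmean(rawdata, w, na.rm = TRUE)]
  dfsumm[, paste0("D=", w) := df[, max(tmp, na.rm = TRUE), by = year]$V1]
}
df[, tmp := NULL]  # drop the scratch column
```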

You should consider shifting your parallelism down a level, as this use case is well parallelized inside frollmean. The grouping operation also utilizes parallel processing.

One performance issue you can create is dynamically growing a vector using cbind. You could instead allocate the expected size beforehand, and populate it later using dfsumm[x] <- y.
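The pre-allocate-then-fill pattern can be sketched in base R as follows (a small made-up example; tapply stands in for the question's aggregate call, and RcppRoll::roll_mean is assumed to be installed):

```r
set.seed(1)
n  <- 1000
ww <- 1:5
df <- data.frame(rawdata = rnorm(n),
                 year    = sort(sample(2001:2014, n, replace = TRUE)))

# Allocate the full result once, then overwrite columns in place
# instead of growing it with cbind
dfsumm <- data.frame(year = 2001:2014)
for (i in seq_along(ww)) {
  tmp <- RcppRoll::roll_mean(df$rawdata, ww[i], na.rm = TRUE, fill = NA)
  dfsumm[[paste0("D=", ww[i])]] <- tapply(tmp, df$year, max, na.rm = TRUE)
}
```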
