
R - Calculate rolling mean of previous k non-NA values

I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work, but I was wondering whether some package already provides a function doing exactly this (which would probably be much more efficient than my attempt).

An example dataset:

tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:

    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6

The first 5 elements of y are NA because the first time x has 3 previous non-NA values is on row 6, and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7, and so on.
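To pin the definition down, here is a minimal base R sketch of it. The name roll_mean_prev_k_ref is made up for illustration, and the function is written for clarity rather than speed:

```r
# Reference implementation of "mean of the previous k non-NA values".
# roll_mean_prev_k_ref is a hypothetical helper name, not from any package.
roll_mean_prev_k_ref <- function(x, k) {
  vapply(seq_along(x), function(i) {
    prev <- x[seq_len(i - 1)]      # values strictly before position i
    prev <- prev[!is.na(prev)]     # drop the NAs
    # fewer than k previous non-NA values: no mean is defined yet
    if (length(prev) < k) NA_real_ else mean(tail(prev, k))
  }, numeric(1))
}

x <- c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
roll_mean_prev_k_ref(x, 3)
# NA NA NA NA NA  2  3  4  4  4  4  5  6
```

It rescans the whole prefix at every position, so it is quadratic in the length of x; it is only meant as something to check faster versions against.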

My attempt is this:

roll_mean_previous_k <- function(x, k){
  
  require(dplyr)
  
  res                      <- NA
  lagged_vector            <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values        <- tail(lagged_vector_without_na, k)
  
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  
  res
  
}

to be used as follows (using the slide_dbl function from the slider package):

library(dplyr)

tmp.df %>% 
  mutate(
    y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
  )

which gives the desired output. However, I'm wondering if there's a ready-made and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively, but unless I'm mistaken, they seem to work on a fixed rolling window with the option of dealing with NA values (e.g. ignoring them). In my case, I want to "extend" my window to include k non-NA values.

Any thoughts/suggestions are welcome.

EDIT - SIMULATION RESULTS

Thank you to all contributors. First of all, I had not mentioned that my datasets are indeed much larger and that the code runs often, so any performance improvements are most welcome. I therefore ran the following simulation to check execution times before deciding which answer to accept. Note that some of the answers needed small tweaks to return the desired output, but if you feel that your solution is misrepresented (and is therefore less efficient than intended), feel free to let me know and I'll edit accordingly. I've used G. Grothendieck's trick from his answer below to remove the need for if-else checks on the length of the lagged, non-NA vector.
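For reference, a small sketch of what that trick does, using made-up vectors short and full: `length<-` pads a vector that is shorter than k with NA, so mean() returns NA by itself and no explicit length check is needed.

```r
k     <- 3
short <- c(1, 2)          # fewer than k values available
full  <- c(1, 2, 3, 4)    # at least k values available

# Extending a vector to length k fills the new slots with NA
`length<-`(tail(short, k), k)          # 1 2 NA

# mean() of a vector containing NA is NA, which is the desired result here
mean(`length<-`(tail(short, k), k))    # NA

# With enough values, the padding is a no-op and the usual mean comes back
mean(`length<-`(tail(full, k), k))     # 3
```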

So here's the simulation code:

library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)

set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA

# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  
  runner(
    x = z, 
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    }
  )
  
}

# Based on my initial answer (but simplified), also mentioned by GoGonzo 
f_slider <- function(z, k){
  
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    },
    .before = Inf
  )
}

# Based on helios' answer. Returns the correct results but with a warning.
f_helios <- function(z, k){
  
    reduced_vec <-  na.omit(z)
    unique_means <-  rollapply(reduced_vec, width = k, mean)
    
    start <-  which(!is.na(z))[k] + 1
    repeater <-  which(is.na(z)) + 1
    repeater_cut <-  repeater[(repeater > start-1) & (repeater <= length(z))]
    
    final <- as.numeric(rep(NA, length(z)))
    index <-  start:length(z)
    final[setdiff(index, repeater_cut)] <- unique_means
    final[(start):length(final)] <- na.locf(final)
    final
}

# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  
  rollapplyr(
    z, 
    seq_along(z), 
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    },
    k)

}

# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  
    map_dbl(
      seq_along(z), 
      ~ ifelse(
        length(tail(na.omit(z[1:(.x -1)]), k)) == k,
        mean(tail(na.omit(z[1:(.x -1)]), k)), 
        NA
        )
      )

}

# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ), 
    FUN = identical, 
    f_slider(test_vector, 10)
  )
)

# Run benchmarking #
microbenchmark(
  # f_helios(test_vector, 10),
  f_purrr(test_vector, 10),
  f_runner(test_vector, 10),
  f_slider(test_vector, 10),
  f_zoo(test_vector, 10)
)

And the results:

Unit: milliseconds
                      expr     min       lq     mean   median       uq      max neval  cld
  f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613   100   c 
 f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485  98.7239   100  b  
 f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250  32.1847   100 a   
    f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793   100    d

Based on the above, and unless the code can be further improved, it seems that the slider and runner solutions are faster. Any final suggestions are more than welcome.

Many thanks for your time!!

With runner, it is something like the mean of a 3-element tail window of the non-NA values. You can achieve the same result with slider:

library(runner)
tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x, 
  f = function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }
)

# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x, 
  function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }, 
  .before = Inf
)

tmp.df

#    x    y_runner y_slider
# 1  NA      NaN      NaN
# 2   1      1.0      1.0
# 3   2      1.5      1.5
# 4  NA      1.5      1.5
# 5   3      2.0      2.0
# 6   4      3.0      3.0
# 7   5      4.0      4.0
# 8  NA      4.0      4.0
# 9  NA      4.0      4.0
# 10 NA      4.0      4.0
# 11  6      5.0      5.0
# 12  7      6.0      6.0
# 13 NA      6.0      6.0

rollapplyr. Regarding the comment about rollmean in the question, zoo also has rollapply and rollapplyr (right aligned), and those allow different widths (and offsets) for each component of the input by specifying a vector (as we do here) or a list for width; see ?rollapply for more info. We use a relatively naive vector of widths below and also show some improved width vectors which run faster.

Operation. Create a Mean function which takes a vector, removes the last element and all NAs, and takes the last k elements of what is left, extending it to k elements with NAs as needed. Finally, take the mean of that. We use rollapplyr to apply that to x with a width of seq_along(x).

Performance improvements. With this small data the following likely won't make much difference, but if you have larger data you could try these, which might improve the speed:

  • replace na.omit with na_rm from the collapse package

  • replace the second argument of rollapplyr with the code shown here. The idea here is that the sum of the lengths of the k+1 longest runs of NA plus k+1 forms a bound on the number of elements that we need to consider. This (plus using na_rm) ran about 25% faster than the code in the question when I tried it on a problem with 1300 rows (formed from 100 copies of the data in the question) and does not add much extra code.

     pmin(with(rle(is.na(x)), sum(tail(sort(lengths[values]), k+1)))+k+1, seq_along(x))

  • replace the second argument of rollapplyr with w, where w is shown here. The idea here is to use findInterval to find the element k non-NAs back, which provides an even tighter bound. This one (plus using na_rm) ran nearly twice as fast as the code in the question when tried with the same 1300 rows, at the expense of adding 2 more lines of code.

     tt <- length(x) - rev(cumsum(rev(!is.na(x))))
     w <- seq_along(tt) - findInterval(tt - k - 1, tt)
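As a sanity check of the findInterval bound, the sketch below computes w for the example data, reading the garbled token in the width code as !is.na, and then applies the Mean function over right-aligned windows of width w[i]. The vapply loop is only a base R stand-in for rollapplyr (an assumption made so that the snippet runs without zoo), not the answer's actual call.

```r
x <- c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
k <- 3

# tt[i] is a constant plus the number of non-NA values strictly before i,
# so it is non-decreasing, as findInterval requires
tt <- length(x) - rev(cumsum(rev(!is.na(x))))

# w[i] reaches back far enough to cover the k non-NA values before position i
w <- seq_along(tt) - findInterval(tt - k - 1, tt)

# Same Mean as in the answer: drop the current element and the NAs,
# keep the last k values, pad to length k with NA, then average
Mean <- function(v, k) mean(`length<-`(tail(na.omit(head(v, -1)), k), k))

# Base R stand-in for rollapplyr(x, w, Mean, k = k)
y <- vapply(seq_along(x),
            function(i) Mean(x[(i - w[i] + 1):i], k),
            numeric(1))
y
# NA NA NA NA NA  2  3  4  4  4  4  5  6
```

On this data no window is wider than 8 elements, whereas seq_along(x) grows to 13, which is where the saving comes from on longer inputs.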

Code. With the data in the question, the code below (not using the above improvements) ran slightly faster (not a lot) than the code in the question based on my benchmarking, and it is only two lines of code.

library(dplyr)
library(zoo)

Mean <- function(x, k) mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
tmp.df %>% mutate(y = rollapplyr(x, seq_along(x), Mean, k = 3))

giving:

    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6

Since I am not aware of a ready-made way of computing your output in any standard library, I came up with the implementation roll_mean_k_efficient below, which seems to speed up your computations considerably. Note that this implementation makes use of the rollapply and na.locf functions from the zoo package.

rm(list = ls())

library("zoo")
library("rbenchmark")
library("dplyr")

x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)

# your sample (extended)
tmp.df <- data.frame(
  x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
)

# enhanced implementation
roll_mean_k_efficient <- function(x, k){
  reduced_vec = na.omit(x)
  unique_means = rollapply(reduced_vec, width=k, mean)
  
  start = which(!is.na(x))[k] + 1
  repeater = which(is.na(x)) + 1
  repeater_cut = repeater[(repeater > start-1) & (repeater <= length(x))]
  
  final <- as.numeric(rep(NA, length(x)))
  index = start:length(x)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[(start):length(final)] <- na.locf(final)
  final
}

# old implementation
roll_mean_previous_k <- function(x, k){
  res                      <- NA
  lagged_vector            <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values        <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}

# wrapper function for the benchmarking below
roll_mean_benchmark = function(){
  res = tmp.df %>% 
    mutate(
      y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
    ) 
  return(res)
}

# some benchmarking
benchmark(roll_mean_k_efficient(x = x, k=3), 
          roll_mean_benchmark(), 
          columns=c('test','elapsed','replications'),
          replications = 100)

Furthermore, I extended your example vector x to get more reliable benchmark results via the benchmark function from the rbenchmark package. In my case, the benchmark runtimes printed after running the code are:

                                 test elapsed replications
2               roll_mean_benchmark()   4.463          100
1 roll_mean_k_efficient(x = x, k = 3)   0.039          100
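The same idea (compute the k-means once on the NA-free values, then map them back to the original positions) can also be sketched in base R only. This is a variation for illustration, not the code benchmarked above: roll_mean_prev_k_vec is a hypothetical name, and it uses cumulative sums instead of rollapply and na.locf.

```r
# Vectorised sketch: means over the NA-free values, mapped back by position.
# roll_mean_prev_k_vec is a hypothetical helper name; assumes k >= 1.
roll_mean_prev_k_vec <- function(x, k) {
  ok <- !is.na(x)
  nn <- x[ok]                        # non-NA values, in order
  m  <- length(nn)
  y  <- rep(NA_real_, length(x))
  if (m < k) return(y)               # never k non-NA values available

  # means[j] = mean(nn[j:(j + k - 1)]), via differences of cumulative sums
  cs    <- cumsum(nn)
  means <- (cs[k:m] - c(0, cs[seq_len(m - k)])) / k

  # n_prev[i] = number of non-NA values strictly before position i
  n_prev <- cumsum(ok) - ok
  has_k  <- n_prev >= k
  y[has_k] <- means[n_prev[has_k] - k + 1]
  y
}

x <- c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
roll_mean_prev_k_vec(x, 3)
# NA NA NA NA NA  2  3  4  4  4  4  5  6
```

Because it relies on differences of cumulative sums, results can differ from mean() in the last few floating-point digits on long or badly scaled inputs, so compare with all.equal rather than identical.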

Without using zoo. In tidyverse fashion, you can also do it using purrr::map:


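library(dplyr)
library(purrr)

# seq_len(.x - 1) is empty for the first row, whereas 1:(.x - 1) would give c(1, 0)
tmp.df %>% mutate(y = map(seq_along(x), ~ ifelse(length(tail(na.omit(tmp.df$x[seq_len(.x - 1)]), 3)) == 3, 
                                                 mean(tail(na.omit(tmp.df$x[seq_len(.x - 1)]), 3)), 
                                                 NA)))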

    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6
