简体   繁体   中英

R - Calculate rolling mean of previous k non-NA values

I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work but was wondering if there's already a function from some package (which will probably be much more efficient than my attempt) doing exactly this. An example dataset:

tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:

    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6

The first 5 elements of y are NA s because the first time x has 3 previous non-NA values is on row 6 and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7 and so on.

My attempt is this:

roll_mean_previous_k <- function(x, k){
  
  require(dplyr)
  
  res                      <- NA
  lagged_vector            <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values        <- tail(lagged_vector_without_na, k)
  
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  
  res
  
}

to be used as follows (using the slide_dbl function from the slider package):

library(dplyr)

tmp.df %>% 
  mutate(
    y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
  )

which gives the desired output. However, I'm wondering if there's a ready-made, and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively, but unless I'm mistaken, they seem to work on a fixed rolling window with the option of dealing with NA values (eg ignoring them). In my case, I want to "extend" my window to include k non-NA values.

Any thoughts/suggestions are welcome.

EDIT - SIMULATION RESULTS

Thank you to all contributors. First of all, I had not mentioned that my datasets are indeed much larger and run often so any performance improvements are most welcome. I therefore ran the following simulation to check execution times, before deciding which answer to accept. Note, that some of the answers needed small tweaks to return the desired output, but if you feel that your solution is misrepresented (and therefore is less efficient than intended) feel free to let me know and I'll edit accordingly. I've used G. Grothendieck 's trick from his answer below, to remove the need for if - else checks regarding the length of the lagged, non-NA vector.

So here's the simulation code:

library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)

set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA

# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  
  runner(
    x = z, 
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    }
  )
  
}

# Based on my inital answer (but simplified), also mentioned by GoGonzo 
f_slider <- function(z, k){
  
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    },
    .before = Inf
  )
}

# Based on helios' answer. Return the correct results but with a warning.
f_helios <- function(z, k){
  
    reduced_vec <-  na.omit(z)
    unique_means <-  rollapply(reduced_vec, width = k, mean)
    
    start <-  which(!is.na(z))[k] + 1
    repeater <-  which(is.na(z)) + 1
    repeater_cut <-  repeater[(repeater > start-1) & (repeater <= length(z))]
    
    final <- as.numeric(rep(NA, length(z)))
    index <-  start:length(z)
    final[setdiff(index, repeater_cut)] <- unique_means
    final[(start):length(final)] <- na.locf(final)
    final
}

# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  
  rollapplyr(
    z, 
    seq_along(z), 
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    },
    k)

}

# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  
    map_dbl(
      seq_along(z), 
      ~ ifelse(
        length(tail(na.omit(z[1:(.x -1)]), k)) == k,
        mean(tail(na.omit(z[1:(.x -1)]), k)), 
        NA
        )
      )

}

# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ), 
    FUN = identical, 
    f_slider(test_vector, 10),
  )
)

# Run benchmarking #
microbenchmark(
  # f_helios(test_vector, 10),
  f_purrr(test_vector, 10),
  f_runner(test_vector, 10),
  f_slider(test_vector, 10),
  f_zoo(test_vector, 10)
)

And the results:

Unit: milliseconds
                      expr     min       lq     mean   median       uq      max neval  cld
  f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613   100   c 
 f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485  98.7239   100  b  
 f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250  32.1847   100 a   
    f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793   100    d

Based on the above, and unless the code can be further improved, it seems as the slider and runner solutions are faster. Any final suggestions are more than welcome.

Many thanks for your time!!

With runner it will be something like mean of 3-elements tail window of non-na values. You can achive the same result with slider

library(runner)
tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x, 
  f = function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }
)

# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x, 
  function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }, 
  .before = Inf
)

tmp.df

#    x    y_runner y_slider
# 1  NA      NaN      NaN
# 2   1      1.0      1.0
# 3   2      1.5      1.5
# 4  NA      1.5      1.5
# 5   3      2.0      2.0
# 6   4      3.0      3.0
# 7   5      4.0      4.0
# 8  NA      4.0      4.0
# 9  NA      4.0      4.0
# 10 NA      4.0      4.0
# 11  6      5.0      5.0
# 12  7      6.0      6.0
# 13 NA      6.0      6.0

rollapplyr. Regarding the comment about rollmean in the question, zoo also has rollappy and rollapplyr (right aligned) and those allow different widths (and offsets) for each component of the input by specifying a vector (as we do here) or list for width -- see?rollapply for more info. We use a relatively naive vector of widths below and also show some improved width vectors which run faster.

Operation Create a Mean function which takes a vector, removes the last element and all NA's and takes the last k elements of what is left extending it to k elements with NA's as needed. Finally take the mean of that. We use rollapplyr to apply that to x with a width of seq_along(x).

Performance improvements. With this small data the following likely don't make much difference but if you have larger data you could try these which might improve the speed:

  • replace na.omit with na_rm from the collapse package

  • replace the second argument of rollapplyr with the code shown here. The idea here is that the sum of the lengths of the k+1 longest runs of NA plus k+1 forms a bound on the number of elements that we need to consider. This (plus using na_rm) ran about 25% faster as the code in the question on a problem when I tried it with 1300 rows (formed from 100 copies of the data in the question) and does not add much extra code.

     pmin(with(rle(is.na(x)), sum(tail(sort(lengths[values]), k+1)))+k+1, seq_along(x))
  • replace the second argument of rollapplyr with w where w is shown here. The idea here is to use findInterval to find the element k non-NA's back which provides an even tigher bound. This one (plus using na_rm) ran nearly twice as fast as the code in the question when tried with the same 1300 rows at the expense of adding 2 more lines of code.

     tt <- length(x) - rev(cumsum(rev(.is,na(x)))) w <- seq_along(tt) - findInterval(tt - k - 1, tt)

Code. With the data in the question the code below (not using the above improvements) ran slightly faster (not a lot) than the code in the question based on my benchmarking and it is only two lines of code.

library(dplyr)
library(zoo)

Mean <- function(x, k) mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
tmp.df %>% mutate(y = rollapplyr(x, seq_along(x), Mean, k = 3))

giving:

    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6

Since I am not aware of a ready-made way of computing your output in any standard library, I came up with the implementation roll_mean_k_efficient below, which seems to speed up your computations considerably. Note that this implementation makes use of the rollapply and the na.locf methods from the zoo package.

rm(list = ls())

library("zoo")
library("rbenchmark")
library("dplyr")

x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)

# your sample (extended)
tmp.df <- data.frame(
  x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
)

# enhanced implementation
roll_mean_k_efficient <- function(x, k){
  reduced_vec = na.omit(x)
  unique_means = rollapply(reduced_vec, width=k, mean)
  
  start = which(!is.na(x))[k] + 1
  repeater = which(is.na(x)) + 1
  repeater_cut = repeater[(repeater > start-1) & (repeater <= length(x))]
  
  final <- as.numeric(rep(NA, length(x)))
  index = start:length(x)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[(start):length(final)] <- na.locf(final)
  final
}

# old implementation
roll_mean_previous_k <- function(x, k){
  res                      <- NA
  lagged_vector            <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values        <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}

# wrapper function for the benchmarking below
roll_mean_benchmark = function(){
  res = tmp.df %>% 
    mutate(
      y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
    ) 
  return(res)
}

# some benchmarking
benchmark(roll_mean_k_efficient(x = x, k=3), 
          roll_mean_benchmark(), 
          columns=c('test','elapsed','replications'),
          replications = 100)

Furthermore, I extended your example vector x to get some more reliable benchmark results via the benchmark function from the rbenchmark package. In my case the benchmark runtimes that are printed after running the code are:

                                 test elapsed replications
2               roll_mean_benchmark()   4.463          100
1 roll_mean_k_efficient(x = x, k = 3)   0.039          100

Without using zoo . In tidyverse fashion, you can also do it using purrr::map


tmp.df %>% mutate(y = map(seq_along(x), ~ ifelse(length(tail(na.omit(tmp.df$x[1:(.x -1)]), 3)) ==3, 
                                                 mean(tail(na.omit(tmp.df$x[1:(.x -1)]), 3)), 
                                                 NA)))

    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM