简体   繁体   中英

R - Replace missing values with highest of 4 previous values

This is a variation of the last observation carried forward problem in a vector with some missing values. Instead of filling in NA values with the last non NA observation, I would like to fill in NA values with the highest value in the 4 observations preceding it. If all 4 observations preceding are also NA, the NA missing value should be retained. Would also appreciate it this can be done by groups in a data frame/data table.

Example:

    Original DF:     

    ID Week Value
    a  1    5
    a  2    1   
    a  3    NA  
    a  4    NA  
    a  5    3  
    a  6    4      
    a  7    NA  
    b  1    NA  
    b  2    NA  
    b  3    NA  
    b  4    NA  
    b  5    NA  
    b  6    1  
    b  7    NA

    Output DF:

    ID  Week  Value  
    a   1     5  
    a   2     1  
    a   3     5   
    a   4     5    
    a   5     3  
    a   6     4  
    a   7     4     
    b   1     NA  
    b   2     NA  
    b   3     NA  
    b   4     NA  
    b   5     NA  
    b   6     1  
    b   7     1

lag shifts the column by n steps and lets you peek at previous values. pmax is element-wise maximum and lets to pick the highest value for each set/row of the observations.

To abstract away notion of 4 and maintain vectorized performance, you may use quasiquotes from rlang: http://dplyr.tidyverse.org/articles/programming.html#quasiquotation

It can look a little cryptic at first but is very precise and expressive.


df <- readr::read_table(
"    ID Week Value
    a  1    5
    a  2    1   
    a  3    NA  
    a  4    NA  
    a  5    3  
    a  6    4      
    a  7    NA  
    b  1    NA  
    b  2    NA  
    b  3    NA  
    b  4    NA  
    b  5    NA  
    b  6    1  
    b  7    NA")

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df %>%
  group_by(ID) %>%
  mutate(
    Value = if_else(is.na(Value), pmax(lag(Value, 1), lag(Value, 2), lag(Value, 3), lag(Value, 4), na.rm = TRUE), Value)
  )
#> # A tibble: 14 x 3
#> # Groups:   ID [2]
#>       ID  Week Value
#>    <chr> <int> <int>
#>  1     a     1     5
#>  2     a     2     1
#>  3     a     3     5
#>  4     a     4     5
#>  5     a     5     3
#>  6     a     6     4
#>  7     a     7     4
#>  8     b     1    NA
#>  9     b     2    NA
#> 10     b     3    NA
#> 11     b     4    NA
#> 12     b     5    NA
#> 13     b     6     1
#> 14     b     7     1

# or if you are an rlang ninja
library(purrr)
pmax_lag_n <- function(column, n) {
  column <- enquo(column)
  1:n %>%
    map(~quo(lag(!!column, !!.x))) %>%
    { quo(pmax(!!!., na.rm = TRUE)) }
}

df %>%
  group_by(ID) %>%
  mutate(Value = if_else(is.na(Value), !!pmax_lag_n(Value, 4), Value))
#> # A tibble: 14 x 3
#> # Groups:   ID [2]
#>       ID  Week Value
#>    <chr> <int> <int>
#>  1     a     1     5
#>  2     a     2     1
#>  3     a     3     5
#>  4     a     4     5
#>  5     a     5     3
#>  6     a     6     4
#>  7     a     7     4
#>  8     b     1    NA
#>  9     b     2    NA
#> 10     b     3    NA
#> 11     b     4    NA
#> 12     b     5    NA
#> 13     b     6     1
#> 14     b     7     1

Define function Max which accepts a vector x and returns NA if all its elements are NA. Otherwise, if the last value is NA it returns the maximum of all non-NA elements and if the last value is not NA then it returns it.

Also define na.max which runs Max on a rolling window of length n (given by the second argument to na.max -- default 5).

Finally apply na.max to Value by ID using ave .

library(zoo)

Max <- function(x) {
  last <- tail(x, 1)
  if (all(is.na(x))) NA
  else if (is.na(last)) max(x, na.rm = TRUE)
       else last
}

na.max <- function(x, n = 5) rollapplyr(x, n, Max, partial = TRUE)

transform(DF, Value = ave(Value, ID, FUN = na.max))

giving:

   ID Week Value
1   a    1     5
2   a    2     1
3   a    3     5
4   a    4     5
5   a    5     3
6   a    6     4
7   a    7     4
8   b    1    NA
9   b    2    NA
10  b    3    NA
11  b    4    NA
12  b    5    NA
13  b    6     1
14  b    7     1

Note: Input DF in reproducible form:

Lines <- "
ID Week Value
    a  1    5
    a  2    1   
    a  3    NA  
    a  4    NA  
    a  5    3  
    a  6    4      
    a  7    NA  
    b  1    NA  
    b  2    NA  
    b  3    NA  
    b  4    NA  
    b  5    NA  
    b  6    1  
    b  7    NA"
DF <- read.table(text = Lines, header = TRUE)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM