
Interpolate values in a dataframe column inplace using dplyr

I'm working with a dataframe that has some missing data, and I need to interpolate the empty values, using linear interpolation.

Although I know I can do this with a loop, I'd like to do it using dplyr (for consistency and readability, and because I know that loops are awfully ugly in R).

Here's an example of what I am trying to do:

library(dplyr)

data.raw <- tibble(x=c(66, 67, 68, 69, 70, 72, 73, 75, 93), 
                   S=c(0.11755811, 0.11648940, 0.11542069, 0.11434199, 
                       0.11218459, 0.10996312, 0.10884104, 0.10767071, 
                       0.09228918))
# As you can see, there are some "holes" in the data. For example, the value
# for x = 71 is missing.

# I've created a new dataframe with all the values for x as this:
data.proc <- tibble(x=66:(data.raw %>% select(x) %>% pull() %>% max)) %>% 
  left_join(data.raw, by='x')

# Here's my non-optimal 'for' solution:
for(x_ in data.proc$x) {
  if(is.na(data.proc$S[data.proc$x == x_])) {
    # Only rows with a known S can serve as interpolation endpoints;
    # otherwise the right neighbour inside a long gap (e.g. 76..92) is NA
    known <- !is.na(data.proc$S)
    # Nearest known x below and above x_
    x.0 <- max(data.proc$x[known & data.proc$x < x_])
    x.1 <- min(data.proc$x[known & data.proc$x > x_])
    S.0 <- data.proc$S[data.proc$x == x.0]
    S.1 <- data.proc$S[data.proc$x == x.1]
    # Calculate the slope
    m <- (S.1 - S.0) / (x.1 - x.0)
    # Set the new value
    data.proc$S[data.proc$x == x_] <- m * (x_ - x.0) + S.0
  }
}

So, my question is: is there a way to do this directly with dplyr? So far my Google-fu is failing me :(

You can use approx (from base R's stats package):

library(tidyverse)
left_join(tibble(x = seq(min(data.raw$x), max(data.raw$x))), data.raw, by = "x") %>%
    mutate(S = if_else(is.na(S), approx(x, S, x)$y, S))
## A tibble: 28 x 2
#       x     S
#   <dbl> <dbl>
# 1    66 0.118
# 2    67 0.116
# 3    68 0.115
# 4    69 0.114
# 5    70 0.112
# 6    71 0.111
# 7    72 0.110
# 8    73 0.109
# 9    74 0.108
#10    75 0.108
## … with 18 more rows

This assumes that (1) x is the set of integer values between min(data.raw$x) and max(data.raw$x), and (2) you only want to interpolate values within that interval (not extrapolate, in which case you'd want to use something like lm).
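To make those two assumptions concrete, here is a minimal base-R sketch that calls approx directly on the question's values, without the join. The names x_full, interp, data.filled and outside are only illustrative. It also shows that approx returns NA for points outside the observed range by default (rule = 1), which is why extrapolation needs a different tool.

```r
# Known points, copied from the question's data.raw
x <- c(66, 67, 68, 69, 70, 72, 73, 75, 93)
S <- c(0.11755811, 0.11648940, 0.11542069, 0.11434199,
       0.11218459, 0.10996312, 0.10884104, 0.10767071,
       0.09228918)

# Full integer grid covering the observed range of x
x_full <- seq(min(x), max(x))

# Linear interpolation at every grid point; known points are reproduced as-is
interp <- approx(x, S, xout = x_full)
data.filled <- data.frame(x = interp$x, S = interp$y)

# Points outside [min(x), max(x)] are not extrapolated: approx gives NA
outside <- approx(x, S, xout = c(60, 100))$y
```

The interpolated value at x = 71 is simply the midpoint of the known values at 70 and 72, since 71 lies halfway between them.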

We can use complete from tidyr to fill in the missing values of x, and na.approx from zoo to interpolate the NA values in S.

library(dplyr)
library(tidyr)

data.raw %>% complete(x = seq(min(x), max(x))) %>% mutate(S = zoo::na.approx(S))

# A tibble: 28 x 2
#      x     S
#   <dbl> <dbl>
# 1    66 0.118
# 2    67 0.116
# 3    68 0.115
# 4    69 0.114
# 5    70 0.112
# 6    71 0.111
# 7    72 0.110
# 8    73 0.109
# 9    74 0.108
#10    75 0.108
# … with 18 more rows
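Two details of na.approx may matter on other data, so here is a hedged sketch using a small hypothetical tibble (data.gap, not from the question) whose first S value is missing. By default na.approx drops leading and trailing NAs (na.rm = TRUE), which would return a shorter vector than mutate expects; na.rm = FALSE keeps them as NA. Passing x = x makes the interpolation use the actual x values rather than row positions, which only coincide here because complete produced an evenly spaced grid.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the S value for the first x is unknown
data.gap <- tibble(x = c(66, 67, 70, 72),
                   S = c(NA, 0.1165, 0.1122, 0.1100))

data.filled <- data.gap %>%
  # Add the missing x values (68, 69, 71) with S = NA
  complete(x = seq(min(x), max(x))) %>%
  # Interpolate against x; keep the leading NA instead of dropping it
  mutate(S = zoo::na.approx(S, x = x, na.rm = FALSE))
```

The leading NA stays NA because there is no known point to its left to interpolate from.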
