
linear interpolation of NA values in R

I have a table test whose NA values I would like to approximate based on linear interpolation between values that do exist.

For example, the second row plotted looks like this:

v1 <- unlist(test[2,])
plot(v1[!is.na(v1)], names(v1)[!is.na(v1)], type = "l")

[image: plot of the second row]

How would one go about interpolating/approximating the NA values along the x-axis in this case? Any suggestions in base R or dplyr would be helpful.

test
  variable 26500 30000 30100 30700 31600 33700 33800 33900 34000 34600 34800 35100 35200 35300
1      -20    NA     0    NA    NA    10    20    NA    NA    NA    30    NA    NA    NA    NA
2      -10    NA     0    NA    NA    NA    10    NA    NA    NA    20    NA    NA    NA    30
3        0     0    NA    NA    NA    NA    NA    10    NA    NA    NA    20    NA    NA    NA
4       24    NA    NA    NA     0    NA    NA    NA    NA    10    NA    NA    NA    20    NA
5       40    NA    NA     0    NA    NA    NA    NA    10    NA    NA    NA    20    NA    NA
6       55    NA    NA     0    NA    NA    NA    NA    10    NA    NA    NA    20    NA    NA
  35400 35600 35800 35900 36200 36300 36400 36700 36900 37000 37200 37800 37900 38000 38200
1    40    NA    NA    NA    50    NA    NA    NA    NA    NA    60    NA    NA    NA    70
2    NA    NA    NA    40    NA    NA    NA    50    NA    NA    NA    60    NA    NA    NA
3    NA    30    NA    NA    40    NA    NA    NA    50    NA    NA    NA    60    NA    NA
4    NA    NA    30    NA    NA    40    NA    NA    NA    50    NA    NA    NA    60    NA
5    NA    NA    30    NA    NA    40    NA    NA    NA    50    NA    NA    NA    NA    60
6    NA    NA    NA    30    NA    NA    40    NA    NA    50    NA    NA    NA    NA    60
  38800 39000 39100 39200 39700 39800 39900 40000 40200 40600 40700 40800 41700 41800
1    NA    NA    NA    80    NA    NA    NA    NA    90    NA    NA    NA   100    NA
2    70    NA    NA    NA    80    NA    NA    NA    NA    90    NA    NA   100    NA
3    70    NA    NA    NA    NA    80    NA    NA    NA    NA    90    NA   100    NA
4    NA    70    NA    NA    NA    NA    NA    80    NA    NA    NA    90   100    NA
5    NA    NA    70    NA    NA    NA    NA    80    NA    NA    NA    90    NA   100
6    NA    70    NA    NA    NA    NA    80    NA    NA    NA    NA    90   100    NA

Here is the sample data:

dput(test)
structure(list(variable = c(-20, -10, 0, 24, 40, 55), `26500` = c(NA, 
NA, 0L, NA, NA, NA), `30000` = c(0L, 0L, NA, NA, NA, NA), `30100` = c(NA, 
NA, NA, NA, 0L, 0L), `30700` = c(NA, NA, NA, 0L, NA, NA), `31600` = c(10L, 
NA, NA, NA, NA, NA), `33700` = c(20L, 10L, NA, NA, NA, NA), `33800` = c(NA, 
NA, 10L, NA, NA, NA), `33900` = c(NA, NA, NA, NA, 10L, 10L), 
    `34000` = c(NA, NA, NA, 10L, NA, NA), `34600` = c(30L, 20L, 
    NA, NA, NA, NA), `34800` = c(NA, NA, 20L, NA, NA, NA), `35100` = c(NA, 
    NA, NA, NA, 20L, 20L), `35200` = c(NA, NA, NA, 20L, NA, NA
    ), `35300` = c(NA, 30L, NA, NA, NA, NA), `35400` = c(40L, 
    NA, NA, NA, NA, NA), `35600` = c(NA, NA, 30L, NA, NA, NA), 
    `35800` = c(NA, NA, NA, 30L, 30L, NA), `35900` = c(NA, 40L, 
    NA, NA, NA, 30L), `36200` = c(50L, NA, 40L, NA, NA, NA), 
    `36300` = c(NA, NA, NA, 40L, 40L, NA), `36400` = c(NA, NA, 
    NA, NA, NA, 40L), `36700` = c(NA, 50L, NA, NA, NA, NA), `36900` = c(NA, 
    NA, 50L, NA, NA, NA), `37000` = c(NA, NA, NA, 50L, 50L, 50L
    ), `37200` = c(60L, NA, NA, NA, NA, NA), `37800` = c(NA, 
    60L, NA, NA, NA, NA), `37900` = c(NA, NA, 60L, NA, NA, NA
    ), `38000` = c(NA, NA, NA, 60L, NA, NA), `38200` = c(70L, 
    NA, NA, NA, 60L, 60L), `38800` = c(NA, 70L, 70L, NA, NA, 
    NA), `39000` = c(NA, NA, NA, 70L, NA, 70L), `39100` = c(NA, 
    NA, NA, NA, 70L, NA), `39200` = c(80L, NA, NA, NA, NA, NA
    ), `39700` = c(NA, 80L, NA, NA, NA, NA), `39800` = c(NA, 
    NA, 80L, NA, NA, NA), `39900` = c(NA, NA, NA, NA, NA, 80L
    ), `40000` = c(NA, NA, NA, 80L, 80L, NA), `40200` = c(90L, 
    NA, NA, NA, NA, NA), `40600` = c(NA, 90L, NA, NA, NA, NA), 
    `40700` = c(NA, NA, 90L, NA, NA, NA), `40800` = c(NA, NA, 
    NA, 90L, 90L, 90L), `41700` = c(100L, 100L, 100L, 100L, NA, 
    100L), `41800` = c(NA, NA, NA, NA, 100L, NA)), row.names = c(NA, 
-6L), class = "data.frame")

We could use na.interp from the forecast package:

library(forecast)
test[-1] <- t(apply(test[-1], 1, na.interp))

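Or with na.approx from the zoo package (na.rm = FALSE keeps leading and trailing NAs, so every row keeps its length):

library(zoo)
test[-1] <- t(apply(test[-1], 1, na.approx, na.rm = FALSE))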

Then do the plotting:

v1 <- unlist(test[2, -1])
plot(v1, names(v1), type = 'l')

[image: plot of the second row after interpolation]
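Note that both calls above treat the columns as equally spaced. Since the question asks for interpolation along the x-axis, and the column names look like x positions, here is a minimal base R sketch that interpolates against those positions instead. The data frame toy and the helper interp_row are made up for illustration and are not part of the original answer; the sketch assumes the column names (apart from the first column) can be converted to numbers.

```r
# Toy stand-in for `test` (hypothetical values, same layout): the first
# column is an identifier, the remaining column names are x positions.
toy <- data.frame(variable = c(-20, -10),
                  `100` = c(0, NA), `200` = c(NA, 0),
                  `400` = c(NA, NA), `500` = c(20, 30),
                  check.names = FALSE)

xpos <- as.numeric(names(toy)[-1])   # x positions taken from the column names

# Hypothetical helper: linear interpolation of one row at the positions xpos.
interp_row <- function(y, xpos) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)         # approx() needs at least two known points
  # Default rule = 1: positions outside the observed range stay NA
  approx(xpos[ok], y[ok], xout = xpos)$y
}

toy[-1] <- t(apply(toy[-1], 1, interp_row, xpos = xpos))
toy
```

In the first toy row the gap between x = 100 and x = 500 is filled with 5 and 15 rather than the evenly spaced 6.67 and 13.33, because the missing points sit at x = 200 and x = 400. For the real data the same call would be applied to test[-1]. zoo's na.approx also accepts an x argument, which should give the same result without a helper function.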

If you want to switch easily between different interpolation methods (or time series imputation methods in general) you can also use the imputeTS package.

For the requested solution this would be:

library("imputeTS")
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "linear"))

Switching to spline interpolation would look like this:

test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "spline"))

Another option could be Stineman interpolation:

test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "stine"))

Other imputation methods, such as na_ma (moving average imputation) or na_kalman (Kalman smoothing on structural time series models), are also possible if you replace na_interpolation with the corresponding function (see the package README on GitHub for an overview of the imputation functions).
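To get a feel for how much the choice of method matters without installing anything, the following sketch compares linear and spline filling on a single toy vector using base R's approx() and spline(). As far as I know these are the functions the linear and spline options build on, but treat that as an assumption; the vector y is made up and is not taken from the question's data.

```r
y  <- c(0, NA, NA, 10, NA, 20)   # toy series with gaps (hypothetical values)
x  <- seq_along(y)               # equally spaced positions 1..6
ok <- !is.na(y)

lin <- approx(x[ok], y[ok], xout = x)$y   # straight lines between known points
spl <- spline(x[ok], y[ok], xout = x)$y   # smooth curve through the same points

# Side-by-side view: both agree at the known points and differ in the gaps
round(cbind(y, lin, spl), 2)
```

Both methods reproduce the observed values exactly. Only the filled-in gaps differ, so the method matters most where gaps are long or the series curves.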

