
Interpolate between non-NA observations

Consider observations at irregular snapshots, some of which are NA:

library(tidyverse)
library(lubridate)  # provides ymd(); older tidyverse versions do not attach it
library(tweenr)
df <- data.frame(date = c(ymd("20191201"), ymd("20191203"), ymd("20191207"), ymd("20191220")),
                 value = c(1, 2, NA, 5))

What is the cleanest way to linearly interpolate over the dates, but only between consecutive observations that both have non-NA values? (In this example, 20191201 and 20191203 are consecutive observations with non-NA values, so the dates between them should be interpolated.) I think it can be done somehow with lead or lag. This code interpolates between all values:

all_days <- data.frame(date = seq(min(df$date), max(df$date), "day"))
df %>% 
  right_join(all_days, by = "date") %>%
  arrange(date) %>%  # sort after the join so the fill runs in date order
  mutate(value = value %>% tween_fill("linear"))

We can create a new column (temp) that marks which dates should be interpolated: a row gets 1 when both its own value and the next observation's value are non-NA, and 0 otherwise. Use complete to fill in the missing sequence of dates, fill to carry temp forward, and na.approx to interpolate the values. Interpolated values are kept only where temp is 1.

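library(tidyr)
library(zoo)
library(dplyr)

df %>%
  # 1 when this value and the next observed value are both non-NA
  mutate(temp = +(!(is.na(value) | lead(is.na(value), default = TRUE)))) %>%
  complete(date = seq(min(date), max(date), by = "day")) %>%
  fill(temp) %>%
  # observed values are always kept; other rows keep the interpolation only if temp is 1
  mutate(temp = replace(temp, !is.na(value), 1),
         value = if_else(temp == 1, na.approx(value, na.rm = FALSE), NA_real_)) %>%
  select(-temp)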


# A tibble: 20 x 2
#   date       value
#   <date>     <dbl>
# 1 2019-12-01   1  
# 2 2019-12-02   1.5
# 3 2019-12-03   2  
# 4 2019-12-04  NA  
# 5 2019-12-05  NA  
# 6 2019-12-06  NA  
# 7 2019-12-07  NA  
# 8 2019-12-08  NA  
# 9 2019-12-09  NA  
#10 2019-12-10  NA  
#11 2019-12-11  NA  
#12 2019-12-12  NA  
#13 2019-12-13  NA  
#14 2019-12-14  NA  
#15 2019-12-15  NA  
#16 2019-12-16  NA  
#17 2019-12-17  NA  
#18 2019-12-18  NA  
#19 2019-12-19  NA  
#20 2019-12-20  5  
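
To make the flag easier to follow, here is a minimal base R sketch of how it comes out on the four original rows, before the date sequence is completed. The names missing, next_missing and starts_interval are made up for this illustration, and c(missing[-1], TRUE) is used as a stand-in for lead(..., default = TRUE).

```r
# Values from the question's example data.
value <- c(1, 2, NA, 5)

missing <- is.na(value)

# Base R stand-in for dplyr::lead(missing, default = TRUE):
# shift by one position and treat the row after the last one as missing.
next_missing <- c(missing[-1], TRUE)

# TRUE only where this row and the next row both have a value.
starts_interval <- !(missing | next_missing)

print(starts_interval)
```

Only the first row is flagged, because 20191201 is the only observation that is followed by another non-NA observation.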

Here is my envisioned solution. The main idea is to create a mask that determines which values will be interpolated. To create the mask, we mark a row as TRUE if both that row and the next row have non-NA values, then use complete and fill to fill in the dates in between. To finish the mask, we set the last observation of each contiguous run to TRUE.

df %>%
  mutate(has_value = !is.na(value),
         mask = lead(has_value, default = FALSE) & has_value) %>%
  complete(date = seq(min(date), max(date), by = "day"),
           fill = list(has_value = FALSE)) %>%
  fill(mask) %>%
  mutate(mask = mask | has_value,
         value = if_else(mask, value %>% tween_fill("linear"), NA_real_)) %>%
  select(-has_value, -mask)
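
The same idea can also be sketched in base R without tweenr or zoo, by walking over adjacent pairs of observations and interpolating a pair only when both values are non-NA. This is an illustrative sketch rather than part of either answer: interpolate_adjacent is a hypothetical helper name, and it assumes the dates are unique and the data frame has at least one row.

```r
# Hypothetical helper (not from the original answers): build a daily series and
# interpolate only between adjacent observations that are both non-NA.
# Assumes unique dates and columns named `date` and `value`.
interpolate_adjacent <- function(df) {
  df <- df[order(df$date), ]
  all_days <- seq(min(df$date), max(df$date), by = "day")
  out <- data.frame(date = all_days, value = NA_real_)

  # Copy the observed values (including NAs) onto the daily grid.
  out$value[match(df$date, all_days)] <- df$value

  for (i in seq_len(nrow(df) - 1)) {
    pair <- df$value[i:(i + 1)]
    if (anyNA(pair)) next  # skip intervals that touch an NA observation

    idx <- which(all_days >= df$date[i] & all_days <= df$date[i + 1])
    out$value[idx] <- approx(
      x    = as.numeric(df$date[i:(i + 1)]),
      y    = pair,
      xout = as.numeric(all_days[idx])
    )$y
  }
  out
}

df <- data.frame(
  date  = as.Date(c("2019-12-01", "2019-12-03", "2019-12-07", "2019-12-20")),
  value = c(1, 2, NA, 5)
)
res <- interpolate_adjacent(df)
head(res, 4)
```

Looping over pairs keeps the rule explicit at the cost of speed; for long series the vectorised mask approach above is likely the better fit.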
