Mean imputation by filling in missing dates and symmetrically iterating over dates up and down to find the closest available value in R

I need to insert all missing dates between the available dates for each id and then move symmetrically up and down through the dates to impute the missing prices. I do not always need the average of two values: for example, when I go 2 dates up and down and see only 1 value, I impute that value.

df1 <- data.frame(id = c(11,11,11,11,11,11,11,11),
                  Date = c("2021-06-01", "2021-06-05", "2021-06-08", "2021-06-09", "2021-06-14", "2021-06-16", "2021-06-20", "2021-06-21"),
                  price = c(NA, NA,100, NA, 50, NA, 200, NA)
)

There is an excellent solution by @lovalery for imputing missing values by symmetric iteration: "how to groupby and take mean of value by symetrically looping forward and backward on the date value in r".

That solution uses only the dates present in the data, which can be a problem when many dates are missing in between. Hence I want to insert all missing dates in between and then move symmetrically in both directions until I find at least 1 value in either direction. If I find 1 value, I need to keep it; if I find 2 values, I need their mean.
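The rule described above can be sketched in base R on a single vector. This is only an illustration under stated assumptions: `impute_nearest` is a hypothetical helper (it is not part of the linked solution), and it assumes the vector holds the prices of one id on a gap-free daily grid, so that a distance in positions equals a distance in days.

```r
# Hypothetical helper: impute each NA with the closest known value; when two
# known values are equally close (one on each side), use their mean.
# Assumes `price` holds one id's values on a gap-free daily grid.
impute_nearest <- function(price) {
  known <- which(!is.na(price))
  if (length(known) == 0) return(price)  # nothing to impute from
  out <- price
  for (i in which(is.na(price))) {
    d <- abs(known - i)                  # distance in days to each known value
    nearest <- known[d == min(d)]        # one position, or two on a tie
    out[i] <- mean(price[nearest])
  }
  out
}

impute_nearest(c(NA, 100, NA, NA, NA, 50, NA))
```

For this input the call returns 100, 100, 100, 75, 50, 50, 50: only the middle position is equally far from both known values, so it is the only one that gets the mean.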

Please find below, with a reprex, one possible solution using the data.table and padr libraries.

I built a function to make it easier to use.

Reprex

  • Code of the NA_imputations_dates() function
library(data.table)
library(padr)

NA_imputations_dates <- function(x) {
  
  setDT(x)[, Date := as.Date(Date)]
  
  x <- pad(x, interval = "day", group = "id")
  
  setDT(x)[, rows := .I]
  
  z <- x[, .I[!is.na(price)]]
  
  id_1 <- z[-length(z)]
  id_2 <- z[-1]
  
  values <- x[z, .(price = price, id = id)]
  values_1 <- values[-nrow(values)]
  names(values_1) <- c("price_1", "id_o1")
  values_2 <- values[-1]
  names(values_2) <- c("price_2", "id_o2")
  
  subtract <- z[-1] - z[-length(z)]
  
  r <- data.table(id_1, values_1, id_2, values_2, subtract)
  
  r <- r[, `:=` (id_mean = fifelse(subtract > 2 & subtract %% 2 == 0, id_1+(subtract/2), (id_1+id_2)/2),
                 mean = fifelse(subtract >= 2 & subtract %% 2 == 0 & id_o1 == id_o2, (price_1+price_2)/2, NA_real_))
         ][, `:=` (price_1 = NULL, id_1 = NULL, id_o1 = NULL, id_2 = NULL, price_2 = NULL, id_o2 = NULL, subtract = NULL)
           ][x, on = .(id_mean = rows)][, dummy := cumsum(!is.na(mean)), by = .(id)]
  
  h <-  r[, .(price = na.omit(price)), by = .(dummy)]
  
  Results <- r[, price := NULL
               ][h, on = .(dummy)
                 ][, price := fifelse(!is.na(mean), mean, price)
                   ][, `:=` (id_mean = NULL, mean = NULL, dummy = NULL)][]
  
  return(Results)
}
  • Output of the NA_imputations_dates() function
NA_imputations_dates(df1)
#>     id       Date price
#>  1: 11 2021-06-01   100
#>  2: 11 2021-06-02   100
#>  3: 11 2021-06-03   100
#>  4: 11 2021-06-04   100
#>  5: 11 2021-06-05   100
#>  6: 11 2021-06-06   100
#>  7: 11 2021-06-07   100
#>  8: 11 2021-06-08   100
#>  9: 11 2021-06-09   100
#> 10: 11 2021-06-10   100
#> 11: 11 2021-06-11    75
#> 12: 11 2021-06-12    50
#> 13: 11 2021-06-13    50
#> 14: 11 2021-06-14    50
#> 15: 11 2021-06-15    50
#> 16: 11 2021-06-16    50
#> 17: 11 2021-06-17   125
#> 18: 11 2021-06-18   200
#> 19: 11 2021-06-19   200
#> 20: 11 2021-06-20   200
#> 21: 11 2021-06-21   200
#>     id       Date price

Created on 2021-12-12 by the reprex package (v2.0.1)
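To see what the padding step contributes on its own, here is a minimal base R sketch that adds one row per missing calendar day within each id. It is an approximation written for illustration, not the padr implementation: `pad_days` and `df_example` are hypothetical names, and the helper assumes the columns are called `id` and `Date` as in the question.

```r
# Hypothetical helper: add one row per missing calendar day within each id.
# Assumes columns named `id` and `Date`; other columns are filled with NA.
pad_days <- function(df) {
  df$Date <- as.Date(df$Date)
  pieces <- lapply(split(df, df$id), function(g) {
    # Complete daily grid between the first and last date of this id
    full <- data.frame(
      id   = g$id[1],
      Date = seq(min(g$Date), max(g$Date), by = "day")
    )
    # Left join: keep every day, bring in prices where they exist
    merge(full, g, by = c("id", "Date"), all.x = TRUE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Same values as df1 in the question
df_example <- data.frame(
  id    = rep(11, 8),
  Date  = c("2021-06-01", "2021-06-05", "2021-06-08", "2021-06-09",
            "2021-06-14", "2021-06-16", "2021-06-20", "2021-06-21"),
  price = c(NA, NA, 100, NA, 50, NA, 200, NA)
)

padded <- pad_days(df_example)
nrow(padded)
```

The 8 input rows become 21 rows, one for each day from 2021-06-01 to 2021-06-21, which matches the number of rows in the output above. Only the three original prices are non-missing at this stage; the imputation is a separate step.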
