R - Impute missing values by group (linear / moving average)

Question

I have a large dataset with a lot of missing values, and I want to impute them within each "name" group, either by linear interpolation or with a moving average.

d <-  data.frame(
  name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

> d
   name year  V  W  X  Y  Z
1     a    1 NA 11 11 11 NA
2     a    2 21 NA 21 21 NA
3     a    3 31 31 NA 31 31
4     a    4 41 41 41 NA 41
5     b    1 11 11 NA NA 11
6     b    2 NA 21 21 21 NA
7     b    3 NA NA NA 31 31
8     b    4 41 NA 41 NA NA
9     c    1 NA NA 11 NA NA
10    c    2 NA NA NA 21 NA
11    c    3 NA 31 NA NA NA
12    c    4 41 NA NA NA NA

Hopefully the results can be as close as possible to the following:

   name year  V  W  X  Y  Z
1     a    1 11 11 11 11 11
2     a    2 21 21 21 21 21
3     a    3 31 31 31 31 31
4     a    4 41 41 41 41 41
5     b    1 11 11 11 11 11
6     b    2 21 21 21 21 21
7     b    3 31 31 31 31 31
8     b    4 41 41 41 41 41
9     c    1 11 11 11 11 NA
10    c    2 21 21 21 21 NA
11    c    3 31 31 31 31 NA
12    c    4 41 41 41 41 NA

I found this and this. I tried the following without grouping, but it didn't work:

data.frame(lapply(d, function(X) approxfun(seq_along(X), X)(seq_along(X))))

imputeTS::na_ma(d, k = 2, weighting = "simple")

The first one gave the error below:

Error in approxfun(seq_along(X), X) : 
  need at least two non-NA values to interpolate
In addition: Warning message:
In xy.coords(x, y, setLab = FALSE) :
 Error in approxfun(seq_along(X), X) : 
  need at least two non-NA values to interpolate

So I tried the second one, and it kept loading for a long time and nothing happened. According to the reply from the first link,

the package requires time series/vector input (that's why each column has to be called separately).

Any help is greatly appreciated!

Answer 1

You can use zoo::na.spline:

library(dplyr)

d %>%
  group_by(name) %>%
  mutate(across(V:Z, zoo::na.spline, na.rm = FALSE)) %>%
  ungroup

#   name   year     V     W     X     Y     Z
#   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a         1    11    11    11    11    11
# 2 a         2    21    21    21    21    21
# 3 a         3    31    31    31    31    31
# 4 a         4    41    41    41    41    41
# 5 b         1    11    11    11    11    11
# 6 b         2    21    21    21    21    21
# 7 b         3    31    31    31    31    31
# 8 b         4    41    41    41    41    41
# 9 c         1    41    31    11    21    NA
#10 c         2    41    31    11    21    NA
#11 c         3    41    31    11    21    NA
#12 c         4    41    31    11    21    NA

For name "c", I think it would be difficult to impute the missing values from only one number.
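If linear interpolation is preferred over a spline, the same per-group idea can be sketched in base R. This is only a sketch under assumptions: `interp_linear` is a hypothetical helper (not part of zoo or dplyr), it extends the first and last observed segments linearly to fill leading and trailing NAs, it assumes `year` is unique within each group, and it leaves a series untouched when fewer than two values are observed. Only columns V and Z are used to keep the example short.

```r
d <- data.frame(
  name = rep(c("a", "b", "c"), each = 4),
  year = rep(1:4, times = 3),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

# Hypothetical helper: linear interpolation of y over x, with the first and
# last observed segments extended to cover leading/trailing NAs.
interp_linear <- function(y, x) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)  # a line needs at least two observed points
  o  <- order(x[ok])
  xs <- x[ok][o]
  ys <- y[ok][o]
  n  <- length(xs)
  out <- approx(xs, ys, xout = x)$y  # NA outside the observed range
  lo <- x < xs[1]
  hi <- x > xs[n]
  out[lo] <- ys[1] + (x[lo] - xs[1]) * (ys[2] - ys[1]) / (xs[2] - xs[1])
  out[hi] <- ys[n] + (x[hi] - xs[n]) * (ys[n] - ys[n - 1]) / (xs[n] - xs[n - 1])
  out
}

cols <- c("V", "Z")
# Process the rows of each group separately
for (idx in split(seq_len(nrow(d)), d$name)) {
  d[idx, cols] <- lapply(d[idx, cols], interp_linear, x = d$year[idx])
}
d
```

With the sample data this fills groups "a" and "b" with 11, 21, 31, 41 in both columns. Group "c" keeps its NAs, because V has only one observed value there and Z has none.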

Answer 2

One issue I see is that some of the series you want to impute have only one non-NA value, so na_ma or na_interpolation from imputeTS (and similar functions from other packages) cannot be applied successfully, since they require at least two non-NA values.

That is why in this solution I created an impute_select function for you, which lets you choose what to do when more than one value is present, when exactly one value is present, or when there are only NAs.

In this case, when more than one value is present, it uses na_ma, but you could also use na_interpolation or any other imputation function from imputeTS here. When only one value is present, it uses na_locf, since this method also works with only one value in the series. When there are no non-NA values in the series, it uses na_replace, which simply replaces all the NAs with a default value (I just set it to 11).

By adjusting this function you should be able to tailor the imputation to the number of non-NA values in the series.

library("imputeTS")

d <-  data.frame(
  name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

impute_select <- function(x) {
  n_obs <- sum(!is.na(x))
  if (n_obs > 1) {
    # Method to use when more than one value is available
    result <- na_ma(x)
  } else if (n_obs == 1) {
    # Method to use when exactly one value is in the series
    result <- na_locf(x)
  } else {
    # Method to use when no non-NA value is present
    result <- na_replace(x, 11)
  }
  result
}

# Apply the function row-wise to the data frame (MARGIN = 1);
# imputation would usually happen column-wise instead
d[, 3:7] <- t(apply(d[, 3:7], MARGIN = 1, FUN = impute_select))

d

These are the results (hopefully exactly what you wanted):

   name year  V  W  X  Y  Z
1     a    1 11 11 11 11 11
2     a    2 21 21 21 21 21
3     a    3 31 31 31 31 31
4     a    4 41 41 41 41 41
5     b    1 11 11 11 11 11
6     b    2 21 21 21 21 21
7     b    3 31 31 31 31 31
8     b    4 41 41 41 41 41
9     c    1 11 11 11 11 11
10    c    2 21 21 21 21 21
11    c    3 31 31 31 31 31
12    c    4 41 41 41 41 41
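The question asks for imputation within each `name` group, one column at a time, whereas the call above works across the columns of each row. A sketch of the column-wise, per-group variant follows, assuming imputeTS is installed. `impute_by_count` and its `default` argument are hypothetical names, and `k = 2` with simple weighting is taken from the question's own attempt rather than being a recommendation.

```r
library(imputeTS)

d <- data.frame(
  name = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

# Hypothetical helper: pick a method based on how many values are observed
impute_by_count <- function(x, default = NA_real_) {
  n_obs <- sum(!is.na(x))
  if (n_obs > 1) {
    na_ma(x, k = 2, weighting = "simple")  # moving average needs >= 2 values
  } else if (n_obs == 1) {
    na_locf(x)                             # carry the single value both ways
  } else {
    rep(default, length(x))                # nothing observed: use the default
  }
}

cols <- c("V", "W", "X", "Y", "Z")
# ave() applies the helper to each name group within one column
d[cols] <- lapply(d[cols], function(col) ave(col, d$name, FUN = impute_by_count))
d
```

Note that a moving average does not extrapolate a trend, so a leading or trailing NA is filled with an average of nearby observed values rather than a continuation of the line. With `default = NA_real_`, a series with no observed values (Z in group "c") stays NA, which matches the desired output in the question.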

Answer 1
0 2022-02-05 08:39:53

Answer 2
0 2022-02-06 14:34:10
