Replace NAs in R with the nearest value

Question

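I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data: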

dat <- c(1, 3, NA, NA, 5, 7)

Replacing the NAs with na.locf (3 is carried forward):

library(zoo)
na.locf(dat)
# 1 3 3 3 5 7

and na.locf with fromLast set to TRUE (5 is carried backward):

na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7

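But I'd like the nearest non-NA value to be used. In my example, this means that the 3 should be carried forward to the first NA, and the 5 should be carried backward to the second NA: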

1 3 3 5 5 7

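I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?

FYI, my current code is as follows. If nothing else, perhaps someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this: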

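f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # nothing to fill from if every entry is NA
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # for each NA, index (into non.na.pos) of the closest non-NA position
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}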

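Answers to smci's questions below:

No, any entry can be NA
If all are NA, leave them as is
No. My current solution defaults to the nearest value on the left, but it doesn't matter
These rows typically have a few hundred thousand elements, so in theory the upper bound would be a few hundred thousand. In reality there would be no more than a few here and there, typically a single one.

Update: So it turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!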

Answer 1

Here is a very fast one. It uses findInterval to find which two positions should be considered for each NA in your original data:

f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  right.pos  <- non.na.pos[pmin(N, intervals+1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}
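
To see what findInterval contributes, the steps of f1 can be traced by hand on the sample vector from the question. This is only an illustrative walk-through of the same calls, with the intermediate names copied from f1; the values in the comments were worked out for this one input and are not a general proof.

```r
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))    # 3 4
non.na.pos <- which(!is.na(dat))   # 1 2 5 6

# For each NA, the index (into non.na.pos) of the non-NA position just to
# its left; all.inside = TRUE keeps the index within 1..length(non.na.pos)-1
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)   # 2 2

left.pos   <- non.na.pos[intervals]       # 2 2
right.pos  <- non.na.pos[intervals + 1]   # 5 5
left.dist  <- na.pos - left.pos           # 1 2
right.dist <- right.pos - na.pos          # 2 1

# Take the left neighbour when it is at least as close, else the right one
filled <- dat
filled[na.pos] <- ifelse(left.dist <= right.dist, dat[left.pos], dat[right.pos])
filled
# 1 3 3 5 5 7
```

Because the comparison is `<=`, a tie between the two neighbours goes to the left one.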

Here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat))    # your function
# user  system elapsed 
# 5.52    0.00    5.52
system.time(r1 <- f1(dat))    # this function
# user  system elapsed 
# 0.01    0.00    0.03
identical(r0, r1)
# [1] TRUE

Answer 2

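Code below. The initial question was not totally well defined, so I had asked for these clarifications:

Is it guaranteed that at least the first and/or last entries are non-NA? [No]
What to do if all entries in a row are NA? [Leave as is]
Do you care how ties are split, i.e. how to treat the middle NA in 1 3 NA NA NA 5 7? [Don't care / left]
Do you have an upper bound (S) on the longest contiguous span of NAs in a row? (I'm thinking of a recursive solution if S is small, or a dataframe solution with ifelse if S is large and the number of rows and columns is large.) [Worst case S could be pathologically large, hence recursion shouldn't be used]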

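geoffjentry, regarding your solution: your bottleneck will be the serial calculation of nearest.non.na.pos and the serial assignment dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]. For a large gap of length G, all we really need to compute is that the first (G/2, rounded up) items fill from the left and the rest fill from the right. (I could post an answer using ifelse, but it would look similar.) Are your criteria runtime, big-O efficiency, temp memory usage, or code legibility?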

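Coupla possible tweaks:

Only need to compute N <- length(dat) once
Common-case speed enhancement: if (length(na.pos) == 0), skip the row, since it has no NAs
if (length(na.pos) == length(dat)-1), the (rare) case where there is only one non-NA entry, so we fill the entire row with it

Outline solution:

Sadly na.locf doesn't work on an entire dataframe; you must use sapply, row-wise: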

na.fill_from_nn <- function(x) {
  row.na <- is.na(x)
  fillFromLeft <- na.locf(x, na.rm=FALSE) 
  fillFromRight <- na.locf(x, fromLast=TRUE, na.rm=FALSE)

  disagree <- rle(fillFromLeft!=fillFromRight)
  for (loc in (disagree)) { ...  resolve conflicts, row-wise }
}

sapply(dat, na.fill_from_nn)
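
The outline above leaves the conflict-resolution step elided. One way it could be completed for a single vector is sketched below. This is not the answer's code: it swaps na.locf for base R cummax/cummin so that positions (rather than values) are carried from each side, and the helper name fill_nearest is hypothetical. Ties go to the left, matching the clarifications above.

```r
# Hypothetical helper (an assumption, not from the answer): carry the
# *position* of the nearest non-NA from the left and from the right,
# then pick whichever is closer.
fill_nearest <- function(x) {
  n  <- length(x)
  ok <- !is.na(x)
  if (!any(ok) || all(ok)) return(x)   # all NA, or nothing to fill

  idx <- seq_len(n)
  # position of the last non-NA at or before each element (0 if none)
  left.pos  <- cummax(ifelse(ok, idx, 0L))
  # position of the next non-NA at or after each element (n + 1 if none)
  right.pos <- rev(cummin(rev(ifelse(ok, idx, n + 1L))))

  # distances; Inf marks "no neighbour on that side"
  left.dist  <- ifelse(left.pos == 0L, Inf, idx - left.pos)
  right.dist <- ifelse(right.pos == n + 1L, Inf, right.pos - idx)

  # ties are resolved in favour of the left neighbour
  src <- ifelse(left.dist <= right.dist, left.pos, right.pos)
  x[src]
}

fill_nearest(c(1, 3, NA, NA, 5, 7))
# 1 3 3 5 5 7
```

For a gap of length G this fills the first ceiling(G/2) entries from the left and the rest from the right, without any explicit loop over gaps.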

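Alternatively, since contiguous NAs are rare (as you say), do a fast-and-dumb ifelse to fill isolated NAs from the left. This operates dataframe-wise, which makes the common case fast. Then handle all the other cases with a row-wise for loop. (This will affect the tiebreak on middle elements in a long span of NAs, but you say you don't care.)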
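
A minimal sketch of that "fast-and-dumb" first pass on a single vector, assuming "isolated" means an NA whose two neighbours are both known. The variable names are illustrative only; longer runs are deliberately left untouched for a second pass.

```r
x <- c(1, NA, 3, NA, NA, 6)

prev <- c(NA, head(x, -1))   # left neighbour of each element
nxt  <- c(tail(x, -1), NA)   # right neighbour of each element

# An NA is treated as isolated when both of its neighbours are non-NA
isolated <- is.na(x) & !is.na(prev) & !is.na(nxt)

# Fill isolated NAs from the left; everything else is kept as is
x1 <- ifelse(isolated, prev, x)
x1
# 1 1 3 NA NA 6
```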

Answer 3

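I can't think of an obvious simple solution, but, having looked at the suggestions (particularly smci's suggestion of using rle), I came up with a complicated function that appears to be more efficient.

This is the code; I'll explain it below: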

# Your function
your.func = function(dat) {
  na.pos <- which(is.na(dat))
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  nearest.non.na.pos <- sapply(na.pos, function(x) which.min(abs(non.na.pos - x)))
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}

# My function
my.func = function(dat) {
    nas=is.na(dat)
    if (!any(!nas)) return (dat)
    t=rle(nas)
    # lapply rather than sapply: sapply would simplify equal-length runs to a matrix
    f=lapply(t$lengths[t$values],seq_len)
    a=unlist(f)
    b=unlist(lapply(f,rev))
    x=which(nas)
    l=length(dat)
    dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
    dat
}


# Test
n = 100000
test.vec = 1:n
set.seed(1)
test.vec[sample(test.vec,n/4)]=NA

system.time(t1<-my.func(test.vec))
system.time(t2<-your.func(test.vec)) # 10 times speed improvement on my machine

# Verify
any(t1!=t2)

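My function relies on rle. I was reading the comments above, but it looks to me like rle works just fine for NA. It is easiest to explain with a small example.

If I start with a vector: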

dat=c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)

Then I get the positions of all the NAs:

x=c(5,6,7,9,13,14,15,16,17)

Then, for every "run" of NAs, I create a sequence from 1 to the length of the run:

a=c(1,2,3,1,1,2,3,4,5)

Then I do it again, but I reverse the sequences:

b=c(3,2,1,1,5,4,3,2,1)

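Now I can just compare vectors a and b: if a<=b, look back and grab the value at x-a. If a>b, look ahead and grab the value at x+b. The rest is just handling the corner cases where everything is NA, or where NA runs sit at the start or the end of the vector.

There is probably a better, simpler solution, but I hope this gets you started.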
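
The small example above can be reproduced step by step. This is only a sketch of the interior case (every NA run has a non-NA value on both sides), so it omits the boundary handling that my.func does with its nested ifelse calls; the names na.run.lengths, src, and filled are illustrative.

```r
dat <- c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)

nas  <- is.na(dat)
runs <- rle(nas)
na.run.lengths <- runs$lengths[runs$values]   # lengths of the NA runs: 3 1 5

x <- which(nas)                                # positions of the NAs
a <- unlist(lapply(na.run.lengths, seq_len))   # 1..k within each run
b <- unlist(lapply(na.run.lengths,
                   function(k) rev(seq_len(k))))   # k..1 within each run

# a is the distance back to the value before the run,
# b is the distance forward to the value after the run
src <- ifelse(a > b, x + b, x - a)

filled <- dat
filled[nas] <- dat[src]
filled
# 1 2 3 4 4 4 8 8 8 10 11 12 12 12 12 18 18 18
```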

Answer 4

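I like all the rigorous solutions. Though it is not directly what was asked, I found this post while looking for a way to fill NA values by interpolation. After reviewing this post I discovered na.fill, which works on a zoo object (vector, factor, or matrix):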

z <- c(1,2,3,4,5,6,NA,NA,NA,2,3,4,5,6,NA,NA,4,6,7,NA)
z1 <- zoo::na.fill(z, "extend")

Note the smooth transition across the NA values:

round(z1, 0)
#>  [1] 1 2 3 4 5 6 5 4 3 2 3 4 5 6 5 5 4 6 7 7

Maybe this could help.
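
For readers without zoo installed, linear interpolation with end extension can also be sketched in base R with approx. This is a swapped-in technique, not what the answer uses, and it has only been checked by hand against the rounded output shown above for this one input; rule = 2 makes approx repeat the closest known value beyond the first and last observations.

```r
z <- c(1,2,3,4,5,6,NA,NA,NA,2,3,4,5,6,NA,NA,4,6,7,NA)

ok <- !is.na(z)
# Interpolate linearly between known points; rule = 2 extends the ends
z2 <- approx(x = which(ok), y = z[ok], xout = seq_along(z), rule = 2)$y

round(z2, 0)
# 1 2 3 4 5 6 5 4 3 2 3 4 5 6 5 5 4 6 7 7
```

Note that this interpolates rather than copying the nearest value, so it answers a slightly different question from the one originally asked.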

Answer 5

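Here's my stab at it. I never like to see a for loop in R, but in the case of a sparsely-NA vector, it looks like it will actually be more efficient (performance metrics below). The gist of the code is below.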

  #get the index of all NA values
  nas <- which(is.na(dat))

  #get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
  namask <- is.na(dat)

  #calculate the maximum size of a run of NAs
  length <- getLengthNAs(dat);

  #the furthest away an NA value could be is half of the length of the maximum NA run
  windowSize <- ceiling(length/2)

  #loop through all NAs
  for (thisIndex in nas){
    #extract the neighborhood of this NA, clamped to the bounds of the vector
    lo <- max(1, thisIndex - windowSize)
    hi <- min(length(dat), thisIndex + windowSize)
    neighborhood <- dat[lo:hi]
    #any already-filled-in values which were NA can be replaced with NAs
    neighborhood[namask[lo:hi]] <- NA

    #the position of this NA within the neighborhood
    center <- thisIndex - lo + 1

    #compute the difference within this neighborhood to find the nearest non-NA value
    delta <- center - which(!is.na(neighborhood))

    #find the closest replacement
    replacement <- delta[abs(delta) == min(abs(delta))]
    #in case length > 1, just pick the first
    replacement <- replacement[1]

    #replace with the nearest non-NA value.
    dat[thisIndex] <- dat[(thisIndex - (replacement))]
  }

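I liked the code you proposed, but I noticed that it calculates the delta between every NA value and every non-NA index in the vector. I think that was the biggest performance hog. Instead, I just extract the minimum-sized neighborhood, or window, around each NA and find the nearest non-NA value within that window.

So the performance scales linearly with the number of NAs and the window size, where the window size is (the ceiling of) half the length of the maximum run of NAs. To calculate the length of the maximum run of NAs, you can use the following function: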

getLengthNAs <- function(dat){
  nas <- which(is.na(dat))
  spacing <- diff(nas)
  length <- 1;
  while (any(spacing == 1)){        
    length <- length + 1;
    spacing <- diff(which(spacing == 1))
  }
    length
}
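
The same quantity can also be read off rle, as a possible cross-check. The helper name max_na_run is hypothetical, and unlike getLengthNAs it is written to return 0 when there are no NAs at all, which is an assumption about the desired behavior in that case.

```r
# Hypothetical alternative (not from the answer): longest run of NAs via rle
max_na_run <- function(dat) {
  runs <- rle(is.na(dat))
  na.runs <- runs$lengths[runs$values]   # lengths of the NA runs only
  if (length(na.runs) == 0) 0L else max(na.runs)
}

max_na_run(c(1, NA, NA, 3, NA))
# 2
```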

Performance comparison

#create a test vector with 10% NAs and length 50,000.
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

#the a() function is the code posted in the question
a <- function(dat){
  na.pos <- which(is.na(dat))
    if (length(na.pos) == length(dat)) {
        return(dat)
    }
    non.na.pos <- setdiff(seq_along(dat), na.pos)
    nearest.non.na.pos <- sapply(na.pos, function(x) {
        return(which.min(abs(non.na.pos - x)))
    })
    dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
    dat
}

#my code
b <- function(dat){
    #the same code posted above, but with some additional helper code to sanitize the input
    if(is.null(dat)){
      return(NULL);
    }

    if (all(is.na(dat))){
      stop("Can't impute NAs if there are no non-NA values.")
    }

    if (!any(is.na(dat))){
      return(dat);
    }

    #starts with an NA (or multiple), handle these
    if (is.na(dat[1])){
      firstNonNA <- which(!is.na(dat))[1]
      dat[1:(firstNonNA-1)] <- dat[firstNonNA]
    }

    #ends with an NA (or multiple), handle these
    if (is.na(dat[length(dat)])){
      lastNonNA <- which(!is.na(dat))
      lastNonNA <- lastNonNA[length(lastNonNA)]
      dat[(lastNonNA+1):length(dat)] <- dat[lastNonNA]
    }

    #get the index of all NA values
    nas <- which(is.na(dat))

    #get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
    namask <- is.na(dat)

    #calculate the maximum size of a run of NAs
    length <- getLengthNAs(dat);

    #the furthest away an NA value could be is half of the length of the maximum NA run
    #if there's a run at the beginning or end, then the nearest non-NA value could possibly be `length` away, so we need to keep the window large for that case.
    windowSize <- ceiling(length/2)

    #loop through all NAs
    for (thisIndex in nas){
      #extract the neighborhood of this NA, clamped to the bounds of the vector
      lo <- max(1, thisIndex - windowSize)
      hi <- min(length(dat), thisIndex + windowSize)
      neighborhood <- dat[lo:hi]
      #any already-filled-in values which were NA can be replaced with NAs
      neighborhood[namask[lo:hi]] <- NA

      #the position of this NA within the neighborhood
      center <- thisIndex - lo + 1

      #compute the difference within this neighborhood to find the nearest non-NA value
      delta <- center - which(!is.na(neighborhood))

      #find the closest replacement
      replacement <- delta[abs(delta) == min(abs(delta))]
      #in case length > 1, just pick the first
      replacement <- replacement[1]

      #replace with the nearest non-NA value.
      dat[thisIndex] <- dat[(thisIndex - (replacement))]
    }
    dat
}

#nograpes' answer on this question
c <- function(dat){
  nas=is.na(dat)
  if (!any(!nas)) return (dat)
  t=rle(nas)
  # lapply rather than sapply: sapply would simplify equal-length runs to a matrix
  f=lapply(t$lengths[t$values],seq_len)
  a=unlist(f)
  b=unlist(lapply(f,rev))
  x=which(nas)
  l=length(dat)
  dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
  dat
}

#run 10 times each to get average performance.
sum <- 0; for (i in 1:10){ sum <- sum + system.time(a(dat))["elapsed"];}; cat ("A: ", sum/10)
# A:  5.059
sum <- 0; for (i in 1:10){ sum <- sum + system.time(b(dat))["elapsed"];}; cat ("B: ", sum/10)
# B:  0.126
sum <- 0; for (i in 1:10){ sum <- sum + system.time(c(dat))["elapsed"];}; cat ("C: ", sum/10)
# C:  0.287

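So it looks like this code (at least under these conditions) offers about a 40x speedup over the original code posted in the question, and a 2.2x speedup over nograpes' rle answer (though I'd imagine an rle solution would certainly be faster in some situations, including a more NA-rich vector).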

Answer 6

About 3-4x slower than the selected answer. Mine is pretty simple, though. It's a rare while loop, too.

f2 <- function(x){

  # check if all are NA to skip loop
  if(!all(is.na(x))){

    # replace NA's until they are gone
    while(anyNA(x)){

      # replace from the left
      x[is.na(x)] <- c(NA,x[1:(length(x)-1)])[is.na(x)]

      # replace from the right
      x[is.na(x)] <- c(x[-1],NA)[is.na(x)]
    }
  }

  # return original or fixed x
  x
}
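
To make the loop body easier to follow, here is a sketch of a single iteration traced on a short vector. It uses head and tail to build the shifted copies, which is an equivalent spelling of the indexing in f2 rather than the answer's exact code; the names left and right are illustrative.

```r
x <- c(1, NA, NA, NA, 7)

# Pass 1: offer every element its left neighbour; only NAs accept it
left <- c(NA, head(x, -1))
x[is.na(x)] <- left[is.na(x)]
after.left <- x
# 1 1 NA NA 7

# Pass 2: offer every remaining NA its right neighbour
right <- c(tail(x, -1), NA)
x[is.na(x)] <- right[is.na(x)]
after.right <- x
# 1 1 NA 7 7
```

Each iteration shrinks every gap by one element from each side, so the number of iterations grows with the longest NA run rather than with the vector length. Because the left pass runs first, the middle element of an odd-length gap is filled from the left.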

Answer summary (score, posted)

Answer 1: score 27, accepted, 2012-04-10 01:30:04
Answer 2: score 6, 2012-04-09 19:37:06
Answer 3: score 4, 2012-04-09 21:52:56
Answer 4: score 4, 2017-04-13 09:11:51
Answer 5: score 2, 2012-04-10 00:10:51
Answer 6: score 1, 2015-08-05 05:09:32
