Replacing missing value with mean of 2 nearest rows in R

I have a data.table with a missing value, and I want to replace it with the average of the two nearest rows.

library(data.table)
A <- data.table(id = c(1:10),
                Value = c(1:3,NA,5:10))

> A
    id Value
 1:  1     1
 2:  2     2
 3:  3     3
 4:  4    NA
 5:  5     5
 6:  6     6
 7:  7     7
 8:  8     8
 9:  9     9
10: 10    10

For example, I want the NA to be replaced by the mean of row 3 and row 5, which is 4.

na.approx in the zoo package does that. If there can be leading or trailing NA values and you want to:

  • extend the nearest non-NA values, add the rule = 2 argument to na.approx, or
  • leave those as NA, add the na.rm = FALSE argument to na.approx.

See ?na.approx for further arguments. Other possibilities from the same package include na.spline (fill in with a cubic spline fit), na.aggregate (mean of all non-NA values), na.locf (last observation carried forward) and na.StructTS (seasonal Kalman filter).

library(zoo)

A[, list(Value = na.approx(Value))]

giving:

    Value
 1:     1
 2:     2
 3:     3
 4:     4
 5:     5
 6:     6
 7:     7
 8:     8
 9:     9
10:    10
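
The leading/trailing NA options mentioned above are easier to compare side by side. The following is a minimal sketch, not part of the original answer: the vector x is a made-up example, and the last three lines show one possible way to update the question's table by reference so that the id column is kept.

```r
library(data.table)
library(zoo)

# Hypothetical vector with a leading and a trailing NA
x <- c(NA, 1, NA, 3, NA)

dropped  <- na.approx(x)                 # default na.rm = TRUE: leading/trailing NA are dropped
kept_na  <- na.approx(x, na.rm = FALSE)  # leading/trailing NA stay NA, length is preserved
extended <- na.approx(x, rule = 2)       # leading/trailing NA take the nearest non-NA value

# Updating the question's table by reference while keeping the id column
A <- data.table(id = 1:10, Value = c(1:3, NA, 5:10))
A[, Value := as.numeric(Value)]                # integer -> double so interpolated values fit
A[, Value := na.approx(Value, na.rm = FALSE)]  # na.rm = FALSE keeps the length at nrow(A)
```

Inside a `:=` assignment, na.rm = FALSE (or rule = 2) is the safer choice, because the default can return a vector shorter than the column when the data start or end with NA.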

I have written some code that works with more than one consecutive NA in your data table.

library(data.table)
A <- data.table(id = c(1:11),
            Value = c(1,5:6,NA,10:12,NA,NA,NA,6))



library(dplyr)

# Maximum length of a run of contiguous NAs in the column
a <- max(diff(which(!is.na(A$Value))) - 1)

# Run the for loop repeatedly and stop once every NA has been filled.
# Note: this never terminates if the column starts or ends with NA.
repeat {
  for (i in 1:a) {
    na_idx <- which(is.na(A$Value))
    # mean of the previous value and the value i rows ahead
    A$Value[na_idx] <- ((dplyr::lag(A$Value, 1) + dplyr::lead(A$Value, i)) / 2)[na_idx]
  }
  if (!anyNA(A$Value)) break
}

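For every NA in the vector, the expression inside the for loop calculates the mean of the value just before the NA and the next available one. Within a run of consecutive NAs, the "previous" value is the one that was just filled in, so the three NAs between 12 and 6 in the example become 9, 7.5 and 6.75.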

This is definitely not the most elegant or efficient solution, as there is a lot of repetition, but I believe it works with more than one NA in the manner you desire.
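
The same rule (mean of the value just before the NA and the next available one) can probably be written as a single pass over the NA positions. The sketch below is only an illustration: fill_prev_next_mean is a hypothetical helper, not a function from any package, and it leaves leading or trailing NAs untouched instead of looping forever. For comparison, it also shows two other readings of "nearest rows" for runs of NAs, using zoo's na.locf and na.approx.

```r
library(zoo)

# Hypothetical helper: fill each NA with the mean of the value just before it
# (which may itself have been filled a moment ago) and the next non-NA value.
fill_prev_next_mean <- function(x) {
  for (i in which(is.na(x))) {
    nxt <- which(!is.na(x) & seq_along(x) > i)
    # Skip NAs that have no usable value before or after them
    if (i == 1 || is.na(x[i - 1]) || length(nxt) == 0) next
    x[i] <- (x[i - 1] + x[nxt[1]]) / 2
  }
  x
}

x <- c(1, 5:6, NA, 10:12, NA, NA, NA, 6)

# Same rule as the repeat/for loop: the run of three NAs becomes 9, 7.5, 6.75
sequential <- fill_prev_next_mean(x)

# Mean of the nearest non-NA value on each side: the run becomes 9, 9, 9
nearest_mean <- (na.locf(x, na.rm = FALSE) +
                 na.locf(x, fromLast = TRUE, na.rm = FALSE)) / 2

# Linear interpolation: the run becomes 10.5, 9, 7.5
interpolated <- na.approx(x, na.rm = FALSE)
```

All three agree on an isolated NA (row 4 becomes 8 here), so the choice only matters when several NAs occur in a row.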
