Replacing or imputing NA values in R without For Loop

Is there a better way to go through the observations in a data frame and impute NA values? I've put together a for loop that seems to do the job, replacing each NA with the mean of its row, but I'm wondering whether there is a better approach that avoids the for loop, perhaps a built-in R function?

# 1. Create data frame with some NA values. 

rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)  
df2 <- df

# 2. Run for loop to replace NAs with that row's mean.

for (i in seq_len(nrow(df))) {   # for every row
  x <- as.numeric(df[i, ])       # extract that row into a numeric vector
  y <- is.na(x)                  # logical vector marking the NAs
  z <- !is.na(x)                 # logical vector marking the non-NAs
  result <- mean(x[z])           # mean of the row's observed values
  df2[i, y] <- result            # replace the NAs in that row
}

# 3. Show output with imputed row mean values.

print(df)  # before
print(df2) # after 

Here's a possible vectorized approach (without any loop):

indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]

Some explanation

We can identify the locations of the NAs using the arr.ind parameter of which. Then we can simply index df (by the row and column indexes) and the row means (only by the row indexes) and replace the values accordingly.
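To make the indexing step concrete, here is a minimal sketch on a small made-up data frame (toy and its values are assumptions for illustration, not data from the question). It shows what the matrix returned by which looks like and how the row means are matched to it.

```r
# Hypothetical 3 x 3 data frame with two missing values
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), c = c(7, 8, 9))

# One row per NA, with columns named "row" and "col"
indx <- which(is.na(toy), arr.ind = TRUE)
print(indx)

# Row means ignoring NAs, then one mean picked per NA position
rmeans <- rowMeans(toy, na.rm = TRUE)
fill <- rmeans[indx[, "row"]]

# Matrix indexing assigns each fill value to its (row, col) cell
toy[indx] <- fill
print(toy)
```

Because indx[, "row"] repeats a row number once for every NA in that row, the fill vector always has exactly one value per missing cell.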

One possibility is impute from Hmisc, which lets you choose any function to do the imputation:

library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))

Also, you can hide the loop in an apply:

t(apply(df2, 1, function(x) {
    mu <- mean(x, na.rm = TRUE)  # row mean, ignoring NAs
    x[is.na(x)] <- mu            # fill the NAs in this row
    x
}))
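Note that apply returns a matrix rather than a data frame. If a data frame is needed, one option is to wrap the call in a small helper; the sketch below assumes a hypothetical function name, impute_row_means, and a made-up toy data frame.

```r
# Hypothetical helper: replace each NA with the mean of its row
impute_row_means <- function(d) {
  filled <- t(apply(d, 1, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  }))
  # apply() returns a matrix, so convert back to a data frame
  as.data.frame(filled)
}

# Made-up example data with two missing values
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), c = c(7, 8, 9))
result <- impute_row_means(toy)
print(result)
```

The t() is needed because apply over rows stores each returned row as a column of its result.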

Data:

set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)

This is a little trickier than I'd like: it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but haven't managed so far.

rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
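The recycling this relies on can be made visible by building the recycled matrix explicitly. This is only an illustrative sketch on a made-up data frame (toy, recycled and filled are assumed names): because R fills matrices column by column, a vector of length nrow repeats once per column, so every cell in row i ends up holding the mean of row i.

```r
# Made-up example data with two missing values
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), c = c(7, 8, 9))
rmeans <- rowMeans(toy, na.rm = TRUE)

# Recycling rmeans down each column gives one row mean per row
recycled <- matrix(rmeans, nrow = nrow(toy), ncol = ncol(toy))
print(recycled)

# Same replacement as above, written with the explicit matrix
filled <- toy
filled[] <- ifelse(is.na(toy), recycled, as.matrix(toy))
print(filled)
```

The alignment only holds when the recycled vector has one element per row; a vector of column means would not line up the same way.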
