Replace NAs with mean of the same column of a data.table

I want to replace the NAs in a column of a data.table with the mean of the same column. I am doing the following, but it is not working.

library(data.table)

ww <- data.table(iris)

ww <- ww[1:5 , ]

ww[1,1] <- NA

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:           NA         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa


ww[is.na(Sepal.Length) , Sepal.Length:= mean(Sepal.Length, na.rm = T)]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          NaN         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa

Why am I getting NaN in place of NA when it should have been the mean of the rest of the values (4.9, 4.7, 4.6, 5.0)?

What is the alternative way of achieving this, in case something is wrong with this syntax?

I want the data.table syntax.

na.aggregate in the zoo package replaces NAs with the mean of the non-NAs in the same column:

library(zoo)

ww[, Sepal.Length := na.aggregate(Sepal.Length)]

While the zoo answer is pretty nice, it requires a new dependency.
Using just data.table you could do the following.

library(data.table)

# prepare data
ww = data.table(iris[1:5,])
ww[1, Sepal.Length := NA]

# solution
ww[, Sepal.Length.mean := mean(Sepal.Length, na.rm = TRUE) # calculate mean
   ][is.na(Sepal.Length), Sepal.Length := Sepal.Length.mean # replace NA with mean
     ][, Sepal.Length.mean := NULL # remove mean col
       ][] # just prints

While it may look bulky compared to the zoo one-liner, it is efficient because every step updates by reference with := . It can also be easily adapted to replace NAs with the mean by group, just by using the by argument in data.table.
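As a rough sketch of the by-group variant mentioned above: the six-row table and the second NA (in a versicolor row) are made up here for illustration only, and ifelse is used instead of the temporary mean column to keep it short.

```r
library(data.table)

# illustrative data: three rows from each of two species, one NA per species
ww <- data.table(iris[c(1:3, 51:53), ])
ww[c(1, 4), Sepal.Length := NA_real_]

# replace each NA with the mean of the non-NA values of the same Species
ww[, Sepal.Length := ifelse(is.na(Sepal.Length),
                            mean(Sepal.Length, na.rm = TRUE),
                            Sepal.Length),
   by = Species]
```

Because the mean is computed inside each by group, the setosa NA and the versicolor NA should each receive the mean of their own species rather than the overall mean.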

Your attempt subsetted the table first, selecting

> ww[is.na(Sepal.Length)]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:           NA         3.5          1.4         0.2  setosa

so any further operations can only 'see' these rows, i.e. Sepal.Length can only see that one NA. With na.rm = TRUE that NA is dropped, nothing is left, and the mean of an empty vector is NaN.
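The effect of subsetting first can be checked in plain R. This is only a small illustration using the values from the question; the names only_na and full are made up for the example.

```r
# the subset contains only the NA, so after na.rm = TRUE nothing is left
only_na <- c(NA_real_)
mean(only_na, na.rm = TRUE)   # mean of an empty vector is NaN

# the full column still has the four observed values
full <- c(NA, 4.9, 4.7, 4.6, 5.0)
mean(full, na.rm = TRUE)      # 4.8
```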

The data.table solution you want is below. It looks at the whole table and replaces the NAs with the mean using ifelse.

ww[, Sepal.Length := ifelse(is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE), Sepal.Length)]

In base R:

ww$Sepal.Length[is.na(ww$Sepal.Length)] <- mean(ww$Sepal.Length, na.rm = TRUE)

It is not taking the mean of the entire Sepal.Length column, only of the one row that you have selected.

Rather, use:

ww[is.na(Sepal.Length) , Sepal.Length:= mean(ww$Sepal.Length, na.rm=TRUE)]

tidyr has a built-in function, replace_na, that you can use for this:

library(tidyr)
ww %>% replace_na(list(Sepal.Length = mean(.$Sepal.Length, na.rm = TRUE)))
