Replace NAs with mean of the same column of a data.table
I want to replace the NAs in a column of a data.table with the mean of that same column. This is what I am doing, but it is not working.
library(data.table)
ww <- data.table(iris)
ww <- ww[1:5 , ]
ww[1,1] <- NA
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:           NA         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
ww[is.na(Sepal.Length) , Sepal.Length:= mean(Sepal.Length, na.rm = T)]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          NaN         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
Why am I getting NaN in place of NA, when it should have been the mean of the remaining values (4.9, 4.7, 4.6, 5.0)?
If something is wrong with this syntax, what is an alternative way of achieving this?
I want the data.table syntax.
na.aggregate in the zoo package replaces NAs with the mean of the non-NAs in the same column:
library(zoo)
ww[, Sepal.Length := na.aggregate(Sepal.Length)]
While the zoo answer is pretty nice, it requires a new dependency. Using just data.table, you could do the following.
library(data.table)
# prepare data
ww = data.table(iris[1:5,])
ww[1, Sepal.Length := NA]
# solution
ww[, Sepal.Length.mean := mean(Sepal.Length, na.rm = TRUE) # calculate mean
][is.na(Sepal.Length), Sepal.Length := Sepal.Length.mean # replace NA with mean
][, Sepal.Length.mean := NULL # remove mean col
][] # just prints
While it may look bulky compared to the zoo version, it is efficient, because every step uses update by reference (:=). It can also be easily adapted to replace NA with the mean by group, just by using the by argument of data.table.
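The by-group variant mentioned above could look roughly like the sketch below. It reuses the temporary-column approach from this answer; the table name dt, the helper column grp_mean, and the choice of rows 1 and 51 as the missing values are made up for illustration.

```r
library(data.table)

# Illustrative data: full iris with one NA in each of two species
dt <- data.table(iris)
dt[c(1, 51), Sepal.Length := NA]

# Per-group mean in a temporary column (grp_mean is a hypothetical name)
dt[, grp_mean := mean(Sepal.Length, na.rm = TRUE), by = Species]

# Fill only the NA rows from that column, then drop the helper
dt[is.na(Sepal.Length), Sepal.Length := grp_mean]
dt[, grp_mean := NULL]

print(dt[c(1, 51)])
```

Each NA is filled with the mean of its own species rather than the overall mean, and all three steps still update dt by reference.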
Your attempt subsetted the table first, selecting
> ww[is.na(Sepal.Length)]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:           NA         3.5          1.4         0.2  setosa
so any further operations can only 'see' these rows, i.e. the mean of Sepal.Length is computed over that single NA only.
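To see where the NaN comes from, here is a small sketch that computes the mean with and without the row filter. The variable names subset_mean and full_mean are only for illustration.

```r
library(data.table)

ww <- data.table(iris[1:5, ])
ww[1, Sepal.Length := NA]

# With a filter in i, j is evaluated on the matching rows only.
# mean() receives a single NA; na.rm = TRUE drops it, leaving nothing to average.
subset_mean <- ww[is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE)]

# Without a filter, j sees the whole column.
full_mean <- ww[, mean(Sepal.Length, na.rm = TRUE)]

print(subset_mean)  # NaN
print(full_mean)
```

The mean of an empty numeric vector is NaN in R, which is exactly the value that ended up in row 1.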
The data.table solution you want is below. It looks at the whole table and replaces the NAs with the mean using ifelse.
ww[, Sepal.Length := ifelse(is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE), Sepal.Length)]
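As a possible variation on the line above, data.table also provides fifelse, a faster and stricter counterpart to base ifelse. This sketch assumes a reasonably recent data.table (fifelse was added in the 1.12.x series, as far as I know) and that both branches have the same type, which holds here because the column and its mean are both double.

```r
library(data.table)

ww <- data.table(iris[1:5, ])
ww[1, Sepal.Length := NA]

# No filter in i, so mean() sees the whole column;
# fifelse picks the mean for NA rows and the original value otherwise.
ww[, Sepal.Length := fifelse(is.na(Sepal.Length),
                             mean(Sepal.Length, na.rm = TRUE),
                             Sepal.Length)]

print(ww)
```

On a five-row table the speed difference is irrelevant; the main practical difference is that fifelse raises an error if the two branches have different types instead of silently coercing them.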
In base R:
ww$Sepal.Length[is.na(ww$Sepal.Length)] <- mean(ww$Sepal.Length, na.rm = TRUE)
It is not taking the mean of the entire Sepal.Length column, only of the one row that you have selected with is.na(Sepal.Length).
Rather use:
ww[is.na(Sepal.Length) , Sepal.Length:= mean(ww$Sepal.Length, na.rm=TRUE)]
tidyr has a built-in function, replace_na, that you can use for this:
library(tidyr)
ww %>% replace_na(list(Sepal.Length = mean(.$Sepal.Length, na.rm = TRUE)))