Impute missing values with the average of the remainder
I have a data frame of the following form:
Weight Day Hour
NA M 0
NA M 1
2 M 2
1 M 3
4 T 0
5 T 1
NA T 2
2 T 3
3 W 0
3 W 1
1 W 2
NA W 3
For a given NA value in Weight, I want to replace it with the average of the non-NA values having the same value for Hour. For example, the first value in Weight is NA. Its Hour value is 0, so I want to average the other Weights where Hour is 0 (those values being 4 and 3). I then want to replace the NA with the computed average (3.5).

As an R beginner, I'd like to see a clear, multistep process for this. (I'm posing this as a learning exercise rather than a specific "solve this problem" type question. I'm not interested in who can do it in the fewest characters.)
You can use ave for such operations:
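# ave() applies FUN to the Weight values within each Hour group
dat$Weight <-
  ave(dat$Weight, dat$Hour, FUN = function(x) {
    mm <- mean(x, na.rm = TRUE)  # group mean, ignoring NAs
    ifelse(is.na(x), mm, x)      # fill NAs with that mean
  })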
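
Since the question asks for a multistep process, the same idea can also be spelled out one step at a time in base R. This is only a minimal sketch: dat is rebuilt here from the table in the question, and the names hour_means and missing are illustrative, not part of the answer above.

```r
# Recreate the example data from the question
dat <- data.frame(
  Weight = c(NA, NA, 2, 1, 4, 5, NA, 2, 3, 3, 1, NA),
  Day    = rep(c("M", "T", "W"), each = 4),
  Hour   = rep(0:3, times = 3)
)

# Step 1: mean Weight for each Hour, ignoring NAs
# (the result is named by Hour: "0", "1", "2", "3")
hour_means <- tapply(dat$Weight, dat$Hour, mean, na.rm = TRUE)

# Step 2: find the rows whose Weight is missing
missing <- is.na(dat$Weight)

# Step 3: look up the mean for each missing row's Hour and fill it in
dat$Weight[missing] <- hour_means[as.character(dat$Hour[missing])]

dat
```

Inspecting hour_means and missing after each step makes it easy to check intermediate results, for example that the mean for Hour 0 is 3.5 as computed by hand in the question.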
You could also use data.table:
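library(data.table)
# Note: list(...) in j returns a new table with only Hour and Weight;
# assign the result if you want to keep it
setDT(dat)[, list(Weight = replace(Weight, is.na(Weight),
                                   mean(Weight, na.rm = TRUE))), by = Hour]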
Or, to update dat in place:
setDT(dat)[, Weight1:=mean(Weight, na.rm=TRUE), by=Hour][,
Weight:=ifelse(is.na(Weight), Weight1, Weight)][, Weight1:=NULL]
Here's a dplyr solution. It is both very fast and easy to understand (because of its piped structure), so it could be a good start for a beginner. Assuming df is your data set:
library(dplyr)
df %>%                                       # Select your data set
  group_by(Hour) %>%                         # Group by Hour
  mutate(Weight = ifelse(is.na(Weight),
                         mean(Weight, na.rm = TRUE),
                         Weight))            # Replace all NAs with the mean
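
Note that the pipeline above prints its result without changing df. A minimal sketch of storing the result, assuming the example data from the question (the name result is illustrative, and the trailing ungroup() is an optional addition so that later steps are not still grouped by Hour):

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Weight = c(NA, NA, 2, 1, 4, 5, NA, 2, 3, 3, 1, NA),
  Day    = rep(c("M", "T", "W"), each = 4),
  Hour   = rep(0:3, times = 3)
)

result <- df %>%
  group_by(Hour) %>%                          # one group per Hour
  mutate(Weight = ifelse(is.na(Weight),
                         mean(Weight, na.rm = TRUE),
                         Weight)) %>%         # fill NAs with the group mean
  ungroup()                                   # drop the grouping afterwards

result
```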