
Impute missing values with the average of the remainder

I have a data frame of the form:

Weight  Day     Hour
NA      M       0
NA      M       1
2       M       2
1       M       3
4       T       0
5       T       1
NA      T       2
2       T       3
3       W       0
3       W       1
1       W       2
NA      W       3

For a given NA value in Weight, I want to replace it with the average of the non-NA values that have the same value for Hour. For example, the first value in Weight is NA. Its Hour value is 0, so I want to average the other Weights where Hour is 0 (those values being 4 and 3), and then replace the NA with the computed average (3.5).

As an R beginner, I'd like to see a clear, multistep process for this. (I'm posing this as a learning exercise rather than a specific "solve this problem" type question. I'm not interested in who can do it in the fewest characters.)

You can use ave for such operations:

dat$Weight <- 
ave(dat$Weight,dat$Hour,FUN=function(x){
  mm <- mean(x,na.rm=TRUE)
  ifelse(is.na(x),mm,x)
})
  • You apply a function to each group of rows that share the same Hour.
  • For each group, you compute the mean without the missing values.
  • You assign the mean if the value is missing; otherwise you keep the original value.
  • You replace the Weight vector with the newly created vector.
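
The steps in the list above can also be spelled out one at a time in base R, without ave. This is a minimal sketch, assuming the question's table is rebuilt by hand as dat; the intermediate names hour_means, fill, and missing are illustrative and do not come from the answer.

```r
# Rebuild the example data from the question (assumed construction).
dat <- data.frame(
  Weight = c(NA, NA, 2, 1, 4, 5, NA, 2, 3, 3, 1, NA),
  Day    = rep(c("M", "T", "W"), each = 4),
  Hour   = rep(0:3, times = 3)
)

# Step 1: mean of the non-missing weights for each Hour.
# The result is named by Hour value: "0", "1", "2", "3".
hour_means <- tapply(dat$Weight, dat$Hour, mean, na.rm = TRUE)

# Step 2: look up the mean that belongs to each row's Hour.
fill <- as.vector(hour_means[as.character(dat$Hour)])

# Step 3: find the rows whose Weight is missing.
missing <- is.na(dat$Weight)

# Step 4: overwrite only those rows; all other rows stay untouched.
dat$Weight[missing] <- fill[missing]

dat
```

Inspecting hour_means after step 1 is a quick way to confirm the group averages (3.5 for Hour 0, as in the question) before anything is overwritten.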

You could also use data.table:

library(data.table)
# Note: the result holds only Hour and Weight (Day is dropped), with rows grouped by Hour
setDT(dat)[, list(Weight = replace(Weight, is.na(Weight),
                                   mean(Weight, na.rm = TRUE))), by = Hour]

Or, to update Weight in place and keep every column:

setDT(dat)[, Weight1:=mean(Weight, na.rm=TRUE), by=Hour][,
              Weight:=ifelse(is.na(Weight), Weight1, Weight)][, Weight1:=NULL]

Here's a dplyr solution. It is both very fast and easy to understand (because of its piped structure), so it could be a good start for a beginner. Assuming df is your data set:

library(dplyr)
df %>% # Select your data set
  group_by(Hour) %>% # Group by Hour
  mutate(Weight = ifelse(is.na(Weight), 
                         mean(Weight, na.rm = TRUE), 
                         Weight)) # Replace all NAs with the mean
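
As written, the pipeline above prints the result without storing it. Here is a minimal sketch of keeping the imputed table, assuming the question's data is rebuilt by hand as df; the name df_filled is illustrative and not part of the answer.

```r
library(dplyr)

# Rebuild the example data from the question (assumed construction).
df <- data.frame(
  Weight = c(NA, NA, 2, 1, 4, 5, NA, 2, 3, 3, 1, NA),
  Day    = rep(c("M", "T", "W"), each = 4),
  Hour   = rep(0:3, times = 3)
)

df_filled <- df %>%
  group_by(Hour) %>%                      # one group per Hour value
  mutate(Weight = ifelse(is.na(Weight),
                         mean(Weight, na.rm = TRUE),
                         Weight)) %>%     # fill NAs with the group mean
  ungroup()                               # drop the grouping afterwards

df_filled
```

The ungroup() call is optional here; it only ensures that later operations on df_filled are not silently applied per Hour.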
