Impute missing values with the average of the remaining values

Question

I have a data frame of the following form:

Weight  Day     Hour
NA      M       0
NA      M       1
2       M       2
1       M       3
4       T       0
5       T       1
NA      T       2
2       T       3
3       W       0
3       W       1
1       W       2
NA      W       3

For a given NA value in Weight, I want to replace it with the average of the non-NA values that have the same value for Hour. For example, the first value in Weight is NA. Its Hour value is 0, so I want to average the other Weights where Hour is 0 (those values being 4 and 3). I then want to replace the NA with the computed average (3.5).
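The arithmetic in that example can be checked directly in base R. This is only a sketch: the data frame name `dat` and the way it is built here are assumptions, since the question shows the table but not the code that created it.

```r
# Rebuild the sample table from the question (the name `dat` is an assumption)
dat <- data.frame(
  Weight = c(NA, NA, 2, 1, 4, 5, NA, 2, 3, 3, 1, NA),
  Day    = rep(c("M", "T", "W"), each = 4),
  Hour   = rep(0:3, times = 3)
)

# Non-missing weights recorded at Hour 0
hour0 <- dat$Weight[dat$Hour == 0 & !is.na(dat$Weight)]
hour0        # 4 3
mean(hour0)  # 3.5

# The same mean for every hour at once, ignoring NAs
hour_means <- tapply(dat$Weight, dat$Hour, mean, na.rm = TRUE)
hour_means
```

The `hour_means` vector holds the replacement value for each Hour, which is what any of the approaches has to compute in some form.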

As an R beginner, I'd like to see a clear, multistep process for this. (I'm posing this as a learning exercise rather than a specific "solve this problem" type question. I'm not interested in who can do it in the fewest characters.)

Answer 1

You can use ave for such operations.

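dat$Weight <- 
ave(dat$Weight,dat$Hour,FUN=function(x){
  # x holds the weights of one Hour group
  mm <- mean(x,na.rm=TRUE)   # group mean, ignoring NAs
  ifelse(is.na(x),mm,x)      # fill NAs with the mean, keep the rest
})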

The function is applied to each group of hours.
For each group, you compute the mean, ignoring missing values.
You assign that mean where the value is missing; otherwise you keep the original value.
You replace the Weight vector with the newly created vector.
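To make those four steps visible one at a time, here is a rough base R sketch that does by hand what ave does internally. It swaps in split, lapply and unsplit for illustration and is not part of the original answer; the construction of `dat` is likewise an assumption based on the table in the question.

```r
# Sample data from the question (construction is an assumption)
dat <- data.frame(
  Weight = c(NA, NA, 2, 1, 4, 5, NA, 2, 3, 3, 1, NA),
  Day    = rep(c("M", "T", "W"), each = 4),
  Hour   = rep(0:3, times = 3)
)

# Step 1: split the weights into one vector per hour
groups <- split(dat$Weight, dat$Hour)

# Steps 2 and 3: per group, compute the mean without NAs and fill the NAs with it
filled <- lapply(groups, function(x) {
  mm <- mean(x, na.rm = TRUE)
  x[is.na(x)] <- mm
  x
})

# Step 4: put the values back in the original row order and overwrite Weight
dat$Weight <- unsplit(filled, dat$Hour)
dat$Weight
```

One thing to watch for: if every weight for some hour is missing, mean(x, na.rm = TRUE) returns NaN, so that group would be filled with NaN rather than a number.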

Answer 2

You could also use data.table:

library(data.table)
# Returns a new table containing only Hour and Weight, with NAs replaced by the group mean
setDT(dat)[, list(Weight=replace(Weight, is.na(Weight),
       mean(Weight, na.rm=TRUE))), by=Hour]

Or:

# Add a temporary column with each hour's mean, fill the NAs from it, then drop it
setDT(dat)[, Weight1:=mean(Weight, na.rm=TRUE), by=Hour][,
              Weight:=ifelse(is.na(Weight), Weight1, Weight)][, Weight1:=NULL]

Answer 3

Here's a dplyr solution. It is both very fast and easy to understand (because of its piped structure), so it could be a good start for a beginner. Assuming df is your data set:

library(dplyr)
df %>% # Select your data set
  group_by(Hour) %>% # Group by Hour
  mutate(Weight = ifelse(is.na(Weight), 
                         mean(Weight, na.rm = TRUE), 
                         Weight)) # Replace all NAs with the mean
