Impute variables within a data.frame, grouped by a factor column
I have a data.frame containing numeric columns, along with a factor column whose levels define the groups I want to impute missing values by. Let me explain.
part id value
a 1 23.4
a 2 23.8
a 3 45.6
a 4 34.7
a 5 Na
b 1 45.2
b 2 34.6
b 3 Na
b 4 30.9
b 5 28.1
I'd like to impute the NA values with the mean of the part. So for part a, I'd like to impute the missing value for id 5 with the mean of ids 1-4 in part a, and likewise for part b, impute the missing value for id 3 with the mean of the other ids in part b, and so on.
I need to do this across many columns (imagine having many more value columns), so perhaps an apply with a function, etc.
Using the na.strings argument in read.table/read.csv, we can convert the missing values to real NA and thereby read the 'value' columns as 'numeric'.
With dplyr, we can replace the NAs in multiple value columns with the mean of that column within each part.
library(dplyr)
df1 %>%
  group_by(part) %>%
  # across() supersedes the deprecated mutate_each()/funs() pair
  mutate(across(starts_with("value"),
                ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))
Or a similar option with data.table:
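library(data.table)
# positions of the columns whose names contain 'value'
nm1 <- grep('value', names(df1))
# replace NAs by the group mean, updating df1 by reference
setDT(df1)[, (nm1) := lapply(.SD, function(x)
    replace(x, which(is.na(x)), mean(x, na.rm = TRUE))),
  by = part, .SDcols = nm1]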
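If neither package is available, the same idea can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This is not part of the answer above, only an illustration: the data frame df2, its columns value1 and value2, and the helper impute_mean are made up to show the "many value columns" case.

```r
# Hypothetical example data with two value columns (not from the question).
df2 <- data.frame(
  part   = c("a", "a", "a", "b", "b", "b"),
  id     = c(1, 2, 3, 1, 2, 3),
  value1 = c(1, 3, NA, 10, NA, 20),
  value2 = c(NA, 4, 6, 5, 7, NA)
)

# Hypothetical helper: replace NAs in a vector with the mean of its
# non-missing entries.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Names of the columns to impute.
value_cols <- grep("^value", names(df2), value = TRUE)

# ave() runs impute_mean separately for each level of df2$part.
df2[value_cols] <- lapply(df2[value_cols], function(col) {
  ave(col, df2$part, FUN = impute_mean)
})
df2
```

One caveat of any mean-based fill: if a part has no observed values at all in a column, mean(x, na.rm = TRUE) returns NaN, so those entries stay missing and need separate handling.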
df1 <- read.table(text="part id value
a 1 23.4
a 2 23.8
a 3 45.6
a 4 34.7
a 5 Na
b 1 45.2
b 2 34.6
b 3 Na
b 4 30.9
b 5 28.1", header=TRUE, na.strings="Na", stringsAsFactors=FALSE)
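The na.strings argument in the call above matters because the missing entries are spelled "Na", which read.table does not recognise as missing by default. The sketch below, using the same data, compares the two readings and computes the per-part means that the imputation should fill in. The object names txt, raw, ok and part_means are only illustrative.

```r
txt <- "part id value
a 1 23.4
a 2 23.8
a 3 45.6
a 4 34.7
a 5 Na
b 1 45.2
b 2 34.6
b 3 Na
b 4 30.9
b 5 28.1"

# Without na.strings, "Na" is ordinary text, so the whole column is
# read as character.
raw <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
class(raw$value)

# With na.strings = "Na", those entries become NA and the column is numeric.
ok <- read.table(text = txt, header = TRUE, na.strings = "Na",
                 stringsAsFactors = FALSE)
class(ok$value)

# Per-part means of the observed values, i.e. the values used for the fill.
part_means <- tapply(ok$value, ok$part, mean, na.rm = TRUE)
part_means
```

For this data the fill values work out to 31.875 for part a (the mean of 23.4, 23.8, 45.6 and 34.7) and 34.7 for part b (the mean of 45.2, 34.6, 30.9 and 28.1).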