Imputing multiple columns in R using mutate_at
I have a large medical data frame that I want to use for ML. As such, I have to impute missing values. For the continuous variables I would like to fill in the median value, like so:
dat$First_Wbc <- ifelse(is.na(dat$First_Wbc), median2(dat$First_Wbc), dat$First_Wbc)
I want to write code using mutate_at that does the same as the code above, but for multiple variables at a time. I know it's possible, but so far I haven't been able to format it correctly. Can you please help me?
Note: median2() is a function identical to median(), except that it ignores missing values.
You can select columns by position:
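For readers who want to run the snippets in this thread: the post does not show median2() itself, so the definition below is only an assumed sketch based on the description above (median() with missing values ignored).

```r
# Assumed definition; the original post does not include the real median2().
median2 <- function(x) median(x, na.rm = TRUE)

# Quick check on a vector with one missing value.
result <- median2(c(4, NA, 10, 6))
```

Any function with the same behaviour would work equally well in the answers that follow.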
library(dplyr)
df %>% mutate_at(2:4, ~replace(., is.na(.), median2(.)))
Or by a range of columns:
df %>% mutate_at(vars(a:d), ~replace(., is.na(.), median2(.)))
Or using a specific pattern in the column names:
df %>% mutate_at(vars(starts_with('col')), ~replace(., is.na(.), median2(.)))
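In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The pattern-based selection above could be sketched with across() as follows. The toy data frame and the median2() definition are assumptions made up for illustration, not part of the original question.

```r
library(dplyr)

# Hypothetical stand-in for the asker's helper: median that skips NAs.
median2 <- function(x) median(x, na.rm = TRUE)

# Toy data frame, invented for illustration.
df <- data.frame(
  id   = 1:5,
  col1 = c(1, NA, 3, 4, 5),
  col2 = c(NA, 10, 20, NA, 30)
)

# Impute the median into every column whose name starts with "col";
# the id column is left untouched.
imputed <- df %>%
  mutate(across(starts_with("col"), ~ replace(.x, is.na(.x), median2(.x))))
```

The same selection helpers (positions, ranges such as a:d, starts_with()) can be used inside across(), so the three mutate_at variants above translate directly.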
Base R solution:
dat[, sapply(dat, is.numeric)] <- lapply(dat[, sapply(dat, is.numeric)],
  function(x) {
    # replace missing values with the column median
    ifelse(is.na(x), median(x, na.rm = TRUE), x)
  }
)
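To see that the base R approach only touches numeric columns, here is a small worked sketch. The data frame and its column names are invented for illustration, and replace() is used in place of ifelse(), which gives the same result for plain numeric vectors.

```r
# Toy data, invented for illustration; 'site' is non-numeric and should stay as-is.
dat <- data.frame(
  site = c("A", "B", NA, "D"),
  wbc  = c(5.2, NA, 7.1, 6.0),
  age  = c(40, 55, NA, 61),
  stringsAsFactors = FALSE
)

# Compute the logical mask of numeric columns once and reuse it.
num_cols <- sapply(dat, is.numeric)
dat[num_cols] <- lapply(dat[num_cols], function(x) {
  replace(x, is.na(x), median(x, na.rm = TRUE))
})
```

After this runs, the missing wbc and age values are filled with their column medians, while the missing site value remains NA.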
Tidyverse using mutate_if:
library(tidyverse)
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
df %>%
  mutate_if(is.numeric, ~replace(., is.na(.), median(., na.rm = TRUE)))
We can use mutate_if with na.aggregate:
library(dplyr)
library(zoo)
df %>%
mutate_if(is.numeric, na.aggregate, FUN = median)
Speaking of tidy solutions, I really like the naniar package; it provides many useful methods for working with missing data.
E.g., to impute medians in all numeric columns you could do:
library(tidyverse)
library(naniar)
df %>%
impute_median_if(is.numeric)
More added value comes with impute_median_all(), impute_mean_if() and many great missing data visualizations.