Assigning categorical values to NAs randomly or proportionally
I have a dataset:
df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
"male"), Division = c("South Atlantic", "East North Central",
"Pacific", "East North Central", "South Atlantic", "South Atlantic",
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538,
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn",
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I need to perform an analysis that cannot have NA values in the gender variable. The other columns are too few and have no known predictive value, so imputing the values isn't really possible.
I can perform the analysis by removing the incomplete observations entirely (they are about 4% of the dataset), but I'd like to see the results when female or male is randomly assigned to the missing cases.
Other than writing some pretty ugly code to filter to just the incomplete cases, split them in two, and replace the NAs with female or male in each half, I wondered if there was an elegant way to randomly or proportionally assign values to the NAs?
We can use ifelse and is.na to check whether a value is NA, and then use sample to randomly choose between female and male.
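# Draw one value per row; a single draw (size 1) would be recycled, giving every NA the same value
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)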
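The question also asks about proportional assignment. A minimal sketch, assuming "proportional" means matching the female/male shares observed among the non-missing rows; the seed and the variable names (missing, props, filled) are illustrative and not from the original post:

```r
# The gender column from the question, as a plain character vector.
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(42)  # assumption: fixed seed only so the draw is reproducible

missing <- is.na(gender)

# Observed shares among the non-missing values (table() drops NA by default).
props <- prop.table(table(gender))

# One independent draw per missing value, weighted by the observed shares.
filled <- gender
filled[missing] <- sample(names(props), size = sum(missing),
                          replace = TRUE, prob = as.numeric(props))
```

With the data frame from the question, the same steps would apply to df$gender. Because the draws are random, rerunning with different seeds shows how sensitive the downstream analysis is to the assignment.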
How about this:
> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
+ "male"),
+ Division = c("South Atlantic", "East North Central",
+ "Pacific", "East North Central", "South Atlantic", "South Atlantic",
+ "Pacific"),
+ Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+ 107683.9118, 56149.3217, 46237.265),
+ first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+ row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
>
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
>
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
>
> df$gender
[1] "female" "male" "female" "female" "male" "male" "male"
>
That is random with a given probability (here, 0.52 for male). You could also consider imputing values using nearest neighbors, hot deck, or similar.
Hope it helps.
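For illustration, a random hot-deck fill could look roughly like the sketch below. It assumes Division is used as the donor group, which is a choice made here and not something the answer specifies; hot_deck is a hypothetical helper, not a function from any package:

```r
# The two relevant columns from the question's data.
df <- data.frame(
  gender = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific",
               "East North Central", "South Atlantic", "South Atlantic",
               "Pacific"),
  stringsAsFactors = FALSE
)

set.seed(1)  # assumption: fixed seed only for reproducibility

# Hypothetical helper: fill each NA with a randomly chosen observed value
# from the same group, falling back to all observed values if the group
# has no donors.
hot_deck <- function(x, group) {
  observed <- !is.na(x)
  pool <- x[observed]
  stopifnot(length(pool) > 0)
  out <- x
  for (i in which(!observed)) {
    donors <- x[observed & group == group[i]]
    if (length(donors) == 0) donors <- pool
    # sample.int avoids sample()'s special handling of length-1 input.
    out[i] <- donors[sample.int(length(donors), 1)]
  }
  out
}

df$gender <- hot_deck(df$gender, df$Division)
```

In this seven-row example each missing row has exactly one donor in its Division (both "male"), so the result is deterministic; on the full dataset the draw within each group would be random.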
Just assign:
df$gender[is.na(df$gender)]=sample(c("female", "male"), dim(df)[1], replace = TRUE)[is.na(df$gender)]