Randomly or proportionally assign categorical values to NAs

Question

I have a dataset:

df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
"male"), Division = c("South Atlantic", "East North Central", 
"Pacific", "East North Central", "South Atlantic", "South Atlantic", 
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538, 
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn", 
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

I need to perform an analysis that can't have NA values in the gender variable. The other columns are too few and have no known predictive value, so imputing the values isn't really possible.

I can perform the analysis by removing the incomplete observations entirely (they are about 4% of the dataset), but I'd like to see the results with female or male randomly assigned to the missing cases.

Other than writing some pretty ugly code to filter to just the incomplete cases, split them in two, and replace the NAs with female or male in each half, I wondered if there is an elegant way to randomly or proportionally assign values to the NAs?

Answer 1

We can use ifelse and is.na to find the NA entries, then use sample to choose randomly between female and male.

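# Draw one value per row; a single draw (size 1) would be recycled, giving every NA the same value
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)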
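The question also asks about assigning values proportionally. A minimal sketch of one way to read that, assuming "proportionally" means matching the female/male split seen in the non-missing rows: pass the observed proportions to sample through its prob argument. The standalone gender vector and the seed value are only for illustration.

```r
# Toy gender column matching the question's data (2 of 7 values missing)
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(42)  # arbitrary seed, only to make the draw repeatable

miss <- is.na(gender)

# Observed proportions among the non-missing values
# (table() drops NA by default): female = 0.2, male = 0.8
p <- prop.table(table(gender))

# Draw one value per missing entry, weighted by those proportions
gender[miss] <- sample(names(p), sum(miss), replace = TRUE,
                       prob = as.vector(p))
```

With only a few missing values the filled-in split will not match the observed proportions exactly; the weights only set the probability of each draw.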

Answer 2

How about this:

> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
+                                 "male"),
+                      Division = c("South Atlantic", "East North Central", 
+                                   "Pacific", "East North Central", "South Atlantic", "South Atlantic", 
+                                   "Pacific"),
+                      Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+                                 107683.9118, 56149.3217, 46237.265),
+                      first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+                 row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
> 
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
> 
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
> 
> df$gender
[1] "female" "male"   "female" "female" "male"   "male"   "male"  
>

That is random with a given probability (here, 0.52 for male). You could also consider imputing values using nearest neighbors, hot deck, or similar.
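The hot-deck idea can be sketched roughly as follows: each missing value is replaced by the value of a randomly chosen observed "donor" from the same group. This is only a sketch under assumptions: hot_deck is a hypothetical helper, not a function from any package, and using Division as the donor group is an illustrative choice (the question states the other columns have no known predictive value).

```r
df <- data.frame(
  gender   = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific",
               "East North Central", "South Atlantic", "South Atlantic",
               "Pacific"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill the NAs in x with values drawn from the
# observed (non-missing) entries of x
hot_deck <- function(x) {
  miss   <- is.na(x)
  donors <- x[!miss]
  if (any(miss) && length(donors) > 0) {
    # Sample donor positions so a single donor is handled correctly
    x[miss] <- donors[sample.int(length(donors), sum(miss), replace = TRUE)]
  }
  x
}

set.seed(1)  # arbitrary seed for repeatability

# Apply the helper separately within each Division
df$gender <- ave(df$gender, df$Division, FUN = hot_deck)

# A Division with no observed donors keeps its NAs; fall back to
# donors from the whole column
df$gender <- hot_deck(df$gender)
```

In this toy data each Division containing an NA has exactly one observed donor, and it is "male", so both missing values become "male" regardless of the seed.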

Hope it helps.

Answer 3

Just assign:

df$gender[is.na(df$gender)]=sample(c("female", "male"), dim(df)[1], replace = TRUE)[is.na(df$gender)]
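Because the fill is random, the imputed values (and any analysis built on them) will change from run to run. A minimal sketch of making the fill repeatable, assuming that matters for the analysis: wrap it in a function that sets the seed first. fill_random is a hypothetical helper and 123 is an arbitrary seed.

```r
gender <- c("female", "male", NA, NA, "male", "male", "male")

# Hypothetical helper: random fill of the NAs, repeatable via the seed
fill_random <- function(x, seed) {
  set.seed(seed)  # fix the random number generator state
  miss <- is.na(x)
  # One independent draw per missing entry
  x[miss] <- sample(c("female", "male"), sum(miss), replace = TRUE)
  x
}

run1 <- fill_random(gender, seed = 123)
run2 <- fill_random(gender, seed = 123)

identical(run1, run2)  # TRUE: same seed, same fill
```

Calling the helper with several different seeds gives a quick way to see how sensitive the results are to the random assignment.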

3 answers

Answer 1
Score 4, accepted, 2019-02-23 20:55:58

Answer 2
Score 4, 2019-02-23 21:01:20

Answer 3
Score 3, 2019-02-23 21:03:09
