Imputation for missing strings in R
I have a large dataset in which about 20% of the string values are missing.
NAME | AREA
--------------------------
Andy | Sales
Andy | NA
Andy | Sales
Andy | Sales
Andy | NA
Andy | Sales
Sandy | Construction
Sandy | Construction
Sandy | NA
Sandy | Construction
Sandy | Construction
Wendy | Planting
Wendy | Driving
Wendy | NA
Wendy | NA
Wendy | NA
For most of my data it is fairly obvious that Andy works in Sales and Sandy works in Construction. But we cannot be sure about Wendy.
My ideal result is:
NAME | AREA
--------------------------
Andy | Sales
Andy | Sales
Andy | Sales
Andy | Sales
Andy | Sales
Andy | Sales
Sandy | Construction
Sandy | Construction
Sandy | Construction
Sandy | Construction
Sandy | Construction
Wendy | Planting
Wendy | Driving
Wendy | NA
Wendy | NA
Wendy | NA
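Which imputation package is best suited for this? Or perhaps you have a better solution?
Thanks in advance!
Maybe you can try a conditional fill based on the number of distinct values in each group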
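library(dplyr)
df %>%
  group_by(NAME) %>%
  # fill only when the group has exactly one distinct non-missing AREA;
  # take the first non-NA value so a leading NA is never used as the fill
  mutate(AREA = if (n_distinct(AREA, na.rm = TRUE) == 1) first(AREA[!is.na(AREA)]) else AREA)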
# NAME AREA
# <fct> <fct>
# 1 Andy Sales
# 2 Andy Sales
# 3 Andy Sales
# 4 Andy Sales
# 5 Andy Sales
# 6 Andy Sales
# 7 Sandy Construction
# 8 Sandy Construction
# 9 Sandy Construction
#10 Sandy Construction
#11 Sandy Construction
#12 Wendy Planting
#13 Wendy Driving
#14 Wendy NA
#15 Wendy NA
#16 Wendy NA
Data
df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy",
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L,
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label =
c("Construction", "Driving", "Planting", "Sales"),
class = "factor")), class = "data.frame", row.names = c(NA, -16L))
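The same conditional fill can also be sketched in base R, without extra packages. This is only an illustration: the helper name `fill_if_unambiguous` is hypothetical, and the example data is rebuilt with character columns rather than factors to keep it short.

```r
# Example data mirroring the question (character columns for simplicity)
df <- data.frame(
  NAME = c(rep("Andy", 6), rep("Sandy", 5), rep("Wendy", 5)),
  AREA = c("Sales", NA, "Sales", "Sales", NA, "Sales",
           "Construction", "Construction", NA, "Construction", "Construction",
           "Planting", "Driving", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NAs only when the group has exactly one
# distinct non-missing value; otherwise return the group unchanged
fill_if_unambiguous <- function(x) {
  known <- unique(x[!is.na(x)])
  if (length(known) == 1) x[is.na(x)] <- known
  x
}

# ave() applies the helper per NAME and keeps the original row order
df$AREA <- ave(df$AREA, df$NAME, FUN = fill_if_unambiguous)
df
```

Andy and Sandy are filled because each has a single known area; Wendy's missing rows stay NA because she has two.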
You can use the mice package. It is highly customizable, but a simple implementation would be:
library(dplyr)  # provides mutate() and %>%
library(mice)
# dt is the data frame to impute; make sure AREA is a categorical variable
dt <- mutate(dt, AREA = as.factor(AREA))
imputed_dt <- mice(dt) %>% complete()
In this basic example, mice will also try to impute values for Wendy, so you should dig deeper into the documentation.
Here is a data.table option
library(data.table)
setDT(df)[, AREA := if(uniqueN(AREA, na.rm = TRUE) == 1)
first(AREA[!is.na(AREA)]) else AREA, NAME]
df
# NAME AREA
# 1: Andy Sales
# 2: Andy Sales
# 3: Andy Sales
# 4: Andy Sales
# 5: Andy Sales
# 6: Andy Sales
# 7: Sandy Construction
# 8: Sandy Construction
# 9: Sandy Construction
#10: Sandy Construction
#11: Sandy Construction
#12: Wendy Planting
#13: Wendy Driving
#14: Wendy <NA>
#15: Wendy <NA>
#16: Wendy <NA>
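The grouped fills shown here only act when a name has exactly one distinct area. If a large dataset contains occasional conflicting entries, a looser variant could fill with the dominant value when it covers a large enough share of the known rows. The sketch below is an assumption-laden illustration, not part of the original answers: `fill_dominant`, the `min_share` cutoff of 0.8, and the example data (including the stray "Marketing" row) are all hypothetical.

```r
# Hypothetical example: Andy has one conflicting entry, Wendy is truly ambiguous
example_df <- data.frame(
  NAME = c(rep("Andy", 8), rep("Wendy", 4)),
  AREA = c("Sales", "Sales", "Marketing", "Sales", "Sales", "Sales", NA, NA,
           "Planting", "Driving", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NAs with the most frequent known value, but only
# if that value makes up at least `min_share` of the non-missing entries
fill_dominant <- function(x, min_share = 0.8) {
  known <- x[!is.na(x)]
  if (length(known) == 0) return(x)        # nothing to learn from
  counts <- table(known)
  top <- names(counts)[which.max(counts)]  # most frequent known value
  if (max(counts) / length(known) >= min_share) x[is.na(x)] <- top
  x
}

example_df$AREA <- ave(example_df$AREA, example_df$NAME,
                       FUN = function(x) fill_dominant(x, min_share = 0.8))
example_df
```

The cutoff is a judgment call: a lower value fills more rows but risks spreading a wrong label, so it is worth checking against names whose area is already known.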