
Imputation for missing strings in R

I have a large dataset in which about 20% of the string values are missing.

NAME      |  AREA
--------------------------
Andy      |  Sales
Andy      |  NA
Andy      |  Sales
Andy      |  Sales
Andy      |  NA
Andy      |  Sales
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  NA
Sandy     |  Construction
Sandy     |  Construction
Wendy     |  Planting
Wendy     |  Driving
Wendy     |  NA
Wendy     |  NA
Wendy     |  NA

For most of my data it is fairly obvious: Andy works in Sales and Sandy works in Construction. But we cannot be sure about Wendy.

My ideal result would be:

NAME      |  AREA
--------------------------
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Wendy     |  Planting
Wendy     |  Driving
Wendy     |  NA
Wendy     |  NA
Wendy     |  NA

Which imputation package is best suited to this? Or perhaps you have a better solution?

Thanks in advance!

Perhaps you could try a conditional fill based on the number of distinct values in each group:

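library(dplyr)

# Fill a group only when it has exactly one distinct non-NA value.
# Take the first non-NA value so that a leading NA is never propagated.
df %>%
  group_by(NAME) %>%
  mutate(AREA = if (n_distinct(AREA, na.rm = TRUE) == 1) first(AREA[!is.na(AREA)]) else AREA)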


#   NAME  AREA        
#   <fct> <fct>       
# 1 Andy  Sales       
# 2 Andy  Sales       
# 3 Andy  Sales       
# 4 Andy  Sales       
# 5 Andy  Sales       
# 6 Andy  Sales       
# 7 Sandy Construction
# 8 Sandy Construction
# 9 Sandy Construction
#10 Sandy Construction
#11 Sandy Construction
#12 Wendy Planting    
#13 Wendy Driving     
#14 Wendy NA          
#15 Wendy NA          
#16 Wendy NA      

Data

df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy", 
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L, 
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label = 
c("Construction", "Driving", "Planting", "Sales"), 
class = "factor")), class = "data.frame", row.names = c(NA, -16L))    
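For anyone who would rather avoid extra packages, the same rule (fill a group's NAs only when its observed values all agree) can be sketched in base R. The helper name `fill_if_unambiguous` is made up for this illustration, and the sketch works on character vectors rather than factors.

```r
# Hypothetical helper (not from any package): fill NAs in `value` within each
# `group`, but only when the group's observed (non-NA) values all agree.
fill_if_unambiguous <- function(value, group) {
  value <- as.character(value)   # work on strings, not factor codes
  group <- as.character(group)
  for (g in unique(group[!is.na(group)])) {
    idx      <- !is.na(group) & group == g
    observed <- unique(value[idx & !is.na(value)])
    if (length(observed) == 1) { # exactly one distinct observed value
      value[idx & is.na(value)] <- observed
    }
  }
  value
}

# Same values as the example data, rebuilt as plain character columns
df_chr <- data.frame(
  NAME = rep(c("Andy", "Sandy", "Wendy"), times = c(6, 5, 5)),
  AREA = c("Sales", NA, "Sales", "Sales", NA, "Sales",
           "Construction", "Construction", NA, "Construction", "Construction",
           "Planting", "Driving", NA, NA, NA),
  stringsAsFactors = FALSE
)

df_chr$AREA <- fill_if_unambiguous(df_chr$AREA, df_chr$NAME)
table(df_chr$NAME, df_chr$AREA, useNA = "ifany")
```

Andy and Sandy are filled because each has a single observed value; Wendy has two, so her NAs are left alone.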

You could use the mice package. It is highly customizable, but a simple implementation would be:

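library(mice)
library(dplyr)  # provides mutate() and %>%

df <- mutate(df, AREA = as.factor(AREA)) # make sure that AREA is a categorical variable

# mice::complete() is spelled out to avoid a clash with tidyr::complete()
imputed_df <- mice(df) %>% mice::complete()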

In this basic example, mice will try to impute values for Wendy as well, so you should dig deeper into the documentation.

Here is an option with data.table:

library(data.table)
setDT(df)[,  AREA := if(uniqueN(AREA, na.rm = TRUE) == 1) 
              first(AREA[!is.na(AREA)]) else AREA, NAME]
df
#     NAME         AREA
# 1:  Andy        Sales
# 2:  Andy        Sales
# 3:  Andy        Sales
# 4:  Andy        Sales
# 5:  Andy        Sales
# 6:  Andy        Sales
# 7: Sandy Construction
# 8: Sandy Construction
# 9: Sandy Construction
#10: Sandy Construction
#11: Sandy Construction
#12: Wendy     Planting
#13: Wendy      Driving
#14: Wendy         <NA>
#15: Wendy         <NA>
#16: Wendy         <NA>
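On a real dataset a group may be "almost" consistent rather than perfectly so, for example one stray value among many identical ones. One possible relaxation, sketched in base R below, fills a group's NAs with its most frequent value only if that value makes up at least a chosen share of the observed values. The helper `fill_if_dominant`, the `min_share` parameter, and the toy "Bob" data are all assumptions for illustration, not part of the answers above.

```r
# Hypothetical helper: fill a group's NAs with its most frequent observed
# value, but only if that value accounts for at least `min_share` of the
# group's observed (non-NA) values.
fill_if_dominant <- function(value, group, min_share = 0.8) {
  value <- as.character(value)
  group <- as.character(group)
  for (g in unique(group[!is.na(group)])) {
    idx    <- !is.na(group) & group == g
    counts <- table(value[idx])       # table() drops NAs by default
    if (length(counts) == 0) next     # nothing observed in this group
    top <- counts[which.max(counts)]  # first value on ties
    if (top / sum(counts) >= min_share) {
      value[idx & is.na(value)] <- names(top)
    }
  }
  value
}

# Toy input, made up for illustration: "Bob" has one stray value
name <- c(rep("Bob", 12), rep("Wendy", 4))
area <- c(rep("Sales", 9), "Driving", NA, NA,
          "Planting", "Driving", NA, NA)

filled <- fill_if_dominant(area, name, min_share = 0.8)
data.frame(name, area, filled)
```

Here Bob's two NAs become "Sales" (9 of his 10 observed values), while Wendy's stay NA because neither of her values is dominant. The threshold is a judgment call and worth checking against how noisy the real data is.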

