
How to add missing rows to a data frame

I have the following data frame, which shows the distribution of two values (yes and no) for the cohorts 2010 to 2020.

df <- structure(list(var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), levels = c("AAA", "BBB", "CCC", "DDD", 
"EEE", "FFF", 
"GGG", "HHH", "III", "JJJ", "KKK"
), class = "factor"), cohort = structure(c(1L, 1L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 9L, 9L, 10L, 10L, 11L, 11L), levels = c("2010", 
"2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", 
"2019", "2020"), class = "factor"), var2use = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("yes", 
"no"), label = c(rsz = "blabla"), class = c("labelled", 
"factor")), n = c(10L, 8L, 19L, 13L, 24L, 28L, 19L, 21L, 21L, 
16L, 23L, 13L, 38L, 25L, 24L, 28L), proportion = c(0.555555555555556, 
0.444444444444444, 0.59375, 0.40625, 0.461538461538462, 0.538461538461538, 
0.475, 0.525, 0.567567567567568, 0.432432432432432, 0.638888888888889, 
0.361111111111111, 0.603174603174603, 0.396825396825397, 0.461538461538462, 
0.538461538461538)), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -16L), groups = structure(list(
    var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L, 11L, 11L, 
    11L), levels = c("AAA", 
"BBB", "CCC", "DDD", 
"EEE", "FFF", 
"GGG", "HHH", "III", "JJJ", "KKK"), class = "factor"), cohort = structure(c(1L, 
    2L, 3L, 4L, 5L, 9L, 10L, 11L), levels = c("2010", "2011", 
    "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", 
    "2020"), class = "factor"), .rows = structure(list(1:2, 3:4, 
        5:6, 7:8, 9:10, 11:12, 13:14, 15:16), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))

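Unfortunately, some cohorts have no rows at all (here 2015, 2016 and 2017). I am looking for a way to add the missing rows to the data set automatically, with the columns n and proportion set to NA.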

Maybe complete() from the tidyr package could be used here?

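You can take the range of cohort years, use summarize() to expand the data set to every year in that range, and then left-join the original data set onto it: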

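library(dplyr)

# Drop the grouping stored in the data frame before reshaping
df <- ungroup(df)

# First and last cohort year, taken from the factor levels (2010 and 2020)
yrs <- range(as.numeric(levels(df$cohort)))

# One row per var2kreuz/var2use/year, then join the observed values back on
# (dplyr >= 1.1.0 deprecates multi-row summarize(); reframe() replaces it there)
unique(df[, c("var2kreuz", "var2use")]) %>% 
  group_by(var2kreuz, var2use) %>% 
  summarize(cohort = factor(yrs[1]:yrs[2]), .groups = "drop") %>% 
  left_join(df, by = c("var2kreuz", "var2use", "cohort"))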
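
The same expand-then-join idea can also be sketched in base R, without dplyr. This is a minimal illustration on a small made-up data frame rather than the df from the question: the names toy, group, answer, grid and filled are assumptions chosen for the example, and the cohort column is assumed to be an integer year instead of a factor.

```r
# Toy data (hypothetical): cohort 2012 is missing for both answers
toy <- data.frame(
  group  = "KKK",
  answer = rep(c("yes", "no"), times = 2),
  cohort = rep(c(2011L, 2013L), each = 2),
  n      = c(19L, 13L, 19L, 21L),
  stringsAsFactors = FALSE
)

# Build every group/answer/cohort combination over the full year range
yrs  <- range(toy$cohort)
grid <- expand.grid(
  group  = unique(toy$group),
  answer = unique(toy$answer),
  cohort = seq(yrs[1], yrs[2]),
  stringsAsFactors = FALSE
)

# Left join: combinations that are absent from `toy` get NA in n
filled <- merge(grid, toy, by = c("group", "answer", "cohort"), all.x = TRUE)
filled <- filled[order(filled$answer, filled$cohort), ]
print(filled)
```

Here merge(..., all.x = TRUE) plays the role of left_join(): every row of the grid is kept, and the value columns are filled with NA where the original data has no match.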

Alternatively, you can use complete() like this:

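library(dplyr); library(tidyr)
# Convert to character so complete() does not expand the unused var2kreuz levels
df %>% ungroup() %>% mutate(across(c(var2kreuz, var2use), as.character)) %>% 
  complete(var2kreuz, var2use, cohort)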

Output:

   var2kreuz var2use cohort  n proportion
1        KKK     yes   2010 10  0.5555556
2        KKK     yes   2011 19  0.5937500
3        KKK     yes   2012 24  0.4615385
4        KKK     yes   2013 19  0.4750000
5        KKK     yes   2014 21  0.5675676
6        KKK     yes   2015 NA         NA
7        KKK     yes   2016 NA         NA
8        KKK     yes   2017 NA         NA
9        KKK     yes   2018 23  0.6388889
10       KKK     yes   2019 38  0.6031746
11       KKK     yes   2020 24  0.4615385
12       KKK      no   2010  8  0.4444444
13       KKK      no   2011 13  0.4062500
14       KKK      no   2012 28  0.5384615
15       KKK      no   2013 21  0.5250000
16       KKK      no   2014 16  0.4324324
17       KKK      no   2015 NA         NA
18       KKK      no   2016 NA         NA
19       KKK      no   2017 NA         NA
20       KKK      no   2018 13  0.3611111
21       KKK      no   2019 25  0.3968254
22       KKK      no   2020 28  0.5384615
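
The complete() call above can fill in 2015 to 2017 because cohort is a factor whose levels still list every year from 2010 to 2020. If the cohort column were a plain numeric year instead, the missing years would have to be generated explicitly, for example with tidyr's full_seq(). The following is a minimal sketch under that assumption, using a small made-up data frame (toy and filled are hypothetical names, not part of the question):

```r
library(tidyr)

# Toy data (hypothetical): numeric cohort, 2012 is missing for both answers
toy <- data.frame(
  var2kreuz = "KKK",
  var2use   = rep(c("yes", "no"), times = 2),
  cohort    = rep(c(2011, 2013), each = 2),
  n         = c(19L, 13L, 19L, 21L)
)

# full_seq() spans min(cohort) to max(cohort) in steps of 1, so the gap at
# 2012 becomes explicit; the new rows get NA in n
filled <- complete(toy, var2kreuz, var2use,
                   cohort = full_seq(cohort, period = 1))
print(filled)
```

Note that full_seq() only covers the range actually present in the data, so years missing before the first or after the last observed cohort would still need to be supplied by hand, for example as cohort = 2010:2020.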
