How to add missing rows to a data frame
I have the following data frame, which shows the distribution of two values (yes and no) across the cohorts 2010 to 2020.
df <- structure(list(var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), levels = c("AAA", "BBB", "CCC", "DDD",
"EEE", "FFF",
"GGG", "HHH", "III", "JJJ", "KKK"
), class = "factor"), cohort = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 9L, 9L, 10L, 10L, 11L, 11L), levels = c("2010",
"2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018",
"2019", "2020"), class = "factor"), var2use = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("yes",
"no"), label = c(rsz = "blabla"), class = c("labelled",
"factor")), n = c(10L, 8L, 19L, 13L, 24L, 28L, 19L, 21L, 21L,
16L, 23L, 13L, 38L, 25L, 24L, 28L), proportion = c(0.555555555555556,
0.444444444444444, 0.59375, 0.40625, 0.461538461538462, 0.538461538461538,
0.475, 0.525, 0.567567567567568, 0.432432432432432, 0.638888888888889,
0.361111111111111, 0.603174603174603, 0.396825396825397, 0.461538461538462,
0.538461538461538)), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -16L), groups = structure(list(
var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L), levels = c("AAA",
"BBB", "CCC", "DDD",
"EEE", "FFF",
"GGG", "HHH", "III", "JJJ", "KKK"), class = "factor"), cohort = structure(c(1L,
2L, 3L, 4L, 5L, 9L, 10L, 11L), levels = c("2010", "2011",
"2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019",
"2020"), class = "factor"), .rows = structure(list(1:2, 3:4,
5:6, 7:8, 9:10, 11:12, 13:14, 15:16), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
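Unfortunately, some cohorts have no corresponding values (here 2015, 2016 and 2017). I am looking for a way to automatically add the missing rows to the data set, with the columns n and proportion set to NA.
Maybe complete() from the tidyr package could be used here?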
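You can take the range of cohort years, use summarize() to expand the data set to every year in that range, and then left-join the original data onto it: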
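library(dplyr)

df <- ungroup(df)
# first and last cohort year, taken from the factor levels
yrs <- range(as.numeric(levels(df$cohort)))
# one row per (var2kreuz, var2use) pair, expanded to every year in the range;
# in dplyr >= 1.1.0, reframe() is the preferred verb for multi-row results
unique(df[, c(1, 3)]) %>%
  group_by(var2kreuz, var2use) %>%
  summarize(cohort = factor(yrs[1]:yrs[2])) %>%
  left_join(df)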
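The same expand-then-join idea can be sketched on a small made-up data frame. The names toy, grid, filled, group and answer are hypothetical and not part of the question's data, and base merge() is used here to build the grid instead of summarize():

```r
library(dplyr)

# Toy data (hypothetical): cohort is a factor whose levels include
# a year without any rows (2012).
toy <- data.frame(
  group  = rep("KKK", 4),
  answer = rep(c("yes", "no"), times = 2),
  cohort = factor(rep(c("2010", "2011"), each = 2),
                  levels = c("2010", "2011", "2012")),
  n      = c(10L, 8L, 19L, 13L)
)

# merge() without shared column names returns the Cartesian product:
# every observed (group, answer) pair crossed with every cohort level.
grid <- merge(
  unique(toy[, c("group", "answer")]),
  data.frame(cohort = factor(levels(toy$cohort), levels = levels(toy$cohort)))
)

# The left join keeps all grid rows; n is NA where toy had no matching row.
filled <- left_join(grid, toy, by = c("group", "answer", "cohort"))
filled
```

Naming the join columns in by avoids relying on left_join() guessing the keys from the shared column names.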
Alternatively, you can use complete() like this:
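library(dplyr); library(tidyr)
df %>% ungroup() %>%
  mutate(across(c(var2kreuz, var2use), as.character)) %>%  # keep only observed values, not all factor levels
  complete(var2kreuz, var2use, cohort)                     # cohort stays a factor, so every level gets a row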
Output:
   var2kreuz var2use cohort  n proportion
1        KKK     yes   2010 10  0.5555556
2        KKK     yes   2011 19  0.5937500
3        KKK     yes   2012 24  0.4615385
4        KKK     yes   2013 19  0.4750000
5        KKK     yes   2014 21  0.5675676
6        KKK     yes   2015 NA         NA
7        KKK     yes   2016 NA         NA
8        KKK     yes   2017 NA         NA
9        KKK     yes   2018 23  0.6388889
10       KKK     yes   2019 38  0.6031746
11       KKK     yes   2020 24  0.4615385
12       KKK      no   2010  8  0.4444444
13       KKK      no   2011 13  0.4062500
14       KKK      no   2012 28  0.5384615
15       KKK      no   2013 21  0.5250000
16       KKK      no   2014 16  0.4324324
17       KKK      no   2015 NA         NA
18       KKK      no   2016 NA         NA
19       KKK      no   2017 NA         NA
20       KKK      no   2018 13  0.3611111
21       KKK      no   2019 25  0.3968254
22       KKK      no   2020 28  0.5384615
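A possible variant, sketched on made-up data: wrapping the identifying columns in tidyr's nesting() restricts them to the combinations that actually occur, so the factors would not need converting to character first. The names toy, filled, group and answer are hypothetical and not taken from the question.

```r
library(tidyr)

# Toy data (hypothetical): group has an unused level ("AAA") and
# cohort has a level without any rows ("2012").
toy <- data.frame(
  group  = factor(rep("KKK", 4), levels = c("AAA", "KKK")),
  answer = factor(rep(c("yes", "no"), times = 2), levels = c("yes", "no")),
  cohort = factor(rep(c("2010", "2011"), each = 2),
                  levels = c("2010", "2011", "2012")),
  n      = c(10L, 8L, 19L, 13L)
)

# nesting() keeps only the observed (group, answer) pairs, while the bare
# factor cohort is expanded to all of its levels; new rows get NA in n.
filled <- complete(toy, nesting(group, answer), cohort)
filled
```

With this form the unused level "AAA" produces no extra rows, while the missing cohort "2012" is added once per observed pair.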