Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?
I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want one column for each combination of subject and possible grade (math_1, science_2, etc.).
I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs, and only one chunk per column has actual data. I'm curious to know either why map_df does this or how I can concatenate across rows to remove the NAs.
Here's my code:
library(tidyverse)
#Set up - generate sample dataset and get all combinations of grades and subjects
student_grades <- tibble(
  student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
  subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
  grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4))
)
all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3", "4", "5")
subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)
all_combos <- subjects_and_letter_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names = str_replace_all(names, "\\|", "_")) %>%
  pull(names)
# iterate over each combination using map_df()
student_map <- map_df(all_combos,
                      ~ student_grades %>%
                        mutate("{.x}" := paste(i)) %>%
                        group_by(student_id) %>%
                        mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
                                                         str_detect(.x, grade) ~ 1,
                                                       TRUE ~ 0), na.rm = TRUE)))
EDIT: For the record, my almost identical for loop does not pad the result with missing values. I assume it has something to do with how map_df builds the dataset, but I don't know how to override what map_df is doing under the hood.
student_map <- student_grades
for (i in all_combos) {
  student_map <- student_map %>%
    mutate("{i}" := paste(i)) %>%
    group_by(student_id) %>%
    mutate("{i}" := sum(case_when(str_detect(i, subject) &
                                    str_detect(i, grade) ~ 1,
                                  TRUE ~ 0), na.rm = TRUE))
}
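The loop avoids the padding because every pass adds a column to the same table instead of producing a separate table per combination. As a rough sketch of that accumulate pattern in purrr, `reduce()` can carry the table from one step to the next. The names `toy` and `new_cols` are hypothetical stand-ins for `student_grades` and `all_combos`, and the constant column is only a placeholder for the real count:

```r
library(purrr)
library(tibble)

toy <- tibble(id = 1:2)   # hypothetical stand-in for student_grades
new_cols <- c("a", "b")   # hypothetical stand-in for all_combos

# Fold over the names: each step receives the table built so far
# and returns it with one more column, like the body of the for loop.
accumulated <- reduce(new_cols, function(acc, nm) {
  acc[[nm]] <- 1          # placeholder value; the real code would compute a count
  acc
}, .init = toy)

dim(accumulated)
```

The result keeps the original number of rows, with one extra column per name and nothing to pad.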
There is no i in the map call, as the default lambda argument is .x. Also, it is better to use transmute instead of mutate, as we only need to return the columns added in each iteration, and then bind them with the original data at the end.
library(dplyr)
library(purrr)
library(stringr)
student_map2 <- map_dfc(all_combos,
                        ~ student_grades %>%
                          transmute(subject, grade, student_id, "{.x}" := .x) %>%
                          group_by(student_id) %>%
                          transmute("{.x}" := sum(case_when(str_detect(.x, subject) &
                                                              str_detect(.x, grade) ~ 1,
                                                            TRUE ~ 0), na.rm = TRUE)) %>%
                          ungroup %>%
                          select(-student_id)) %>%
  bind_cols(student_grades, .)
Checking against the OP's for loop output:
> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
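As for where the NAs come from: map_df() is an alias of map_dfr(), which row-binds the results. When each iteration returns the whole table plus one new column, the tables get stacked on top of each other, and every column that a given table lacks is filled with NA. A minimal sketch on a hypothetical two-row table (`toy`, `new_cols`, and `add_col` are made up for illustration) shows the difference between the two binding directions:

```r
library(dplyr)
library(purrr)
library(tibble)

toy <- tibble(id = 1:2)   # hypothetical stand-in for student_grades
new_cols <- c("a", "b")   # hypothetical stand-in for all_combos

# Returns the full table plus ONE new column, like the lambda in the question.
add_col <- function(nm) {
  out <- toy
  out[[nm]] <- 1
  out
}

# Row-binding (what map_df() does): tables are stacked and padded with NA.
stacked <- map_dfr(new_cols, add_col)

# Column-binding only the new column from each iteration: no padding.
side_by_side <- map_dfc(new_cols, ~ add_col(.x)[.x])
combined <- bind_cols(toy, side_by_side)

dim(stacked)
dim(combined)
```

Here `stacked` has twice as many rows as `toy`, with half of each new column missing, while `combined` keeps the original rows.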
Though I can't figure out why map_df() behaves in this undesirable way, I did find a solution, inspired heavily by the answer to this post.
solution <- student_map %>%
  group_by(student_id, subject, grade) %>%
  summarise_all(~ last(na.omit(.)))
solution
Basically, this code drops the NAs within each group and keeps the last non-missing value of each column; a column stays NA only if all of its values in the group are missing. Because those columns in my dataset will never have missing values, this solution works in my case.
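To illustrate the collapsing step on its own, here is a small sketch on a hypothetical NA-padded table (`padded` is made up for illustration). It uses `summarise(across(...))`, the current dplyr replacement for the superseded `summarise_all()`, with the same `last(na.omit(...))` idea:

```r
library(dplyr)
library(tibble)

# Hypothetical NA-padded table: each value lives in only one of the stacked rows.
padded <- tibble(
  id = c(1, 1, 2, 2),
  a  = c(5, NA, 7, NA),
  b  = c(NA, 6, NA, 8)
)

# Within each id, keep the last non-missing value of every other column.
collapsed <- padded %>%
  group_by(id) %>%
  summarise(across(everything(), ~ last(na.omit(.x))))

collapsed
```

Each id ends up on a single row holding the one non-missing value from each column.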