Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?

I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want one column for each combination of subject and possible grade (math_1, science_2, etc.).

I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs, and only one chunk of rows per column has actual data. I'm curious to know either:

  1. Why is map_df() doing this, and how can I avoid it? OR
  2. How can I tighten this up so that I have this information on only one row per original row of the first dataset (18 rows)? In other words, I'd collapse each column up and down so that all the NAs are filled in (unless data was truly missing).

Here's my code:

library(tidyverse)

#Set up - generate sample dataset and get all combinations of grades and subjects

student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
                         subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
                         grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))

all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3", 
                "4", "5")

subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)

all_combos <- subjects_and_letter_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names = str_replace_all(names, "\\|", "_")) %>%
  pull(names)


# iterate over each combination using map_df()
student_map <- map_df(all_combos,
                        ~student_grades %>%
                          mutate("{.x}" := paste(i)) %>%
                          group_by(student_id) %>%
                          mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
                                                           str_detect(.x, grade) ~ 1,
                                                         TRUE ~ 0), na.rm = T)))

EDIT: For the record, my almost identical for loop does not pad the result with missing values. I assume the difference has to do with how map_df builds the dataset, but I don't know how to override what it is doing under the hood.

student_map <- student_grades
for(i in all_combos) {
  student_map <- student_map %>%
    mutate("{i}" := paste(i)) %>%
    group_by(student_id) %>%
    mutate("{i}" := sum(case_when(str_detect(i, subject) &
                                    str_detect(i, grade) ~ 1,
                                  TRUE ~ 0), na.rm = T)) 
}

There is no i in the map call, because the default lambda argument is .x. Also, it is better to use transmute instead of mutate, since each iteration only needs to return the column it adds; the new columns are then bound column-wise to the original data at the end (hence map_dfc() and bind_cols() rather than map_df()).

library(dplyr)
library(purrr)
library(stringr)

# Each iteration returns only its own new column; map_dfc() binds those
# columns side by side, and bind_cols() attaches them to the original data
student_map2 <- map_dfc(all_combos,
  ~ student_grades %>%
    transmute(subject, grade, student_id, "{.x}" := .x) %>%
    group_by(student_id) %>%
    transmute("{.x}" := sum(case_when(str_detect(.x, subject) &
                                        str_detect(.x, grade) ~ 1,
                                      TRUE ~ 0), na.rm = TRUE)) %>%
    ungroup() %>%
    select(-student_id)) %>%
  bind_cols(student_grades, .)

Checking against the output of the OP's for loop:

> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
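
As for where the NAs in the question come from: map_df() is an alias for map_dfr(), which stacks the result of each iteration with bind_rows(). Every iteration returns a data frame with a differently named new column, so the stacked result contains one block of rows per iteration, with NA in every column that iteration did not create. A minimal sketch of the difference, assuming a made-up two-row tibble `toy` and made-up column names `a` and `b` (none of these come from the original post):

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical toy data, not the poster's dataset
toy <- tibble(id = c(1, 2))
new_cols <- c("a", "b")

# Row-binding: each iteration contributes its own block of rows,
# and columns that a block lacks are filled with NA
by_rows <- map_dfr(new_cols, ~ toy %>% mutate("{.x}" := 1))

# Column-binding: each iteration contributes only its new column,
# so the row count stays the same as in `toy`
by_cols <- map_dfc(new_cols, ~ toy %>% transmute("{.x}" := 1)) %>%
  bind_cols(toy, .)

dim(by_rows)  # 4 rows, 3 columns
dim(by_cols)  # 2 rows, 3 columns
```

With the 20 subject/grade combinations in the question, the row-bound version would have 20 blocks of 18 rows each, which matches the "one chunk per column" pattern described there.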

Though I can't figure out why map_df() behaves in this undesirable way, I did find a solution, inspired heavily by the answer to this post.

solution <- student_map %>% 
  group_by(student_id, subject, grade) %>%
  summarise_all(~ last(na.omit(.)))

solution

Basically, this code drops the NAs within each group and keeps the last remaining value, so a column ends up NA only if every value in the group was missing. Because those columns in my dataset will never have missing values, this solution works in my case.
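
The collapsing step is easier to see on a tiny example. This is only a sketch on a made-up tibble `padded` (hypothetical, not the poster's data). It uses summarise(across(...)), which current dplyr recommends in place of the superseded summarise_all(); the logic is otherwise the same as in the solution above.

```r
library(dplyr)
library(tibble)

# Hypothetical padded data: one key and two value columns,
# each value present in only one of the stacked rows
padded <- tibble(
  id = c(1, 1),
  a  = c(5, NA),
  b  = c(NA, 7)
)

collapsed <- padded %>%
  group_by(id) %>%
  # na.omit() drops the NAs within the group; last() keeps the final
  # remaining value
  summarise(across(everything(), ~ last(na.omit(.x))), .groups = "drop")

collapsed
```

Note that this approach picks a single value per group and column, so it is only safe when each group has at most one distinct non-missing value per column, as is the case for the padded output in the question.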
