
Manipulating a column based on other column values with R dplyr package

I would like to keep the best two of three quiz results for each student (ranked by score, with attendance counting as better than absence) and drop the weakest one. In other words, for each row I want to pick the best 2 of 3 columns, then build a new data frame with the columns StudentID, ExamQuiz1, ExamQuiz2, ExamMidterm and ExamFinal. I could do this by looping over the table, but I assume that is too inefficient in R. What is an efficient way to do this with the dplyr package?

Minimal data

The sample data frame is shown below. "G" means the student did not attend the exam, so I want to keep that value rather than replace it with 0. For instance, given G (ExamQuiz1), 0 (ExamQuiz2) and 10 (ExamQuiz3), I have to choose 0 as ExamQuiz1 and 10 as ExamQuiz2, because 0 is better than G in terms of attendance: any numeric result means the student attended. Every cell in the exam columns holds either a number (the exam result) or the character value "G" (not attended). The ExamMidterm and ExamFinal columns must stay untouched; only ExamQuiz1, ExamQuiz2 and ExamQuiz3 are involved.

   StudentID  ExamQuiz1  ExamQuiz2  ExamQuiz3  ExamMidterm  ExamFinal
1      11111          0          G          G            G          G
2      22222          0          G         43           71         18
3      33333          0          G          G            G          G
4      44444          0          G          G            G          G
5      55555         60         38          G           64         27
6      66666          0          G          G            G          G

Edit: Some commenters keep pointing out that the data is not tidy. As I explained in the comments, neither that objection nor the suggested ways of tidying the data make sense for my case. For that reason, I have added more explanation to the question body without changing the structure of the data.

A base R solution

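cbind(df[-(2:4)], t(apply(df[2:4], 1, function(x){
  # All "G" entries first, then the scores in ascending order.
  # as.numeric() makes the sort numeric; sorting the strings would put "10" before "9".
  c(x[x == "G"], sort(as.numeric(x[x != "G"])))[-1]
})))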

#   StudentID Midterm Final  1  2
# 1     11111       G     G  G  0
# 2     22222      71    18  0 43
# 3     33333       G     G  G  0
# 4     44444       G     G  G  0
# 5     55555      64    27 38 60
# 6     66666       G     G  G  0

Under your rule, G ranks below any numeric score. So I first move all G values to the front of the vector and then append the scores sorted in ascending order. Dropping the first element of that vector removes the weakest entry, so the best two remain.
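To make the steps concrete, here is a small sketch on a single row, using the scenario from the question (G, 0, 10). The vector `x` and its values are illustrative only. The scores are converted with `as.numeric()` because `sort()` on character strings is alphabetical.

```r
# One illustrative row of quiz results (values assumed for the example).
x <- c(Quiz1 = "G", Quiz2 = "0", Quiz3 = "10")

absent <- x[x == "G"]                    # every absence goes to the front
scores <- sort(as.numeric(x[x != "G"]))  # numeric scores, ascending

ranked <- unname(c(absent, scores))      # worst to best, coerced to character
best_two <- ranked[-1]                   # drop the weakest entry
best_two                                 # c("0", "10")
```

Dropping exactly one element works here because each row always has three quiz entries.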

Here's an approach with dplyr's new across() (version 1.0.0 or higher):

Assuming no one can get a negative score and that being absent is worse than getting zero, we can simply recode G as -1.

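library(dplyr)
data %>% 
  # Recode "G" as the sentinel -1 and everything else as a number.
  # as.numeric() may warn about NAs for the "G" entries; case_when() replaces those with -1.
  mutate(across(-StudentID, ~case_when(. == "G" ~ -1,
                                       TRUE ~ as.numeric(.)))) %>%
  # Work row by row so max() and sort() see one student's quizzes at a time.
  rowwise() %>%
  mutate(TopQuiz = max(c_across(starts_with("Quiz"))),
         SecondQuiz = sort(c_across(starts_with("Quiz")),
                           decreasing = TRUE)[2]) %>%
  dplyr::select(StudentID, TopQuiz, SecondQuiz, Midterm, Final) %>%
  # Turn the sentinel back into "G" and return the columns to character.
  mutate(across(-StudentID, ~case_when(. == -1 ~ "G",
                                       TRUE ~ as.character(.))))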
##A tibble: 6 x 5
## Rowwise: 
#  StudentID TopQuiz SecondQuiz Midterm Final
#      <int> <chr>   <chr>      <chr>   <chr>
#1     11111 0       G          G       G    
#2     22222 43      0          71      18   
#3     33333 0       G          G       G    
#4     44444 0       G          G       G    
#5     55555 60      38         64      27   
#6     66666 0       G          G       G     
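The -1 sentinel works because it turns the rule "absent < 0 < any other score" into an ordinary numeric comparison. A minimal base R sketch of the same idea on one row follows. The helper name `top_two_sentinel` and its `sentinel` argument are hypothetical, and it assumes scores are never negative.

```r
# Hypothetical helper illustrating the sentinel idea on a single row.
top_two_sentinel <- function(x, sentinel = -1) {
  x <- as.character(x)
  num <- rep(sentinel, length(x))           # start with every entry marked absent
  present <- x != "G"
  num[present] <- as.numeric(x[present])    # fill in the real scores
  top <- sort(num, decreasing = TRUE)[1:2]  # two best entries, highest first
  ifelse(top == sentinel, "G", as.character(top))  # map the sentinel back to "G"
}

top_two_sentinel(c("0", "G", "43"))  # "43" "0"
top_two_sentinel(c("0", "G", "G"))   # "0"  "G"
```

If a real score could ever equal the sentinel, it would be turned into "G" on the way back, so the sentinel has to lie outside the valid score range.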

A slightly different approach using dplyr and stringr: convert G to NA to do the math, then turn NA back into G and return the columns to character.

library(dplyr)
library(stringr)

newgrades <- grades %>% 
  # Replace "G" with NA so the quiz columns can be converted to numbers.
  mutate(across(starts_with("Quiz"), ~ str_replace(., "G", NA_character_))) %>%
  mutate(across(starts_with("Quiz"), as.numeric)) %>%
  # Work row by row; sort() drops NA by default, so [2] is NA when only one score exists.
  rowwise() %>%
  mutate(TopQuiz = max(c_across(starts_with("Quiz")), na.rm = TRUE),
         NextBestQuiz = sort(c_across(starts_with("Quiz")),
                             decreasing = TRUE)[2]) %>%
  # Back to character, with NA shown as "G" again.
  mutate(across(ends_with("Quiz"), as.character)) %>%
  mutate(across(ends_with("Quiz"), ~ str_replace_na(., replacement = "G"))) %>%
  select(id, TopQuiz, NextBestQuiz, Midterm, Final)

newgrades
#> # A tibble: 6 x 5
#> # Rowwise: 
#>      id TopQuiz NextBestQuiz Midterm Final
#>   <int> <chr>   <chr>        <chr>   <chr>
#> 1     1 0       G            G       G    
#> 2     2 43      0            71      18   
#> 3     3 0       G            G       G    
#> 4     4 0       G            G       G    
#> 5     5 60      38           64      27   
#> 6     6 0       G            G       G
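One edge case is not covered by the sample data: a student who missed all three quizzes. In base R, `max()` with `na.rm = TRUE` on a vector that contains only NA returns -Inf with a warning, so such a row would not come back as G. A possible guard is sketched below; the helper name `safe_max` is hypothetical and not part of dplyr or stringr.

```r
# Hypothetical helper: behaves like max(x, na.rm = TRUE), but returns NA
# instead of -Inf (with a warning) when every value is missing.
safe_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

safe_max(c(0, NA, 43))   # 43
safe_max(c(NA, NA, NA))  # NA
```

Used in place of `max(..., na.rm = TRUE)`, this would leave an NA that the later NA-to-G step can convert like any other absence.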

Your data

grades <- data.frame(
  id = c(1:6),
  Quiz1 = c("0","0","0","0","60","0"),
  Quiz2 = c("G","G","G","G","38","G"),
  Quiz3 = c("G","43","G","G","G","G"),
  Midterm = c("G","71","G","G","64","G"),
  Final = c("G","18","G","G","27","G")
)
