Manipulating a column based on other column values with R dplyr package
I would like to keep each student's best two quiz results out of three and drop the weakest one, ranking by score and by attendance. In other words, for each row I want to choose the best two of three columns, and then create a new data frame with the columns StudentID, ExamQuiz1, ExamQuiz2, ExamMidterm and ExamFinal. I could do this by looping over the table, but I assume that is too inefficient in R. What is an efficient way to do this with the dplyr package?
Minimal data
The sample data frame is shown below. "G" means the student did not attend the exam, so I would like to keep that value rather than replace it with 0. For instance, if a student has G (ExamQuiz1), 0 (ExamQuiz2) and 10 (ExamQuiz3), I have to choose 0 as ExamQuiz1 and 10 as ExamQuiz2, because 0 is better than G in terms of attendance. A numeric result means the student attended the exam. Every cell in the exam columns holds either a number (the exam result) or the character value "G" (not attended). I will not touch the ExamMidterm and ExamFinal columns; the question concerns only ExamQuiz1, ExamQuiz2 and ExamQuiz3.
  StudentID ExamQuiz1 ExamQuiz2 ExamQuiz3 ExamMidterm ExamFinal
1     11111         0         G         G           G         G
2     22222         0         G        43          71        18
3     33333         0         G         G           G         G
4     44444         0         G         G           G         G
5     55555        60        38         G          64        27
6     66666         0         G         G           G         G
Edit: Some commenters keep pointing out that the data is not tidy. As I explained in the comments, neither the reasons given nor the suggested ways of tidying it make sense for my case. For that reason, I have added more explanation to the question body without changing the structure of the data.
A base R solution
cbind(df[-(2:4)], t(apply(df[2:4], 1, function(x){
  # "G" entries first, then scores sorted numerically (not as strings); drop the worst
  c(x[x == "G"], sort(as.numeric(x[x != "G"])))[-1]
})))
# StudentID Midterm Final 1 2
# 1 11111 G G G 0
# 2 22222 71 18 0 43
# 3 33333 G G G 0
# 4 44444 G G G 0
# 5 55555 64 27 38 60
# 6 66666 G G G 0
Under your rule, G ranks below any number. So I first put all the G entries at the beginning of a vector and append the sorted scores. After removing the first element of that vector, the top 2 results remain.
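The rule described above can be tried on a single row before applying it to the whole data frame. The following is only a sketch: `best_two` is a hypothetical helper name, and it assumes each row arrives as a character vector in which absences are coded as "G".

```r
# Hypothetical helper: keep the best two of three quiz entries,
# where "G" (absent) ranks below any numeric score.
best_two <- function(x) {
  absent <- x[x == "G"]
  scores <- sort(as.numeric(x[x != "G"]))  # ascending, in numeric order
  ranked <- c(absent, scores)              # worst first; coerced to character
  ranked[-1]                               # drop the single worst entry
}

best_two(c("G", "0", "10"))    # "0" "10"
best_two(c("60", "38", "G"))   # "38" "60"
best_two(c("100", "60", "5"))  # "60" "100"
```

The last call shows why the scores are converted with as.numeric() before sorting: sorted as strings, "100" would come before "5".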
Here's an approach with dplyr's new across() (version 1.0.0 or higher). Assuming no one can get a negative score and being absent is worse than getting zero, we can just set G to -1.
library(dplyr)
data %>%
  mutate(across(-StudentID, ~case_when(. == "G" ~ -1,
                                       TRUE ~ as.numeric(.)))) %>%
  rowwise() %>%
  mutate(TopQuiz = max(c_across(starts_with("Quiz"))),
         SecondQuiz = sort(c_across(starts_with("Quiz")),
                           decreasing = TRUE)[2]) %>%
  dplyr::select(StudentID, TopQuiz, SecondQuiz, Midterm, Final) %>%
  mutate(across(-StudentID, ~case_when(. == -1 ~ "G",
                                       TRUE ~ as.character(.))))
##A tibble: 6 x 5
## Rowwise:
# StudentID TopQuiz SecondQuiz Midterm Final
# <int> <chr> <chr> <chr> <chr>
#1 11111 0 G G G
#2 22222 43 0 71 18
#3 33333 0 G G G
#4 44444 0 G G G
#5 55555 60 38 64 27
#6 66666 0 G G G
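The sentinel idea can also be checked on a single row in base R. This is a sketch under the same assumption that scores are never negative. The variable names are illustrative, and it uses ifelse() with suppressWarnings() rather than case_when(), because as.numeric("G") produces a coercion warning.

```r
scores <- c("G", "0", "10")

# Map "G" to the sentinel -1; as.numeric("G") is NA with a warning, so suppress it
num <- ifelse(scores == "G", -1, suppressWarnings(as.numeric(scores)))

# The two largest values; the sentinel can only be kept when there are fewer than two scores
top2 <- sort(num, decreasing = TRUE)[1:2]

# Map the sentinel back to "G"
out <- ifelse(top2 == -1, "G", as.character(top2))
out  # "10" "0"
```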
A slightly different way using dplyr and stringr: turn G into NA to do the math, then turn NA back into G and convert back to character.
library(dplyr)
library(stringr)
newgrades <- grades %>%
  mutate(across(starts_with("Quiz"), ~ str_replace(., "G", NA_character_))) %>%
  mutate(across(starts_with("Quiz"), as.numeric)) %>%
  rowwise() %>%
  mutate(TopQuiz = max(c_across(starts_with("Quiz")), na.rm = TRUE),
         NextBestQuiz = sort(c_across(starts_with("Quiz")),
                             decreasing = TRUE)[2]) %>%
  mutate(across(ends_with("Quiz"), as.character)) %>%
  mutate(across(ends_with("Quiz"), ~ str_replace_na(., replacement = "G"))) %>%
  select(id, TopQuiz, NextBestQuiz, Midterm, Final)
newgrades
#> # A tibble: 6 x 5
#> # Rowwise:
#> id TopQuiz NextBestQuiz Midterm Final
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 0 G G G
#> 2 2 43 0 71 18
#> 3 3 0 G G G
#> 4 4 0 G G G
#> 5 5 60 38 64 27
#> 6 6 0 G G G
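The NA approach relies on how base R treats missing values. A small sketch on one row, with illustrative variable names, shows why NextBestQuiz becomes NA (and later "G") when a student attended only one quiz:

```r
quiz <- c(0, NA, NA)                      # one attended quiz, two absences

top <- max(quiz, na.rm = TRUE)            # 0
nxt <- sort(quiz, decreasing = TRUE)[2]   # NA: sort() drops NAs, so there is no 2nd element

label <- ifelse(is.na(nxt), "G", as.character(nxt))
label  # "G"
```

Note that if a student missed all three quizzes, max(..., na.rm = TRUE) returns -Inf with a warning, so that case would need separate handling.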
Your data
grades <- data.frame(
  id = c(1:6),
  Quiz1 = c("0","0","0","0","60","0"),
  Quiz2 = c("G","G","G","G","38","G"),
  Quiz3 = c("G","43","G","G","G","G"),
  Midterm = c("G","71","G","G","64","G"),
  Final = c("G","18","G","G","27","G")
)