
Manipulating a column based on other column values with R dplyr package

I would like to choose the best 2 quiz results (highest score, with attendance counting in the student's favor) for each student and drop the weakest of the 3 quizzes. In other words, for each row I want to pick the best 2 of 3 columns, then create a new data frame with StudentID, ExamQuiz1, ExamQuiz2, ExamMidterm and ExamFinal. I could loop through the table, but I assume that is too inefficient in R. What is an efficient way to do this with the dplyr package?

Minimal data

The sample data frame is shown below. "G" means the student did not attend the exam, so I would like to keep that value rather than replace it with 0. For instance, given G (ExamQuiz1), 0 (ExamQuiz2), 10 (ExamQuiz3), I have to choose 0 as ExamQuiz1 and 10 as ExamQuiz2, because 0 is better than G on attendance grounds. A numeric result means the student attended. Every cell in the ExamQuiz1, ExamQuiz2, ExamQuiz3, ExamMidterm and ExamFinal columns holds either a number (exam result) or the character value "G" (not attended). I will not touch the ExamMidterm and ExamFinal columns; the task only concerns ExamQuiz1, ExamQuiz2 and ExamQuiz3.

   StudentID  ExamQuiz1  ExamQuiz2  ExamQuiz3  ExamMidterm  ExamFinal
1      11111          0          G          G            G          G
2      22222          0          G         43           71         18
3      33333          0          G          G            G          G
4      44444          0          G          G            G          G
5      55555         60         38          G           64         27
6      66666          0          G          G            G          G

Edit: Some commenters keep pointing out that the data is not tidy. As I explained in the comments, the reasons given and the suggested ways to tidy it do not make sense for my case. I have therefore added more explanation to the question body without changing the structure of the data.

A base R solution

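# Columns 2:4 are the quizzes; "G" goes first, then scores in ascending order.
cbind(df[-(2:4)], t(apply(df[2:4], 1, function(x){
  # as.numeric() sorts by value (as text, "100" sorts before "9");
  # [-1] then drops the weakest of the three entries.
  c(x[x == "G"], sort(as.numeric(x[x != "G"])))[-1]
})))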

#   StudentID Midterm Final  1  2
# 1     11111       G     G  G  0
# 2     22222      71    18  0 43
# 3     33333       G     G  G  0
# 4     44444       G     G  G  0
# 5     55555      64    27 38 60
# 6     66666       G     G  G  0

Under your rule, G ranks below any number. So I first move all existing G values to the front of the vector and append the sorted scores. After removing the first element of that vector, the top 2 results remain.
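To see the trick on single rows, here is a minimal sketch. The helper name `best_two` is hypothetical and not part of the original answer, and it converts the scores to numeric before sorting, which is an assumption added here so that multi-digit scores are ordered by value rather than as text.

```r
# Hypothetical helper (not from the original answer): keep the best two of
# three quiz entries, where "G" (absent) ranks below every numeric score.
best_two <- function(x) {
  absent <- x[x == "G"]                    # all "G" entries, moved to the front
  scores <- sort(as.numeric(x[x != "G"]))  # attended scores, ascending by value
  unname(c(absent, scores)[-1])            # drop the first (weakest) entry
}

best_two(c("G", "0", "10"))     # "0"  "10"
best_two(c("60", "38", "G"))    # "38" "60"
best_two(c("0", "G", "G"))      # "G"  "0"
best_two(c("100", "9", "50"))   # "50" "100"
```

The last call shows why the numeric conversion matters: sorted as text, "100" would come first and be the entry that gets dropped.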

Here's an approach with dplyr's new across() (version 1.0.0 or higher):

Assuming no one can get a negative score and being absent is worse than getting zero, we can just set G to -1.

library(dplyr)
data %>% 
  mutate(across(-StudentID, ~case_when(. == "G" ~ -1,
                                       TRUE ~ as.numeric(.)))) %>%
  rowwise() %>%
  mutate(TopQuiz = max(c_across(starts_with("Quiz"))),
         SecondQuiz = sort(c_across(starts_with("Quiz")),
                           decreasing = TRUE)[2]) %>%
  dplyr::select(StudentID, TopQuiz, SecondQuiz, Midterm, Final) %>%
  mutate(across(-StudentID, ~case_when(. == -1 ~ "G",
                                       TRUE ~ as.character(.))))
##A tibble: 6 x 5
## Rowwise: 
#  StudentID TopQuiz SecondQuiz Midterm Final
#      <int> <chr>   <chr>      <chr>   <chr>
#1     11111 0       G          G       G    
#2     22222 43      0          71      18   
#3     33333 0       G          G       G    
#4     44444 0       G          G       G    
#5     55555 60      38         64      27   
#6     66666 0       G          G       G     
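The sentinel idea can also be checked on a single row in base R. This is only an illustrative sketch: the variable names are made up, and it relies on the same assumption as the answer, namely that no real score is negative.

```r
# One student's three quiz entries; "G" means absent.
quiz <- c("G", "0", "10")

# Replace "G" with the sentinel -1 so that absence sorts below a score of 0.
# suppressWarnings() hides the coercion warning that as.numeric("G") raises.
scores <- ifelse(quiz == "G", -1, suppressWarnings(as.numeric(quiz)))

# Keep the two largest values, then map the sentinel back to "G".
top_two  <- sort(scores, decreasing = TRUE)[1:2]
restored <- ifelse(top_two == -1, "G", as.character(top_two))

restored   # "10" "0"
```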

A slightly different approach with dplyr and stringr: turn G into NA to do the math, then turn NA back into G and convert back to character.

library(dplyr)
library(stringr)


newgrades <- grades %>% 
  mutate(across(starts_with("Quiz"), ~ str_replace(., "G", NA_character_))) %>%
  mutate(across(starts_with("Quiz"), as.numeric)) %>%
  rowwise() %>%
  mutate(TopQuiz = max(c_across(starts_with("Quiz")), na.rm = TRUE),
         NextBestQuiz = sort(c_across(starts_with("Quiz")),
                             decreasing = TRUE)[2]) %>%
  mutate(across(ends_with("Quiz"), as.character)) %>%
  mutate(across(ends_with("Quiz"), ~ str_replace_na(., replacement = "G"))) %>%
  select(id, TopQuiz, NextBestQuiz, Midterm, Final)

newgrades
#> # A tibble: 6 x 5
#> # Rowwise: 
#>      id TopQuiz NextBestQuiz Midterm Final
#>   <int> <chr>   <chr>        <chr>   <chr>
#> 1     1 0       G            G       G    
#> 2     2 43      0            71      18   
#> 3     3 0       G            G       G    
#> 4     4 0       G            G       G    
#> 5     5 60      38           64      27   
#> 6     6 0       G            G       G
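One edge case is worth keeping in mind with the NA-based approach: the sample data has no student who missed all three quizzes, but for such a row `max(..., na.rm = TRUE)` returns `-Inf` (with a warning) rather than `NA`, so the value would not be turned back into G. A possible guard, sketched in base R with a hypothetical helper name `safe_max`:

```r
# Hypothetical helper: the largest score, or NA when every entry is missing.
safe_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

all_absent <- c(NA_real_, NA_real_, NA_real_)    # "G" in all three quizzes

suppressWarnings(max(all_absent, na.rm = TRUE))  # -Inf
safe_max(all_absent)                             # NA, which can become "G" again
safe_max(c(0, NA, 43))                           # 43
```

Under that assumption, `safe_max` could stand in for `max(..., na.rm = TRUE)` when computing the top quiz.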

Your data

grades <- data.frame(
  id = c(1:6),
  Quiz1 = c("0","0","0","0","60","0"),
  Quiz2 = c("G","G","G","G","38","G"),
  Quiz3 = c("G","43","G","G","G","G"),
  Midterm = c("G","71","G","G","64","G"),
  Final = c("G","18","G","G","27","G")
)
