
How to summarize the top n values across multiple columns row wise?

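In my dataframe, I have multiple columns with student grades. I would like to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values). There is also the issue of grades that tie for the top 2 in a given row. For example, Aaron has a high score of 42, but two scores tie for the second highest (i.e., 36).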
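
To make the target concrete, here is a minimal sketch of the calculation for a single student, using Aaron's scores from the data in this question. The variable names `aaron`, `top2`, and `total` are only illustrative.

```r
# Aaron's four quiz scores
aaron <- c(Quiz1 = 31, Quiz2 = 42, Quiz3 = 36, Quiz4 = 36)

# Sort from highest to lowest and keep the first two entries
top2 <- sort(aaron, decreasing = TRUE)[1:2]

# The tie at 36 does not affect the total: either 36 gives 42 + 36
total <- sum(top2)
total
#> [1] 78
```

The tie only matters if you also want to show which quiz columns were kept, as in the expected output.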

Data

df <- 
  structure(
  list(
    Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                "Timothy", "Grant", "Chloe", "Judy", "Justin"),
    ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
           98432, 19487, 94029),
    Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
    Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
    Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
    Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
  ),
  row.names = c(NA, -10L),
  class = c("tbl_df", "tbl", "data.frame")
)

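I know that I could use pivot_longer to do this, which allows me to arrange by group and then take the top 2 values for each student. This works fine, but I feel like there should be a more efficient way with the tidyverse, rather than pivoting back and forth.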

What I've Tried

df %>%
  tidyr::pivot_longer(-c(Student, ID)) %>%
  dplyr::group_by(Student, ID) %>%
  dplyr::arrange(desc(value), .by_group = TRUE) %>%
  dplyr::slice_head(n = 2) %>%
  tidyr::pivot_wider(names_from = name, values_from = value) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))

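I also know that if I wanted to sum all columns in each row, I could use rowSums, as I did above. However, I am unsure how to apply rowSums to just the top 2 values of the 4 quiz columns.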

Expected Output

# A tibble: 10 × 7
   Student      ID Quiz2 Quiz3 Quiz1 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    42    36    NA    NA    78
 2 Charlotte 61755    NA    NA    41    39    80
 3 Chloe     98432    49    NA    42    NA    91
 4 Grant     34096    48    44    NA    NA    92
 5 James     87311    NA    36    NA    43    79
 6 Judy      19487    NA    42    NA    41    83
 7 Justin    94029    NA    37    NA    41    78
 8 Katie     55323    NA    32    NA    46    78
 9 Olivia    94839    NA    43    NA    40    83
10 Timothy   38209    38    38    NA    NA    76

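Using base R: select just the quiz result columns, which you can then treat as a matrix. Apply a sort in descending order to each row, subset the first two elements, then use colSums.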

df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))

df
#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    31    42    36    36    78
#>  2 James     87311    25    33    36    43    79
#>  3 Charlotte 61755    41    34    34    39    80
#>  4 Katie     55323    10    22    32    46    78
#>  5 Olivia    94839    35    23    43    40    83
#>  6 Timothy   38209    19    38    38    38    76
#>  7 Grant     34096    27    48    44    43    92
#>  8 Chloe     98432    42    49    42    35    91
#>  9 Judy      19487    15    23    42    41    83
#> 10 Justin    94029    20    30    37    41    78
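
The reason for colSums rather than rowSums is that apply() over rows returns each row's result as a column. A small sketch with the first three students from the data (the object names `quiz` and `top2` are illustrative) shows the intermediate shape:

```r
# Quiz scores for three students as a plain matrix
quiz <- matrix(
  c(31, 42, 36, 36,
    25, 33, 36, 43,
    41, 34, 34, 39),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Aaron", "James", "Charlotte"), paste0("Quiz", 1:4))
)

# For each row, keep the two largest scores
top2 <- apply(quiz, 1, function(x) sort(x, decreasing = TRUE)[1:2])

# Each student is now a column, so the result is 2 x 3
dim(top2)
#> [1] 2 3

# Summing down the columns gives one total per student
colSums(top2)
#>     Aaron     James Charlotte 
#>        78        79        80 
```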

Based on this StackOverflow answer:

library(tidyverse)

df <- 
  structure(
    list(
      Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                  "Timothy", "Grant", "Chloe", "Judy", "Justin"),
      ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
             98432, 19487, 94029),
      Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
      Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
      Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
      Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
    ),
    row.names = c(NA, -10L),
    class = c("tbl_df", "tbl", "data.frame")
  )

df %>%
  rowwise() %>% 
  mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>% 
  ungroup()
#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
#>  1 Aaron     30016    31    42    36    36         78
#>  2 James     87311    25    33    36    43         79
#>  3 Charlotte 61755    41    34    34    39         80
#>  4 Katie     55323    10    22    32    46         78
#>  5 Olivia    94839    35    23    43    40         83
#>  6 Timothy   38209    19    38    38    38         76
#>  7 Grant     34096    27    48    44    43         92
#>  8 Chloe     98432    42    49    42    35         91
#>  9 Judy      19487    15    23    42    41         83
#> 10 Justin    94029    20    30    37    41         78
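
One caveat with the `sort(...)[1:2]` idiom: sort() drops NA values by default, so a row with fewer than two scores would be padded with NA and the sum would become NA. The data in this question has no missing scores, so this is only a sketch of a more defensive variant; `top_n_sum` is a hypothetical helper, not part of any answer here.

```r
# Hypothetical helper: sum the n largest values of x,
# ignoring NAs and tolerating fewer than n available scores
top_n_sum <- function(x, n = 2) {
  # sort() removes NA by default; head() returns at most n elements
  sum(head(sort(x, decreasing = TRUE), n))
}

top_n_sum(c(31, 42, 36, 36))
#> [1] 78

# Only one score available: head() avoids the NA padding of [1:2]
top_n_sum(c(31, NA, NA, NA))
#> [1] 31
sum(sort(c(31, NA, NA, NA), decreasing = TRUE)[1:2])
#> [1] NA
```

Inside the rowwise pipeline, the helper could replace `sum(sort(...)[1:2])` directly.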

A (slightly messy) base R solution:

# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)

# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]

# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))

# Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
res <- data.frame(
  do.call(
    rbind,
    lapply(
      with(
        df,
        split(
          df,
          Idx 
        )
      ),
      function(x){
        # Extract the top 2 quiz vectors: top_2_quizes => named integer vector
        top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
        # Calculate the quiz columns not used: remainder_quiz_cols => character vector
        remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
        # Nullify the remaining quizes: x => data.frame 
        x[, remainder_quiz_cols] <- NA_integer_
        # Calculate the resulting data.frame: data.frame => env 
        transform(
          cbind(
            x[,non_quiz_colnames], 
            x[,names(top_2_quizes)],
            x[,remainder_quiz_cols]
          ),
          Total = sum(top_2_quizes)
        )[,c(non_quiz_colnames, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
      }
    )
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)

Try this base R approach:

df <- cbind( df[,1:2], t( sapply( seq_along(1:nrow(df)), function(x){
  y <- order(df[x,3:6])[1:2]; z <- df[x,3:6]; z[y] <- NA; z } ) ) )

df$Total <- rowSums( matrix( unlist(df[,3:6]), dim(df[,3:6]) ), na.rm=T)
df
     Student    ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1      Aaron 30016    NA    42    NA    36    78
2      James 87311    NA    NA    36    43    79
3  Charlotte 61755    41    NA    NA    39    80
4      Katie 55323    NA    NA    32    46    78
5     Olivia 94839    NA    NA    43    40    83
6    Timothy 38209    NA    NA    38    38    76
7      Grant 34096    NA    48    44    NA    92
8      Chloe 98432    NA    49    42    NA    91
9       Judy 19487    NA    NA    42    41    83
10    Justin 94029    NA    NA    37    41    78

You don't have to do the pivot_wider. Note that the longer format is the tidy format. Just do a pivot_longer followed by a left_join:

df %>% 
  left_join(pivot_longer(., -c(Student, ID)) %>%
  group_by(Student, ID) %>%
  summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))

# A tibble: 10 x 7
   Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    31    42    36    36    78
 2 James     87311    25    33    36    43    79
 3 Charlotte 61755    41    34    34    39    80
 4 Katie     55323    10    22    32    46    78
 5 Olivia    94839    35    23    43    40    83
 6 Timothy   38209    19    38    38    38    76
 7 Grant     34096    27    48    44    43    92
 8 Chloe     98432    42    49    42    35    91
 9 Judy      19487    15    23    42    41    83
10 Justin    94029    20    30    37    41    78
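
When `by` is omitted, left_join joins on all columns the two tables share (here Student and ID) and prints a message saying so. A minimal sketch of the same idea with the join keys written out, using two students from the data (the names `scores`, `totals`, and `result` are illustrative):

```r
library(dplyr)
library(tidyr)

# Two students from the data above
scores <- tibble(
  Student = c("Aaron", "James"),
  ID      = c(30016, 87311),
  Quiz1   = c(31, 25),
  Quiz2   = c(42, 33),
  Quiz3   = c(36, 36),
  Quiz4   = c(36, 43)
)

# Long format: one row per student and quiz, then one total per student
totals <- scores %>%
  pivot_longer(starts_with("Quiz")) %>%
  group_by(Student, ID) %>%
  summarise(Total = sum(sort(value, decreasing = TRUE)[1:2]), .groups = "drop")

# Attach the totals back to the wide table with explicit join keys
result <- left_join(scores, totals, by = c("Student", "ID"))
result$Total
#> [1] 78 79
```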

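As @akrun suggested above, collapse is another efficient option. radixorder returns an integer ordering vector; applying it twice gives each value's rank within its row, so the two lowest values in each row are replaced with NA and only the top 2 are kept. Then rowSums is used to get the total for each row.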

library(collapse)

ftransform(gvr(df, "Student|ID"),
           dapply(
             gvr(df, "^Quiz"),
             MARGIN = 1,
             FUN = function(x)
               replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
           )) %>%
  ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))

Output

# A tibble: 10 × 7
   Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
 * <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    NA    42    NA    36    78
 2 James     87311    NA    NA    36    43    79
 3 Charlotte 61755    41    NA    NA    39    80
 4 Katie     55323    NA    NA    32    46    78
 5 Olivia    94839    NA    NA    43    40    83
 6 Timothy   38209    NA    NA    38    38    76
 7 Grant     34096    NA    48    44    NA    92
 8 Chloe     98432    NA    49    42    NA    91
 9 Judy      19487    NA    NA    42    41    83
10 Justin    94029    NA    NA    37    41    78
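
The double-ordering trick can be checked in base R. This sketch assumes that radixorder behaves like base order() for a plain numeric vector; it uses Aaron's scores to show that ordering an ordering yields ranks, with ties broken by position.

```r
x <- c(31, 42, 36, 36)

# Positions of the values from smallest to largest
order(x)
#> [1] 1 3 4 2

# Ordering the ordering gives the rank of each element
order(order(x))
#> [1] 1 4 2 3

# Same as rank() with ties broken by first occurrence
identical(order(order(x)), rank(x, ties.method = "first"))
#> [1] TRUE

# Blank out the two lowest-ranked values, keeping the top 2
replace(x, order(order(x)) %in% 1:2, NA)
#> [1] NA 42 NA 36
```

This matches Aaron's row in the output above, where the first 36 (Quiz3) is dropped and the second (Quiz4) is kept.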

Another tidyverse-based solution:

library(tidyverse)

df %>% 
  rowwise %>% 
  mutate(ranks=list(rank(c_across(starts_with("Quiz")), ties.method="last"))) %>%
  unnest_wider(ranks, names_sep = "") %>% 
  mutate(across(starts_with("Quiz"), ~ if_else(get(str_c("ranks",
     parse_number(cur_column()))) >= 3, .x, NA_real_))) %>% 
  select(!starts_with("ranks")) %>%
  mutate(total = rowSums(.[,-(1:2)], na.rm = T))

#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    NA    42    36    NA    78
#>  2 James     87311    NA    NA    36    43    79
#>  3 Charlotte 61755    41    NA    NA    39    80
#>  4 Katie     55323    NA    NA    32    46    78
#>  5 Olivia    94839    NA    NA    43    40    83
#>  6 Timothy   38209    NA    38    38    NA    76
#>  7 Grant     34096    NA    48    44    NA    92
#>  8 Chloe     98432    42    49    NA    NA    91
#>  9 Judy      19487    NA    NA    42    41    83
#> 10 Justin    94029    NA    NA    37    41    78
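
The choice of `ties.method = "last"` decides which of two tied scores survives. A small base R sketch with Aaron's scores (the names `x`, `r`, and `kept` are illustrative) shows why Quiz3 is kept here while Quiz4 is dropped:

```r
x <- c(Quiz1 = 31, Quiz2 = 42, Quiz3 = 36, Quiz4 = 36)

# With "last", the earlier of two tied values gets the higher rank
r <- rank(x, ties.method = "last")
r
#> Quiz1 Quiz2 Quiz3 Quiz4 
#>     1     4     3     2 

# With four quizzes, ranks 3 and 4 are the top two scores
kept <- ifelse(r >= 3, x, NA)
kept
#> Quiz1 Quiz2 Quiz3 Quiz4 
#>    NA    42    36    NA 

sum(kept, na.rm = TRUE)
#> [1] 78
```

Note that the `>= 3` threshold is tied to having exactly four quiz columns; with k columns and the top n wanted, the cutoff would be `k - n + 1`.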

