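How to sum the top n values across multiple columns row-wise?
In my dataframe, I have multiple columns containing student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values). There is also the issue of grades that tie for the top 2 values in a given row. For example, Aaron's high score is 42, but two scores tie for second highest (i.e., 36).
Data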
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
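I know I can do this with pivot_longer, which lets me arrange by group and then take the top 2 values for each student. This works fine, but I feel there should be a more efficient way to do it in the tidyverse than pivoting back and forth.
What I Tried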
df %>%
tidyr::pivot_longer(-c(Student, ID)) %>%
dplyr::group_by(Student, ID) %>%
dplyr::arrange(desc(value), .by_group = TRUE) %>%
dplyr::slice_head(n = 2) %>%
tidyr::pivot_wider(names_from = name, values_from = value) %>%
dplyr::ungroup() %>%
dplyr::mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))
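I also know that if I wanted to sum all of the columns for each row, I could use rowSums, as I did above. However, I am not sure how to apply rowSums to only the top 2 values across the 4 quiz columns.
Expected Output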
# A tibble: 10 × 7
Student ID Quiz2 Quiz3 Quiz1 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 42 36 NA NA 78
2 Charlotte 61755 NA NA 41 39 80
3 Chloe 98432 49 NA 42 NA 91
4 Grant 34096 48 44 NA NA 92
5 James 87311 NA 36 NA 43 79
6 Judy 19487 NA 42 NA 41 83
7 Justin 94029 NA 37 NA 41 78
8 Katie 55323 NA 32 NA 46 78
9 Olivia 94839 NA 43 NA 40 83
10 Timothy 38209 38 38 NA NA 76
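Using base R: select just the quiz result columns, which you can treat as a matrix. Apply sort in decreasing order to each row, subset the first two elements, then use colSums (apply returns one column per input row, which is why colSums is used rather than rowSums).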
df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))
df
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
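The same idea can be wrapped in a small helper that works for any n. This is only a sketch: `top_n_sum` and the toy `scores` data frame are hypothetical names introduced for illustration, and `head()` is used instead of `[1:2]` so that a row with fewer than n non-missing scores does not produce NA (`sort()` drops NA values by default).

```r
# Toy data: the first two rows of the quiz columns from the question
scores <- data.frame(
  Quiz1 = c(31, 25),
  Quiz2 = c(42, 33),
  Quiz3 = c(36, 36),
  Quiz4 = c(36, 43)
)

# apply() over rows (MARGIN = 1) returns a matrix with one COLUMN per input row
top2 <- apply(scores, 1, function(x) sort(x, decreasing = TRUE)[1:2])
dim(top2)  # 2 kept scores (rows) by 2 students (columns)

# Hypothetical helper: sum of the n largest values in each row
top_n_sum <- function(data, n = 2) {
  apply(data, 1, function(x) sum(head(sort(x, decreasing = TRUE), n)))
}

totals <- top_n_sum(scores, n = 2)
totals
```

Because the helper sums inside the function passed to `apply()`, it returns one total per row directly and no `colSums` step is needed.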
library(tidyverse)
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
df %>%
rowwise() %>%
mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>%
ungroup()
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
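If the quiz columns should not be typed out one by one, `c_across()` with a tidyselect helper can collect them inside `rowwise()`. A minimal sketch, assuming dplyr 1.0.0 or later; the two-row `scores` tibble is a hypothetical subset of the question's data:

```r
library(dplyr)

# Hypothetical two-row subset of the question's data
scores <- tibble(
  Student = c("Aaron", "Timothy"),
  Quiz1 = c(31, 19),
  Quiz2 = c(42, 38),
  Quiz3 = c(36, 38),
  Quiz4 = c(36, 38)
)

result <- scores %>%
  rowwise() %>%
  mutate(
    # c_across() gathers the selected columns of the current row into one vector
    Quiz_Total = sum(sort(c_across(starts_with("Quiz")), decreasing = TRUE)[1:2])
  ) %>%
  ungroup()

result
```

Ties need no special handling here: sorting and taking the first two elements always yields exactly two values, so Timothy's three scores of 38 contribute 76.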
A (somewhat messy) base R solution:
# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)
# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]
# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))
# Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
res <- data.frame(
do.call(
rbind,
lapply(
with(
df,
split(
df,
Idx
)
),
function(x){
# Extract the top 2 quiz vectors: top_2_quizes => named integer vector
top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
# Calculate the quiz columns not used: remainder_quiz_cols => character vector
remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
# Nullify the remaining quizes: x => data.frame
x[, remainder_quiz_cols] <- NA_integer_
# Calculate the resulting data.frame: data.frame => env
transform(
cbind(
x[,non_quiz_colnames],
x[,names(top_2_quizes)],
x[,remainder_quiz_cols]
),
Total = sum(top_2_quizes)
)[,c(non_quiz_colnames, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
}
)
),
row.names = NULL,
stringsAsFactors = FALSE
)
Try this base R approach.
df <- cbind( df[,1:2], t( sapply( seq_along(1:nrow(df)), function(x){
y <- order(df[x,3:6])[1:2]; z <- df[x,3:6]; z[y] <- NA; z } ) ) )
df$Total <- rowSums( matrix( unlist(df[,3:6]), dim(df[,3:6]) ), na.rm=T)
df
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
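The code above overwrites df and relies on the quiz columns sitting in positions 3 to 6. A sketch of the same masking idea that selects columns by name and leaves the input untouched; `mask_below_top_n`, `scores`, and `masked` are hypothetical names, not part of the original answer:

```r
# Hypothetical two-row subset of the question's data
scores <- data.frame(
  Student = c("Aaron", "Grant"),
  Quiz1 = c(31, 27),
  Quiz2 = c(42, 48),
  Quiz3 = c(36, 44),
  Quiz4 = c(36, 43)
)

quiz_cols <- grep("^Quiz", names(scores), value = TRUE)

# Hypothetical helper: keep the n largest values of a vector, set the rest to NA.
# rank(-x, ties.method = "first") ranks from largest to smallest and breaks ties
# by position, so exactly n values survive even when scores are tied.
mask_below_top_n <- function(x, n = 2) {
  x[rank(-x, ties.method = "first") > n] <- NA
  x
}

masked <- scores
# apply() returns one column per input row, so transpose back before assigning
masked[quiz_cols] <- t(apply(scores[quiz_cols], 1, mask_below_top_n))
masked$Total <- rowSums(masked[quiz_cols], na.rm = TRUE)
masked
```

With this tie-breaking rule, Aaron keeps Quiz2 and Quiz3, as in the question's expected output, whereas the code above keeps Quiz2 and Quiz4. The totals are the same either way.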
You don't have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:
df %>%
left_join(pivot_longer(., -c(Student, ID)) %>%
group_by(Student, ID) %>%
summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
# A tibble: 10 x 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 31 42 36 36 78
2 James 87311 25 33 36 43 79
3 Charlotte 61755 41 34 34 39 80
4 Katie 55323 10 22 32 46 78
5 Olivia 94839 35 23 43 40 83
6 Timothy 38209 19 38 38 38 76
7 Grant 34096 27 48 44 43 92
8 Chloe 98432 42 49 42 35 91
9 Judy 19487 15 23 42 41 83
10 Justin 94029 20 30 37 41 78
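In the long format, `slice_max()` is another way to pick the top 2 rows per student, but its default `with_ties = TRUE` can return more than 2 rows when scores are tied. A minimal sketch, assuming dplyr 1.0.0 or later and tidyr; the two-row `scores` tibble is a hypothetical subset of the question's data:

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row subset of the question's data
scores <- tibble(
  Student = c("Aaron", "Timothy"),
  ID = c(30016, 38209),
  Quiz1 = c(31, 19),
  Quiz2 = c(42, 38),
  Quiz3 = c(36, 38),
  Quiz4 = c(36, 38)
)

totals <- scores %>%
  pivot_longer(starts_with("Quiz")) %>%
  group_by(Student, ID) %>%
  # with_ties = FALSE guarantees exactly 2 rows per student, even with ties
  slice_max(value, n = 2, with_ties = FALSE) %>%
  summarise(Total = sum(value), .groups = "drop")

# An explicit `by` avoids the "Joining with ..." message from left_join()
result <- left_join(scores, totals, by = c("Student", "ID"))
result
```

Leaving `with_ties` at its default would keep all three of Aaron's top rows (42, 36, 36) and inflate his total, so the argument matters for this data.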
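As @akrun provided above, collapse is another efficient possibility. radixorder provides an integer ordering vector; applying it twice turns that ordering into ranks, so only the top 2 values in each row are kept while the others are replaced with NA. Then rowSums is used to get the total for each row.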
library(collapse)
ftransform(gvr(df, "Student|ID"),
dapply(
gvr(df, "^Quiz"),
MARGIN = 1,
FUN = function(x)
replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
)) %>%
ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))
Output
# A tibble: 10 × 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
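The nested `radixorder(radixorder(x))` call is the least obvious step. The sketch below illustrates it on Aaron's row, using base R's `order()` as a stand-in for `radixorder()`; that the two agree on this input is an assumption made for illustration, not something taken from the collapse documentation.

```r
# Aaron's quiz scores from the question
x <- c(Quiz1 = 31, Quiz2 = 42, Quiz3 = 36, Quiz4 = 36)

ord <- order(x)    # positions of the values from smallest to largest
rnk <- order(ord)  # rank of each value; ties are broken by position

# Blank out the two lowest-ranked scores, keeping the top 2 of 4
kept <- replace(x, rnk %in% 1:2, NA)
kept
sum(kept, na.rm = TRUE)
```

Because the tie between Quiz3 and Quiz4 is broken by position, Quiz3 receives the lower rank and is blanked, which is why the output above keeps Quiz4 rather than Quiz3 for Aaron. The hard-coded `1:2` also assumes exactly 4 quiz columns.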
Another tidyverse-based solution:
library(tidyverse)
df %>%
rowwise %>%
mutate(ranks=list(rank(c_across(starts_with("Quiz")), ties.method="last"))) %>%
unnest_wider(ranks, names_sep = "") %>%
mutate(across(starts_with("Quiz"), ~ if_else(get(str_c("ranks",
parse_number(cur_column()))) >= 3, .x, NA_real_))) %>%
select(!starts_with("ranks")) %>%
mutate(total = rowSums(.[,-(1:2)], na.rm = T))
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 NA 42 36 NA 78
#> 2 James 87311 NA NA 36 43 79
#> 3 Charlotte 61755 41 NA NA 39 80
#> 4 Katie 55323 NA NA 32 46 78
#> 5 Olivia 94839 NA NA 43 40 83
#> 6 Timothy 38209 NA 38 38 NA 76
#> 7 Grant 34096 NA 48 44 NA 92
#> 8 Chloe 98432 42 49 NA NA 91
#> 9 Judy 19487 NA NA 42 41 83
#> 10 Justin 94029 NA NA 37 41 78