R function to extract top n scores from a dataframe and find their average using `apply` or dplyr `rowwise`
The data frame looks like this:
df = data.frame(name = c("A","B","C"),
                exam1 = c(2,6,4),
                exam2 = c(3,5,6),
                exam3 = c(5,3,3),
                exam4 = c(1,NA,5))
I want to extract the top 3 exam scores for each `name` and find their average using the apply() or dplyr rowwise() function.
Using apply with MARGIN = 1, loop over the rows of the numeric columns, sort each row according to decreasing = TRUE/FALSE, take the head/tail, and return the mean, all in base R:
apply(df[-1], 1, FUN = function(x) mean(head(sort(x, decreasing = TRUE), 3)))
[1] 3.333333 4.666667 5.000000
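The apply() one-liner above can be wrapped into a small reusable function so that the number of scores becomes a parameter. This is only a sketch: the name top_n_mean and its arguments are made up for illustration and do not come from base R or dplyr, and it assumes every column passed in is numeric.

```r
# Hypothetical helper (not from any package): mean of the n largest
# non-missing values in each row of a numeric data frame or matrix.
top_n_mean <- function(scores, n = 3) {
  apply(scores, 1, function(x) {
    # sort() drops NA values by default (na.last = NA)
    mean(head(sort(x, decreasing = TRUE), n))
  })
}

df <- data.frame(name = c("A", "B", "C"),
                 exam1 = c(2, 6, 4),
                 exam2 = c(3, 5, 6),
                 exam3 = c(5, 3, 3),
                 exam4 = c(1, NA, 5))

# Drop the non-numeric name column before calling the helper
top3 <- top_n_mean(df[-1], n = 3)
top2 <- top_n_mean(df[-1], n = 2)
```

With n = 3 this should reproduce the values shown above; changing n gives the mean of a different number of top scores per row.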
Or with dplyr/rowwise:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Mean = mean(head(sort(c_across(where(is.numeric)),
                               decreasing = TRUE), 3))) %>%
  ungroup()
# A tibble: 3 × 6
name exam1 exam2 exam3 exam4 Mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2 3 5 1 3.33
2 B 6 5 3 NA 4.67
3 C 4 6 3 5 5
Here is an alternative approach using pivoting and top_n. This returns only the top 3 scores:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(
    -name,
    names_to = "exam",
    values_to = "value"
  ) %>%
  group_by(name) %>%
  top_n(3, value) %>%
  mutate(mean = mean(value)) %>%
  pivot_wider(
    names_from = exam,
    values_from = value
  )
name mean exam1 exam2 exam3 exam4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 3.33 2 3 5 NA
2 B 4.67 6 5 3 NA
3 C 5 4 6 NA 5
Or:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(
    -name,
    names_to = "exam",
    values_to = "value"
  ) %>%
  group_by(name) %>%
  top_n(3, value) %>%
  summarise(mean = mean(value))
name mean
<chr> <dbl>
1 A 3.33
2 B 4.67
3 C 5
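One caveat worth illustrating: top_n() has been superseded in dplyr by slice_max(), and both keep tied values by default, so a group can end up with more than 3 rows when scores tie at the cutoff. The sketch below uses a made-up row "D" (not part of the original data) to show the difference that with_ties = FALSE makes; treat it as an illustration, not as a change to the answer above.

```r
library(dplyr)
library(tidyr)

# Made-up row with a tie at the cutoff: three scores are equal to 3
tied <- data.frame(name = "D", exam1 = 4, exam2 = 3, exam3 = 3, exam4 = 3)

tied_long <- tied %>%
  pivot_longer(-name, names_to = "exam", values_to = "value")

# Default with_ties = TRUE keeps all four rows, so the mean is 13 / 4
with_ties <- tied_long %>%
  group_by(name) %>%
  slice_max(value, n = 3) %>%
  summarise(mean = mean(value))

# with_ties = FALSE keeps exactly three rows, so the mean is 10 / 3
no_ties <- tied_long %>%
  group_by(name) %>%
  slice_max(value, n = 3, with_ties = FALSE) %>%
  summarise(mean = mean(value))
```

The example data in the question has no ties at the cutoff, so both settings give the same means there.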
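I came back to this question and tried manipulating `df` with basic dplyr operations. This also works, just like some of the really helpful solutions in the earlier posts.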
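library(dplyr)
library(tidyr)
# Reshape to one row per name/exam pair
df_long <- df %>%
  pivot_longer(cols = -name,
               names_to = "exam",
               values_to = "score")
# Sort scores in descending order (NA last), keep the first 3 per name
df_long %>%
  group_by(name) %>%
  arrange(desc(score)) %>%
  slice(1:3) %>%
  summarise(mean_score = mean(score))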
@Paul Smith: adding inner_join(df) is a good idea.
Another possible solution, based on tidyr::pivot_longer and without using rowwise:
library(tidyverse)
df = data.frame(name = c("A","B","C"),
                exam1 = c(2,6,4),
                exam2 = c(3,5,6),
                exam3 = c(5,3,3),
                exam4 = c(1,NA,5))
df %>%
  pivot_longer(cols = 2:5, names_to = "names") %>%
  group_by(name) %>%
  slice_max(value, n=3) %>%
  summarise(mean = mean(value)) %>%
  inner_join(df)
#> Joining, by = "name"
#> # A tibble: 3 × 6
#> name mean exam1 exam2 exam3 exam4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 3.33 2 3 5 1
#> 2 B 4.67 6 5 3 NA
#> 3 C 5 4 6 3 5
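I would take @akrun's solution and add the na.rm argument, just in case you need it in a future approach where the search for the top scores can run into NA results.
The final result would be: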
df <- data.frame(name = c("A","B","C"),
                 exam1 = c(2,6,4),
                 exam2 = c(3,5,6),
                 exam3 = c(5,3,3),
                 exam4 = c(1,NA,5))
results <- apply(df[-1], 1, FUN = function(x) mean(
  head(sort(x, decreasing = TRUE), 3),
  na.rm=TRUE))
names(results) <- df$name
results
The result should look like this:
> results
A B C
3.333333 4.666667 5.000000
>
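A quick base R check of how missing values flow through this pipeline may help decide whether na.rm is needed. sort() removes NA values by default (na.last = NA), so for this data mean() never sees an NA and na.rm = TRUE acts as a safeguard rather than changing the result. The vector b below is row B from the example; the vector sparse is a made-up case added only to show when na.rm would matter.

```r
# Row B from the example data: one missing exam score
b <- c(exam1 = 6, exam2 = 5, exam3 = 3, exam4 = NA)

# sort() drops NA by default, so only three values remain
sorted_b <- sort(b, decreasing = TRUE)

# Same mean with or without na.rm, because no NA reaches mean()
mean_plain <- mean(head(sorted_b, 3))
mean_narm  <- mean(head(sorted_b, 3), na.rm = TRUE)

# If NAs are kept (na.last = TRUE), they can land in the top 3 when a
# row has fewer than 3 scores; in that case na.rm = TRUE matters
sparse <- c(6, NA, NA, NA)
mean_sparse <- mean(head(sort(sparse, decreasing = TRUE, na.last = TRUE), 3),
                    na.rm = TRUE)
```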