Finding the most common values by factors in R
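I have a data frame of radio broadcasts with about 7 million rows, about 130 radio channels, and about 130K musicians or bands (plus many other variables). The df looks like this: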
| Channel | Performer|
| --------| -------- |
| Radio1 | Rihanna |
| Radio1 | ACDC |
| Radio2 | Jay-Z |
| Radio3 | ACDC |
| Radio2 | Jay-Z |
| Radio1 | Rihanna |
| Radio2 | ACDC |
| Radio3 | Jay-Z |
| Radio1 | Rihanna |
| Radio1 | ACDC |
| Radio2 | Jay-Z |
| Radio3 | ACDC |
| Radio2 | Rihanna |
| Radio1 | Rihanna |
| Radio2 | ACDC |
| Radio1 | Jay-Z |
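I want to find the top 3 performers on each radio channel, along with how many times each was played, and get a table like this (or a pivot or whatever, I just need the information):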
|Channel|No1 Performer|No2 Performer|No3 Performer|No1 Plays|No2 Plays|No3 Plays|
|-------|-------------|-------------|-------------|---------|---------|---------|
|Radio1 |Rihanna |ACDC |Jay-Z |4 |2 |1 |
|Radio2 |Jay-Z |ACDC |Rihanna |3 |2 |1 |
|Radio3 |ACDC |Jay-Z |- |2 |1 |0 |
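The `dplyr` package is helpful for these data manipulations. `count` summarises the data frame by collapsing rows into their counts, and `slice_max` keeps only the rows for the top 3 performers in each group.
library(dplyr)
df |>
  # Count instances
  count(Channel, Performer) |>
  group_by(Channel) |>
  # Keep only the top 3 per channel
  slice_max(order_by = n, n = 3)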
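To try this pipeline without the full 7-million-row data, here is a minimal sketch that rebuilds the 16-row sample from the question by hand. The object names `df` and `top3` are only for illustration. One assumption to flag: `slice_max` keeps ties by default, so a channel could return more than 3 rows; the sketch passes `with_ties = FALSE` to cap each channel at 3, which may or may not be what you want.

```r
library(dplyr)

# Sample data copied from the question's table (16 rows)
df <- data.frame(
  Channel = c("Radio1", "Radio1", "Radio2", "Radio3", "Radio2", "Radio1",
              "Radio2", "Radio3", "Radio1", "Radio1", "Radio2", "Radio3",
              "Radio2", "Radio1", "Radio2", "Radio1"),
  Performer = c("Rihanna", "ACDC", "Jay-Z", "ACDC", "Jay-Z", "Rihanna",
                "ACDC", "Jay-Z", "Rihanna", "ACDC", "Jay-Z", "ACDC",
                "Rihanna", "Rihanna", "ACDC", "Jay-Z")
)

top3 <- df |>
  # One row per (Channel, Performer) pair, with the play count in column n
  count(Channel, Performer) |>
  group_by(Channel) |>
  # At most 3 rows per channel, highest counts first; ties are cut off
  slice_max(order_by = n, n = 3, with_ties = FALSE) |>
  ungroup()

top3
```

The result is in long format: one row per channel and performer, which is usually easier to filter or plot than the wide layout.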
If you want to reshape it, `pivot_wider` from `tidyr` can do that for you.
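library(tidyverse)
df %>%
  group_by(Channel, Performer) %>%
  # Count plays; the result stays grouped by Channel
  tally() %>%
  slice_max(n, n = 3) %>%
  # Rank within each channel: 1 = most played
  mutate(name = rank(-n, ties.method = 'first')) %>%
  # Name the arguments: recent tidyr versions reject a positional id column
  pivot_wider(id_cols = Channel, names_from = name, values_from = c(Performer, n))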
  Channel Performer_1 Performer_2 Performer_3   n_1   n_2   n_3
  <chr>   <chr>       <chr>       <chr>       <int> <int> <int>
1 Radio1  Rihanna     ACDC        Jay-Z           4     2     1
2 Radio2  Jay-Z       ACDC        Rihanna         3     2     1
3 Radio3  ACDC        Jay-Z       NA              2     1    NA
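The table requested in the question shows `-` and `0` rather than `NA` for a channel with fewer than 3 performers. A hedged sketch of one way to get that is the `values_fill` argument of `pivot_wider`, which accepts a named list with one fill value per value column. The `counts` table below is typed in by hand from the sample data, and the names `counts`, `wide` and `rank` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Play counts per channel and performer for the question's sample data
counts <- tribble(
  ~Channel, ~Performer, ~n,
  "Radio1", "ACDC",     2L,
  "Radio1", "Jay-Z",    1L,
  "Radio1", "Rihanna",  4L,
  "Radio2", "ACDC",     2L,
  "Radio2", "Jay-Z",    3L,
  "Radio2", "Rihanna",  1L,
  "Radio3", "ACDC",     2L,
  "Radio3", "Jay-Z",    1L
)

wide <- counts |>
  group_by(Channel) |>
  slice_max(n, n = 3, with_ties = FALSE) |>
  # Rank within each channel: 1 = most played
  mutate(rank = row_number(desc(n))) |>
  ungroup() |>
  pivot_wider(
    id_cols = Channel,
    names_from = rank,
    values_from = c(Performer, n),
    # Fill the cells left empty by channels with fewer than 3 performers
    values_fill = list(Performer = "-", n = 0L)
  )

wide
```

Filling with `0` is convenient for display, but it hides the difference between "never played" and "not in the data", so keeping `NA` may be the safer choice for further analysis.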
Another solution (you can use a combination of `n()` and `rowid()` instead of `tally()`):
library(tidyverse)
set.seed(4321)
example <- data.frame(
  Channel = sample(c('Radio1', 'Radio2', 'Radio3'), 20, replace = TRUE),
  Performer = sample(c('Rihanna', 'ACDC', 'Jay-Z'), 20, replace = TRUE)
)
example
Channel Performer
1 Radio1 Jay-Z
2 Radio2 Jay-Z
3 Radio3 ACDC
4 Radio2 Jay-Z
5 Radio1 Jay-Z
6 Radio1 Rihanna
7 Radio2 ACDC
8 Radio2 ACDC
9 Radio3 Rihanna
10 Radio1 ACDC
11 Radio3 Rihanna
12 Radio1 Rihanna
13 Radio2 Jay-Z
14 Radio2 Jay-Z
15 Radio2 ACDC
16 Radio3 Rihanna
17 Radio1 Jay-Z
18 Radio2 Jay-Z
19 Radio3 Rihanna
20 Radio1 ACDC
Code:
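example %>%
  group_by(Channel, Performer) %>%
  # Plays per channel and performer; the result stays grouped by Channel
  summarise(times = n()) %>%
  arrange(desc(times), .by_group = TRUE) %>%
  slice_max(times, n = 3) %>%
  # Label the rows No1, No2, ... within each channel
  mutate(ranking = data.table::rowid(Channel, prefix = 'No'))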
# A tibble: 7 x 4
# Groups:   Channel [3]
  Channel Performer times ranking
  <chr>   <chr>     <int> <chr>
1 Radio1  Jay-Z         3 No1
2 Radio1  ACDC          2 No2
3 Radio1  Rihanna       2 No3
4 Radio2  Jay-Z         5 No1
5 Radio2  ACDC          3 No2
6 Radio3  Rihanna       4 No1
7 Radio3  ACDC          1 No2
If you want to pivot, add:
pivot_wider(names_from = ranking, values_from = c(Performer, times))
Output:
# A tibble: 3 x 7
# Groups:   Channel [3]
  Channel Performer_No1 Performer_No2 Performer_No3 times_No1 times_No2 times_No3
  <chr>   <chr>         <chr>         <chr>             <int>     <int>     <int>
1 Radio1  Jay-Z         ACDC          Rihanna               3         2         2
2 Radio2  Jay-Z         ACDC          NA                    5         3        NA
3 Radio3  Rihanna       ACDC          NA                    4         1        NA
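If you would rather not depend on `data.table` just for `rowid()`, a possible variant (a swapped-in alternative, not the approach shown above) is to build the `No1`, `No2`, ... labels with dplyr's own `row_number()`. The sketch below types in the 20 printed rows by hand so that the result does not depend on the random number generator. The names `ranked` and `wide` are illustrative, and `with_ties = FALSE` is an added assumption that limits each channel to 3 rows.

```r
library(dplyr)
library(tidyr)

# The 20 example rows printed above, entered by hand
example <- data.frame(
  Channel = paste0("Radio", c(1, 2, 3, 2, 1, 1, 2, 2, 3, 1,
                              3, 1, 2, 2, 2, 3, 1, 2, 3, 1)),
  Performer = c("Jay-Z", "Jay-Z", "ACDC", "Jay-Z", "Jay-Z",
                "Rihanna", "ACDC", "ACDC", "Rihanna", "ACDC",
                "Rihanna", "Rihanna", "Jay-Z", "Jay-Z", "ACDC",
                "Rihanna", "Jay-Z", "Jay-Z", "Rihanna", "ACDC")
)

ranked <- example %>%
  # Same counts as group_by() + summarise(times = n())
  count(Channel, Performer, name = "times") %>%
  group_by(Channel) %>%
  slice_max(times, n = 3, with_ties = FALSE) %>%
  # Label rows No1, No2, ... by descending play count within each channel
  mutate(ranking = paste0("No", row_number(desc(times)))) %>%
  ungroup()

wide <- ranked %>%
  pivot_wider(names_from = ranking, values_from = c(Performer, times))

wide
```

Because `ranked` is ungrouped before pivoting, `wide` is a plain tibble without the `# Groups:` header seen in the output above.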