
Finding the most common values by factors in R

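I have a data frame of radio broadcasts with about 7 million rows, about 130 radio channels, and about 130K musicians or bands (plus many other variables). The df looks like this: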

| Channel | Performer|
| --------| -------- |
| Radio1  | Rihanna  |
| Radio1  | ACDC     |
| Radio2  | Jay-Z    |
| Radio3  | ACDC     |
| Radio2  | Jay-Z    |
| Radio1  | Rihanna  |
| Radio2  | ACDC     |
| Radio3  | Jay-Z    |
| Radio1  | Rihanna  |
| Radio1  | ACDC     |
| Radio2  | Jay-Z    |
| Radio3  | ACDC     |
| Radio2  | Rihanna  |
| Radio1  | Rihanna  |
| Radio2  | ACDC     |
| Radio1  | Jay-Z    |
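
For anyone who wants to reproduce the problem, the sample table can be typed in as a small data frame. This is only a sketch of the 16 rows shown above; the real data has many more rows and columns.

```r
# Sample data copied row by row from the table above
df <- data.frame(
  Channel = c("Radio1", "Radio1", "Radio2", "Radio3", "Radio2", "Radio1",
              "Radio2", "Radio3", "Radio1", "Radio1", "Radio2", "Radio3",
              "Radio2", "Radio1", "Radio2", "Radio1"),
  Performer = c("Rihanna", "ACDC", "Jay-Z", "ACDC", "Jay-Z", "Rihanna",
                "ACDC", "Jay-Z", "Rihanna", "ACDC", "Jay-Z", "ACDC",
                "Rihanna", "Rihanna", "ACDC", "Jay-Z")
)

# Cross-tabulate plays per channel and performer as a quick sanity check
table(df$Channel, df$Performer)
```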

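I would like to know the 3 most played performers on each radio channel, along with how many times each was played, and get a table like this (or a pivot or whatever, I just need the information):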

|Channel|No1 Performer|No2 Performer|No3 Performer|No1 Plays|No2 Plays|No3 Plays|
|-------|-------------|-------------|-------------|---------|---------|---------|
|Radio1 |Rihanna      |ACDC         |Jay-Z        |4        |2        |1        |
|Radio2 |Jay-Z        |ACDC         |Rihanna      |3        |2        |1        |
|Radio3 |ACDC         |Jay-Z        |-            |2        |1        |0        |

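dplyr is helpful for this kind of data manipulation.

  • count() summarises the data frame by collapsing rows into their counts.
  • slice_max() keeps only the rows for the top 3 performers in each group.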
library(dplyr)

df |>
  # Count instances
  count(Channel, Performer) |> 
  group_by(Channel) |>
  # Keep only the top 3 per channel
  slice_max(order_by = n, n = 3)
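
One detail that may matter here: slice_max() keeps ties by default (with_ties = TRUE), so a channel can come back with more than 3 rows when several performers share third place. The sketch below uses a made-up `plays` data frame (hypothetical, not from the question) to show the difference.

```r
library(dplyr)

# Hypothetical counts with a tie for third place on one channel
plays <- data.frame(
  Channel   = rep("Radio1", 4),
  Performer = c("A", "B", "C", "D"),
  n         = c(5L, 3L, 2L, 2L)
)

# Default behaviour: tied rows are all kept, so this returns 4 rows
top_with_ties <- plays |>
  group_by(Channel) |>
  slice_max(order_by = n, n = 3)

# with_ties = FALSE returns at most 3 rows per channel
top_exact <- plays |>
  group_by(Channel) |>
  slice_max(order_by = n, n = 3, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_exact)
```

Which behaviour is preferable depends on whether a tied performer should be dropped arbitrarily or reported alongside the others.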

If you want to reshape it, pivot_wider() from tidyr can do that for you.

library(tidyverse)

df %>%
  group_by(Channel, Performer) %>%
  tally() %>%
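  # with_ties = FALSE keeps at most 3 rows per channel, so the wide table has 3 columns per value
  slice_max(n, n = 3, with_ties = FALSE) %>%
  mutate(name = rank(-n, ties.method = 'first')) %>%
  # id_cols must be passed by name in recent tidyr versions
  pivot_wider(id_cols = Channel, names_from = name, values_from = c(Performer, n))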

  Channel Performer_1 Performer_2 Performer_3   n_1   n_2   n_3
  <chr>   <chr>       <chr>       <chr>       <int> <int> <int>
1 Radio1  Rihanna     ACDC        Jay-Z           4     2     1
2 Radio2  Jay-Z       ACDC        Rihanna         3     2     1
3 Radio3  ACDC        Jay-Z       NA              2     1    NA
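
To get closer to the column layout asked for in the question, one option is to name the count column up front and let pivot_wider() build the headers with names_glue, filling the gaps with values_fill. This is a sketch under assumptions: the names `Plays`, `place`, and `top3_wide` are made up for illustration, the headers use underscores instead of spaces, and the sample data is retyped from the question so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Sample data retyped from the question
df <- data.frame(
  Channel = c("Radio1", "Radio1", "Radio2", "Radio3", "Radio2", "Radio1",
              "Radio2", "Radio3", "Radio1", "Radio1", "Radio2", "Radio3",
              "Radio2", "Radio1", "Radio2", "Radio1"),
  Performer = c("Rihanna", "ACDC", "Jay-Z", "ACDC", "Jay-Z", "Rihanna",
                "ACDC", "Jay-Z", "Rihanna", "ACDC", "Jay-Z", "ACDC",
                "Rihanna", "Rihanna", "ACDC", "Jay-Z")
)

top3_wide <- df |>
  # One row per channel/performer pair, with the count stored in "Plays"
  count(Channel, Performer, name = "Plays") |>
  group_by(Channel) |>
  # At most 3 rows per channel; ties are broken arbitrarily
  slice_max(order_by = Plays, n = 3, with_ties = FALSE) |>
  # Position within the channel: 1 = most played
  mutate(place = row_number(desc(Plays))) |>
  ungroup() |>
  pivot_wider(
    id_cols     = Channel,
    names_from  = place,
    values_from = c(Performer, Plays),
    names_glue  = "No{place}_{.value}",
    # Fill missing third places the way the question's table does
    values_fill = list(Performer = "-", Plays = 0L)
  )

top3_wide
```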

Another solution: you can combine n() and rowid() instead of tally().

library(tidyverse)

set.seed(4321)

example = data.frame(
  Channel = sample(c('Radio1','Radio2','Radio3'),20,replace = TRUE),
  Performer = sample(c('Rihanna','ACDC','Jay-Z'),20,replace = TRUE)
)

example
   Channel Performer
1   Radio1     Jay-Z
2   Radio2     Jay-Z
3   Radio3      ACDC
4   Radio2     Jay-Z
5   Radio1     Jay-Z
6   Radio1   Rihanna
7   Radio2      ACDC
8   Radio2      ACDC
9   Radio3   Rihanna
10  Radio1      ACDC
11  Radio3   Rihanna
12  Radio1   Rihanna
13  Radio2     Jay-Z
14  Radio2     Jay-Z
15  Radio2      ACDC
16  Radio3   Rihanna
17  Radio1     Jay-Z
18  Radio2     Jay-Z
19  Radio3   Rihanna
20  Radio1      ACDC

Code:

example %>% 
  group_by(Channel,Performer) %>% 
  summarise(times = n()) %>% 
  arrange(desc(times),.by_group=TRUE) %>% 
  slice_max(times, n=3) %>%
  mutate(ranking = data.table::rowid(Channel,prefix = 'No'))

# A tibble: 7 x 4
# Groups:   Channel [3]
  Channel Performer times ranking
  <chr>   <chr>     <int> <chr>  
1 Radio1  Jay-Z         3 No1    
2 Radio1  ACDC          2 No2    
3 Radio1  Rihanna       2 No3    
4 Radio2  Jay-Z         5 No1    
5 Radio2  ACDC          3 No2    
6 Radio3  Rihanna       4 No1    
7 Radio3  ACDC          1 No2  

If you want to pivot, add this step to the end of the pipeline:

pivot_wider(names_from = ranking, values_from = c(Performer, times))

Output:

# A tibble: 3 x 7
# Groups:   Channel [3]
  Channel Performer_No1 Performer_No2 Performer_No3 times_No1 times_No2 times_No3
  <chr>   <chr>         <chr>         <chr>             <int>     <int>     <int>
1 Radio1  Jay-Z         ACDC          Rihanna               3         2         2
2 Radio2  Jay-Z         ACDC          NA                    5         3        NA
3 Radio3  Rihanna       ACDC          NA                    4         1        NA

