简体   繁体   中英

Finding the most common values by factors in R

I have a data frame of radio programms with ~7 million rows, ~130 radio channels and ~130K musicians or bands (and lots of variables). The df looks like this:

| Channel | Performer|
| --------| -------- |
| Radio1  | Rihanna  |
| Radio1  | ACDC     |
| Radio2  | Jay-Z    |
| Radio3  | ACDC     |
| Radio2  | Jay-Z    |
| Radio1  | Rihanna  |
| Radio2  | ACDC     |
| Radio3  | Jay-Z    |
| Radio1  | Rihanna  |
| Radio1  | ACDC     |
| Radio2  | Jay-Z    |
| Radio3  | ACDC     |
| Radio2  | Rihanna  |
| Radio1  | Rihanna  |
| Radio2  | ACDC     |
| Radio1  | Jay-Z    |

I would like to know which are the 3 most famous performers on the radio channels and how many times were played and get a table like this (or a pivot or whatever, just get the information):

|Channel|No1 Performer|No2 Performer|No3 Performer|No1 Plays|No2 Plays|No3 Plays|
|-------|-------------|-------------|-------------|---------|---------|---------|
|Radio1 |Rihanna      |ACDC         |Jay-Z        |4        |2        |1        |
|Radio2 |Jay-Z        |ACDC         |Rihanna      |3        |2        |1        |
|Radio3 |ACDC         |Jay-Z        |-            |2        |1        |0        |

Package dplyr is helpful for these data manipulations.

  • count will summarise the dataframe by collapsing the rows into their counts
  • slice_max will keep only the rows with the top 3 singers per group.
library(dplyr)

df |>
  # Count instances
  count(Channel, Performer) |> 
  group_by(Channel) |>
  # Keep only the top 3 per channel
  slice_max(order_by = n, n = 3)

If you want to reshape it, pivot_wider from tidyr can do that for you.

library(tidyverse)

df %>%
  group_by(Channel, Performer) %>%
  tally() %>%
  slice_max(n, n=3) %>%
  mutate(name =  rank(-n, ties = 'first')) %>%
  pivot_wider(Channel, values_from = c(Performer, n))

  Channel Performer_1 Performer_2 Performer_3   n_1   n_2   n_3
  <chr>   <chr>       <chr>       <chr>       <int> <int> <int>
1 Radio1  Rihanna     ACDC        Jay-Z           4     2     1
2 Radio2  Jay-Z       ACDC        Rihanna         3     2     1
3 Radio3  ACDC        Jay-Z       NA              2     1    NA

Another solution, instead of tally() you can combine n() and rowid()

library(tidyverse)

set.seed(4321)

example = data.frame(
  Channel = sample(c('Radio1','Radio2','Radio3'),20,replace = TRUE),
  Performer = sample(c('Rihanna','ACDC','Jay-Z'),20,replace = TRUE)
)

example
    > example
   Channel Performer
1   Radio1     Jay-Z
2   Radio2     Jay-Z
3   Radio3      ACDC
4   Radio2     Jay-Z
5   Radio1     Jay-Z
6   Radio1   Rihanna
7   Radio2      ACDC
8   Radio2      ACDC
9   Radio3   Rihanna
10  Radio1      ACDC
11  Radio3   Rihanna
12  Radio1   Rihanna
13  Radio2     Jay-Z
14  Radio2     Jay-Z
15  Radio2      ACDC
16  Radio3   Rihanna
17  Radio1     Jay-Z
18  Radio2     Jay-Z
19  Radio3   Rihanna
20  Radio1      ACDC

Code:

example %>% 
  group_by(Channel,Performer) %>% 
  summarise(times = n()) %>% 
  arrange(desc(times),.by_group=TRUE) %>% 
  slice_max(times, n=3) %>%
  mutate(ranking = data.table::rowid(Channel,prefix = 'No'))

# A tibble: 7 x 4
# Groups:   Channel [3]
  Channel Performer times ranking
  <chr>   <chr>     <int> <chr>  
1 Radio1  Jay-Z         3 No1    
2 Radio1  ACDC          2 No2    
3 Radio1  Rihanna       2 No3    
4 Radio2  Jay-Z         5 No1    
5 Radio2  ACDC          3 No2    
6 Radio3  Rihanna       4 No1    
7 Radio3  ACDC          1 No2  

If you want to pivot, add:

pivot_wider(names_from = ranking, values_from = c(Performer, times))

Output:

# A tibble: 3 x 7
# Groups:   Channel [3]
  Channel Performer_No1 Performer_No2 Performer_No3 times_No1 times_No2 times_No3
  <chr>   <chr>         <chr>         <chr>             <int>     <int>     <int>
1 Radio1  Jay-Z         ACDC          Rihanna               3         2         2
2 Radio2  Jay-Z         ACDC          NA                    5         3        NA
3 Radio3  Rihanna       ACDC          NA                    4         1        NA

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM