How to change all values of a categorical variable in R except the top ones (by frequency)

Question

I have a data frame in R that looks similar to the one below, with the factor variable "Genre":

|Genre|Listening Time|
|Rock |1:05          |
|Pop  |3:10          |
|RnB  |4:12          |
|Rock |2:34          |
|Pop  |5:01          |
|RnB  |4:01          |
|Rock |1:34          |
|Pop  |2:04          |

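I want to keep the top 15 genres (by count) and rename only the genres that are not in the top 15. Those should be renamed to the word "Other".

In other words: if, for example, the genre "RnB" is not among the top 15 genres, it should be replaced by the word "Other".

The table I want to end up with looks like this: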

|Genre|Listening Time|
|Rock |1:05          |
|Pop  |3:10          |
|Other|4:12          |
|Rock |2:34          |
|Pop  |5:01          |
|Other|4:01          |
|Rock |1:34          |
|Pop  |2:04          |

How would I go about this? Thanks!

Answer 1

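If you want to look into the tidyverse, you can do something like this. I tried to mimic your data frame, but added more rows.

You start with the data, then group_by Genre, then order, then select the top 5.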


library(tidyverse)

set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
      as.character(sample(1:5)),
      ':',
      as.character(sample(0:59))), format = '%H:%M'),format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB'), 120, replace = TRUE)
)


Data %>%
  group_by(Genre ) %>%
  arrange(desc(listen)) %>% 
  select(listen) %>% 
  top_n(5) %>% 
  arrange(Genre)
#> Adding missing grouping variables: `Genre`
#> Selecting by listen
#> # A tibble: 15 x 2
#> # Groups:   Genre [3]
#>    Genre listen
#>    <chr> <chr> 
#>  1 Pop   05:47 
#>  2 Pop   05:47 
#>  3 Pop   05:43 
#>  4 Pop   05:41 
#>  5 Pop   05:28 
#>  6 RnB   05:54 
#>  7 RnB   05:44 
#>  8 RnB   05:43 
#>  9 RnB   05:29 
#> 10 RnB   05:28 
#> 11 Rock  05:54 
#> 12 Rock  05:44 
#> 13 Rock  05:41 
#> 14 Rock  05:29 
#> 15 Rock  05:26

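Sorry if I misunderstood what you want. If you assign the code above to a new data.frame, do an anti_join against the original DF, and then mutate Genre to 'others', it should be what you want, I guess.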

df <- Data %>%
  group_by(Genre ) %>%
  arrange(desc(listen)) %>% 
  select(listen) %>% 
  top_n(5) %>% 
  arrange(Genre) 

# anti_join against the top-5 rows, then assign 'others' to Genre in the remaining rows

anti_join(Data, df) %>% 
  mutate(Genre = 'others')

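Next edit

Hopefully I have understood your question now. You just want to count how often the Genres occur in your data and label the Genres that are not in the top 15 as Others. Maybe I was misled by the data frame you provided, which shows only 3 genres. So I looked some up on Wikipedia, invented a few genres of my own, and used LETTERS to build a DF with a sufficient number of genres.

count(Genre) counts the occurrences of each Genre, and the result is then arranged in descending order. Then I introduce a new column with the row number. You can drop it if you like, since it is only used for the next step, which introduces another column named Top15. I chose to create a new column rather than rename the values in Genre. Top15 gives every genre in row 16 or below the name Others and leaves the rest unchanged.

head(20) prints only the first 20 rows of this DF.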


library(tidyverse)

set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
      as.character(sample(1:5)),
      ':',
      as.character(sample(0:59))), format = '%H:%M'),format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB', 'Opera',
                   'Birthday Songs', 'HipHop',
                   'Chinese Songs', 'Napoli Lovesongs',
                   'Benga', 'Bongo', 'Kawito', 'Noise',
                   'County Blues','Mambo', 'Reggae',
                   LETTERS[1:24]), 300, replace = TRUE)  # R indexes from 1; index 0 is silently dropped
)

Data %>% count(Genre) %>% 
  arrange(desc(n)) %>% 
  mutate(place = row_number()) %>% 
  mutate(Top15 = ifelse(place > 15, 'Others', Genre)) %>% 
  head(20)
#> # A tibble: 20 x 4
#>    Genre            n place Top15       
#>    <chr>        <int> <int> <chr>       
#>  1 N               15     1 N           
#>  2 T               13     2 T           
#>  3 V               13     3 V           
#>  4 K               12     4 K           
#>  5 Rock            11     5 Rock        
#>  6 X               11     6 X           
#>  7 E               10     7 E           
#>  8 W               10     8 W           
#>  9 Benga            9     9 Benga       
#> 10 County Blues     9    10 County Blues
#> 11 G                9    11 G           
#> 12 J                9    12 J           
#> 13 M                9    13 M           
#> 14 Reggae           9    14 Reggae      
#> 15 B                8    15 B           
#> 16 D                8    16 Others      
#> 17 I                8    17 Others      
#> 18 P                8    18 Others      
#> 19 R                8    19 Others      
#> 20 S                8    20 Others

I hope this is what you were looking for.
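The summary table above ranks the genres but does not touch the original rows. As a minimal sketch of how the same idea could be carried back to the data, assuming a hypothetical toy data frame `toy` with 20 made-up genres of distinct frequencies (neither the name nor the data come from the question):

```r
library(dplyr)

# Hypothetical toy data: genre G1 occurs 20 times, G2 19 times, ..., G20 once
toy <- data.frame(Genre = rep(paste0("G", 1:20), times = 20:1),
                  stringsAsFactors = FALSE)

# Names of the 15 most frequent genres
top15 <- toy %>%
  count(Genre, sort = TRUE) %>%   # one row per genre, most frequent first
  head(15) %>%
  pull(Genre)

# Keep top-15 names, replace every other genre with "Other"
toy <- toy %>%
  mutate(Genre = if_else(Genre %in% top15, Genre, "Other"))

count(toy, Genre, sort = TRUE)
```

If Genre is a factor rather than a character column, converting it with as.character() first keeps if_else() from complaining about mixed types.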

Answer 2

Try replacing df with your data.frame to check whether you get the desired output:

df <- data.frame(Genre=sample(letters, 1000, replace=TRUE),
                 ListeningTime=runif(1000, 3, 5))

> head(df)
  Genre ListeningTime
1     j      3.437013
2     n      4.151121
3     p      3.109044
4     z      4.529619
5     h      4.043982
6     i      3.590463

freq <- table(df$Genre)
sorted <- sort(freq, decreasing=TRUE)  # Sorted by frequency of df$Genre

> sorted

 d  x  o  q  r  u  g  i  j  f  a  p  b  e  v  n  w  c  k  m  z  l  h  t  y  s
53 50 46 45 45 42 41 41 40 39 38 38 37 37 37 36 36 35 35 35 35 34 33 33 30 29

not_top_15 <- names(sorted[-1*1:15])  # The Genres not in the top 15
pos <- which(df$Genre %in% not_top_15)  # Their position in df

> head(df[pos, ])  # The original data, without the top 15 Genres
   Genre ListeningTime
2      n      4.151121
4      z      4.529619
5      h      4.043982
7      s      3.521054
16     w      3.528091
18     h      4.588815
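The steps above locate the rows outside the top 15 but stop short of renaming them. A minimal base R sketch of that last step follows, assuming Genre is stored as a character column (for a factor, convert with as.character() first). The seed is arbitrary, so the counts will not match the output shown above.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(Genre = sample(letters, 1000, replace = TRUE),
                 ListeningTime = runif(1000, 3, 5),
                 stringsAsFactors = FALSE)  # keep Genre as character

sorted <- sort(table(df$Genre), decreasing = TRUE)
top_15 <- names(sorted)[1:15]               # the 15 most frequent Genres

# Overwrite every Genre outside the top 15 with "Other"
df$Genre[!(df$Genre %in% top_15)] <- "Other"

table(df$Genre)  # at most 16 distinct values: the top 15 plus "Other"
```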

Answer 3

library(dplyr)

set.seed(123)
compute_listen_time <- function(n.songs) {
  min <- sample(1:15, n.songs, replace = TRUE)
  sec <- sample(0:59, n.songs, replace = TRUE)
  sec <- ifelse(sec >= 10, sec, paste0("0", sec))  # zero-pad single-digit seconds only
  paste0(min, ":", sec)
}



df <- data.frame(
  Genre = sample(c("Rock", "Pop", "RnB", "Rock", "Pop"), 100, replace = TRUE),
  Listen_Time = compute_listen_time(100)
)


df <- add_count(df, Genre, name = "count") %>%
  mutate(
    rank = dense_rank(desc(count)),
    group = ifelse(rank <= 15, Genre, "other")
  )
df

Answer 4

I can think of a data.table solution. Suppose your data.frame is called music, then:

library(data.table)
setDT(music)

other_genres <- music[, .N, by = genre][order(-N)][16:.N, genre]

music[genre %chin% other_genres, genre := "other"]

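The first line of actual code counts the occurrences by genre, sorts them from largest to smallest, selects from the 16th to the last, and assigns the result to a variable named other_genres. The second line checks for the genres in that list and updates their names to "other".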
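For a self-contained run of the two steps just described, here is a sketch on made-up data. The table `music`, its 20 genres, and the guard for fewer than 16 genres are illustrative assumptions, not part of the original answer. The guard is there because 16:.N counts backwards when there are fewer than 16 rows.

```r
library(data.table)

# Hypothetical toy data: genre g1 occurs 20 times, g2 19 times, ..., g20 once
music <- data.table(genre = rep(paste0("g", 1:20), times = 20:1))

# Count per genre, most frequent first
counts <- music[, .N, by = genre][order(-N)]

# Genres ranked 16th or lower; empty when there are 15 genres or fewer
other_genres <- if (nrow(counts) > 15) counts[16:.N, genre] else character(0)

# Update by reference: rename those genres to "other"
music[genre %chin% other_genres, genre := "other"]

music[, .N, by = genre][order(-N)]
```

Note that %chin% works on character columns only; a factor column would need %in% or a conversion first.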

Answer 5

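There is a very neat solution using the forcats package, applied here to the diamonds dataset, which keeps only the top 5 clarity values by name and lumps the rest together as "Other":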

library(dplyr)
library(forcats)
library(ggplot2)  # provides the diamonds dataset

diamonds %>%
  mutate(clarity2 = fct_lump(fct_infreq(clarity), n = 5))

Result:

# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z clarity2
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>   
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 SI2     
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 SI1     
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 VS1     
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 VS2     
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 SI2     
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 VVS2    
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Other   
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 SI1     
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 VS2     
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 VS1     
# … with 53,930 more rows
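Applied to the question's setting, the same idea might look like the sketch below. It assumes a hypothetical toy data frame `toy` with 20 made-up genres, and uses fct_lump_n(), which to my knowledge is the more specific variant of fct_lump() in forcats 0.5.0 and later.

```r
library(forcats)

# Hypothetical toy data: genre G1 occurs 20 times, G2 19 times, ..., G20 once
toy <- data.frame(Genre = rep(paste0("G", 1:20), times = 20:1),
                  stringsAsFactors = FALSE)

# Keep the 15 most frequent genres, lump everything else into "Other"
toy$Genre <- fct_lump_n(toy$Genre, n = 15, other_level = "Other")

table(toy$Genre)
```

The lumping itself already goes by frequency; fct_infreq() in the answer above only reorders the levels. With tied counts at the cutoff, more than 15 levels may be kept, so it is worth checking the result on real data.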

5 answers

Answer 1
1 Accepted 2020-06-17 17:16:27

Answer 2
0 2020-06-17 16:33:17

Answer 3
0 2020-06-17 17:53:27

Answer 4
0 2020-06-17 19:28:08

Answer 5
0 2020-06-18 10:49:04
