
How to change all values except for the top values (by frequency) from a categorical variable in R

I have a data frame in R which looks similar to the one below, with the factor variable "Genre":

|Genre|Listening Time|
|Rock |1:05          |
|Pop  |3:10          |
|RnB  |4:12          |
|Rock |2:34          |
|Pop  |5:01          |
|RnB  |4:01          |
|Rock |1:34          |
|Pop  |2:04          |

I want to leave the top 15 genres (by count) as they are and rename only the genres that are not among the top 15. Those should be renamed to "Other".

In other words: if, for example, the Genre "RnB" is not among the top 15 Genres, then it should be replaced by the word "Other".

The table I would like to get would then look like this:

|Genre|Listening Time|
|Rock |1:05          |
|Pop  |3:10          |
|Other|4:12          |
|Rock |2:34          |
|Pop  |5:01          |
|Other|4:01          |
|Rock |1:34          |
|Pop  |2:04          |

How would I approach this? Thank you!

If you want to look into tidyverse, you could do something like this. I have tried to mimic your data frame but added some more rows.

You start with the data > group_by Genre > order > choose the top 5


library(tidyverse)

set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
      as.character(sample(1:5)),
      ':',
      as.character(sample(0:59))), format = '%H:%M'),format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB'), 120, replace = TRUE)
)


Data %>%
  group_by(Genre ) %>%
  arrange(desc(listen)) %>% 
  select(listen) %>% 
  top_n(5) %>% 
  arrange(Genre)
#> Adding missing grouping variables: `Genre`
#> Selecting by listen
#> # A tibble: 15 x 2
#> # Groups:   Genre [3]
#>    Genre listen
#>    <chr> <chr> 
#>  1 Pop   05:47 
#>  2 Pop   05:47 
#>  3 Pop   05:43 
#>  4 Pop   05:41 
#>  5 Pop   05:28 
#>  6 RnB   05:54 
#>  7 RnB   05:44 
#>  8 RnB   05:43 
#>  9 RnB   05:29 
#> 10 RnB   05:28 
#> 11 Rock  05:54 
#> 12 Rock  05:44 
#> 13 Rock  05:41 
#> 14 Rock  05:29 
#> 15 Rock  05:26

Sorry if I have misunderstood what you wanted. If you assign the result of the code to a new data frame, anti_join it against the original data frame, and then mutate Genre to "others", that should be what you want, I guess.

df <- Data %>%
  group_by(Genre ) %>%
  arrange(desc(listen)) %>% 
  select(listen) %>% 
  top_n(5) %>% 
  arrange(Genre) 

# make an anti_join and assign 'other' to Genre

anti_join(Data, df) %>% 
  mutate(Genre = 'others')

Next edit

Hopefully I have now understood your question. You just want to count how often each Genre occurs in your data and give those that do not belong to the top 15 the name Others. Maybe I was misled by the data frame you offered, which shows only 3 Genres. So I looked some up on Wikipedia, invented a few Genres of my own, and used LETTERS to build a data frame with a sufficient number of Genres.

count(Genre) counts the occurrences of each Genre, and the result is then arranged in descending order. I then introduced a new column, place, holding the row numbers. You can delete it if you want, as it is only there for the next step, which adds another column named Top15. I chose to make a new column instead of renaming the values in Genre. Top15 gives every Genre in row 16 or later the name Others and keeps the rest unchanged.

head(20) just prints the first 20 rows of this data frame.


library(tidyverse)

set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
      as.character(sample(1:5)),
      ':',
      as.character(sample(0:59))), format = '%H:%M'),format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB', 'Opera',
                   'Birthday Songs', 'HipHop',
                   'Chinese Songs', 'Napoli Lovesongs',
                   'Benga', 'Bongo', 'Kawito', 'Noise',
                   'County Blues','Mambo', 'Reggae',
                   LETTERS[1:24]), 300, replace = TRUE)
)

Data %>% count(Genre) %>% 
  arrange(desc(n)) %>% 
  mutate(place = row_number()) %>% 
  mutate(Top15 = ifelse(place > 15, 'Others', Genre)) %>% 
  head(20)
#> # A tibble: 20 x 4
#>    Genre            n place Top15       
#>    <chr>        <int> <int> <chr>       
#>  1 N               15     1 N           
#>  2 T               13     2 T           
#>  3 V               13     3 V           
#>  4 K               12     4 K           
#>  5 Rock            11     5 Rock        
#>  6 X               11     6 X           
#>  7 E               10     7 E           
#>  8 W               10     8 W           
#>  9 Benga            9     9 Benga       
#> 10 County Blues     9    10 County Blues
#> 11 G                9    11 G           
#> 12 J                9    12 J           
#> 13 M                9    13 M           
#> 14 Reggae           9    14 Reggae      
#> 15 B                8    15 B           
#> 16 D                8    16 Others      
#> 17 I                8    17 Others      
#> 18 P                8    18 Others      
#> 19 R                8    19 Others      
#> 20 S                8    20 Others
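
The table above is one row per Genre, so it still has to be attached to the original rows. A minimal sketch of that step follows, on made-up data with a cut-off of 2 instead of 15. The names music, n_keep, lookup and result are illustrative assumptions, not part of the answer above.

```r
library(dplyr)

# Hypothetical toy data: four genres, keep only the two most frequent
music <- data.frame(
  Genre = c("Rock", "Rock", "Rock", "Pop", "Pop",
            "RnB", "Jazz", "Jazz", "Jazz", "Jazz"),
  stringsAsFactors = FALSE
)
n_keep <- 2  # the question uses 15

# One row per Genre, most frequent first, with the replacement label
lookup <- music %>%
  count(Genre, sort = TRUE) %>%
  mutate(Top = ifelse(row_number() > n_keep, "Others", Genre)) %>%
  select(Genre, Top)

# Attach the label to every original row
result <- music %>%
  left_join(lookup, by = "Genre")
```

Keeping the label in a separate column (Top) leaves the original Genre values available for later checks.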

I hope this was what you were looking for.

Try replacing df with your data.frame to check whether you get the desired output:

df <- data.frame(Genre=sample(letters, 1000, replace=TRUE),
                 ListeningTime=runif(1000, 3, 5))
head(df)
#>   Genre ListeningTime
#> 1     j      3.437013
#> 2     n      4.151121
#> 3     p      3.109044
#> 4     z      4.529619
#> 5     h      4.043982
#> 6     i      3.590463
freq <- table(df$Genre)
sorted <- sort(freq, decreasing=TRUE)  # Sorted by frequency of df$Genre
sorted
#>  d  x  o  q  r  u  g  i  j  f  a  p  b  e  v  n  w  c  k  m  z  l  h  t  y  s
#> 53 50 46 45 45 42 41 41 40 39 38 38 37 37 37 36 36 35 35 35 35 34 33 33 30 29
not_top_15 <- names(sorted[-1*1:15])  # The Genres not in the top 15
pos <- which(df$Genre %in% not_top_15)  # Their position in df
head(df[pos, ])  # The original data, without the top 15 Genres
#>    Genre ListeningTime
#> 2      n      4.151121
#> 4      z      4.529619
#> 5      h      4.043982
#> 7      s      3.521054
#> 16     w      3.528091
#> 18     h      4.588815

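
The steps above locate the rows outside the top 15 but stop before renaming them. A minimal base R sketch of the remaining step follows, assuming Genre may be stored as a factor and is therefore converted to character first. The name top_15 and the seed are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
df <- data.frame(Genre = sample(letters, 1000, replace = TRUE),
                 ListeningTime = runif(1000, 3, 5),
                 stringsAsFactors = FALSE)

# Genres sorted by frequency, most frequent first
sorted <- sort(table(df$Genre), decreasing = TRUE)

# Names of the (at most) 15 most frequent genres
top_15 <- names(sorted)[seq_len(min(15, length(sorted)))]

# Assigning a new label to a factor would produce NA, so work on character
df$Genre <- as.character(df$Genre)
df$Genre[!(df$Genre %in% top_15)] <- "Other"

# Frequency table after renaming: the top genres plus "Other"
table(df$Genre)
```

Use `df$Genre <- factor(df$Genre)` afterwards if a factor is needed again.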
library(dplyr)

set.seed(123)
compute_listen_time <- function(n.songs) {
  min <- sample(1:15, n.songs, replace = TRUE)
  sec <- sample(0:59, n.songs, replace = TRUE)
  sec <- ifelse(sec >= 10, sec, paste0("0", sec))  # zero-pad single-digit seconds
  paste0(min, ":", sec)
}



df <- data.frame(
  Genre = sample(c("Rock", "Pop", "RnB", "Rock", "Pop"), 100, replace = TRUE),
  Listen_Time = compute_listen_time(100)
)


df <- add_count(df, Genre, name = "count") %>%
  mutate(
    rank = dense_rank(desc(count)),
    group = ifelse(rank <= 15, Genre, "other")
  )
df
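
One thing to be aware of with dense_rank(): genres with equal counts share a rank, so more than 15 genres can end up being kept. The sketch below illustrates this on made-up data with a cut-off of 2, next to a variant that keeps exactly n_keep genres. The names toy, n_keep, by_dense, top and strict are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data: B and C are tied on count
toy <- data.frame(Genre = c("A", "A", "A", "B", "B", "C", "C", "D"),
                  stringsAsFactors = FALSE)
n_keep <- 2  # the question uses 15

# dense_rank(): B and C both get rank 2, so three genres survive
by_dense <- toy %>%
  add_count(Genre, name = "count") %>%
  mutate(group = ifelse(dense_rank(desc(count)) <= n_keep, Genre, "other"))

# Strict cut-off: take the first n_keep rows of the sorted count table
top <- toy %>%
  count(Genre, sort = TRUE) %>%
  slice_head(n = n_keep) %>%
  pull(Genre)

strict <- toy %>%
  mutate(group = ifelse(Genre %in% top, Genre, "other"))
```

Which behaviour is preferable depends on whether tied genres should be kept together or the limit should be exact.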

I can think of a data.table solution. Let's assume your data.frame is called music, then:

library(data.table)
setDT(music)

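# seq_len(.N) > 15 keeps rows 16 onward and is empty when there are 15 or fewer genres
# (16:.N would count backwards in that case)
other_genres <- music[, .N, by = genre][order(-N)][seq_len(.N) > 15, genre]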

music[genre %chin% other_genres, genre := "other"]

The first line of effective code counts the appearances by genre, sorts them from largest to smallest, and selects from the 16th down to the last one, assigning the result to a variable called other_genres. The second line checks which genres are in that list and updates their name to "other".
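
A small self-contained sketch of this approach on made-up data, with a cut-off of 2 instead of 15. It assumes genre is a character column, since %chin% only works on character vectors. The toy table and n_keep are illustrative assumptions.

```r
library(data.table)

# Hypothetical toy data with a character genre column
music <- data.table(
  genre = c("Rock", "Rock", "Rock", "Pop", "Pop", "RnB", "Jazz"),
  minutes = c(1, 3, 4, 2, 5, 4, 1)
)
n_keep <- 2  # the question uses 15

# Count per genre, sort descending, keep everything after the first n_keep rows
other_genres <- music[, .N, by = genre][order(-N)][seq_len(.N) > n_keep, genre]

# Update by reference: only the rows of the less frequent genres change
music[genre %chin% other_genres, genre := "other"]

# With a cut-off larger than the number of genres, nothing is selected
none_left <- music[, .N, by = genre][order(-N)][seq_len(.N) > 15, genre]
```

If genre is a factor in your data, converting it first with `music[, genre := as.character(genre)]` avoids problems with both %chin% and the assignment.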

There is a pretty neat solution with the forcats package, applied here to the diamonds dataset to keep only the top 5 clarity values and bundle the rest as "Other":

library(dplyr)
library(forcats)
library(ggplot2)  # provides the diamonds dataset

diamonds %>%
  mutate(clarity2 = fct_lump(fct_infreq(clarity), n = 5))

Result:

# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z clarity2
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>   
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 SI2     
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 SI1     
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 VS1     
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 VS2     
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 SI2     
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 VVS2    
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Other   
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 SI1     
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 VS2     
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 VS1     
# … with 53,930 more rows
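
Applied to the Genre column from the question, the same idea might look like the sketch below. It uses fct_lump_n(), which needs forcats 0.5.0 or later, and a cut-off of 2 on made-up data where the question would use n = 15. The vector genre and the name lumped are illustrative assumptions.

```r
library(forcats)

# Hypothetical toy data; a character vector is accepted as well as a factor
genre <- c("Rock", "Rock", "Rock", "Pop", "Pop", "RnB", "Jazz")

# Keep the 2 most frequent levels and lump the rest into "Other"
lumped <- fct_lump_n(genre, n = 2, other_level = "Other")

as.character(lumped)
```

fct_infreq() is not needed for the lumping itself; it only reorders the levels by frequency. How ties at the cut-off are handled is controlled by the ties.method argument of fct_lump_n().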
