How to change all values except for the top values (by frequency) of a categorical variable in R
I have a data frame in R which looks similar to the one below, with the factor variable "Genre":
|Genre|Listening Time|
|Rock |1:05 |
|Pop |3:10 |
|RnB |4:12 |
|Rock |2:34 |
|Pop |5:01 |
|RnB |4:01 |
|Rock |1:34 |
|Pop |2:04 |
I want to leave the top 15 genres (by count) as they are and rename only the genres that are not among the top 15. Those should be renamed to the word "Other".
In other words, if for example the Genre "RnB" is not among the top 15 genres, then it should be replaced by the word "Other".
The table I would like to get would then look like this:
|Genre|Listening Time|
|Rock |1:05 |
|Pop |3:10 |
|Other|4:12 |
|Rock |2:34 |
|Pop |5:01 |
|Other|4:01 |
|Rock |1:34 |
|Pop |2:04 |
How would I approach this? Thank you!
If you want to look into tidyverse, you may do something like this. I have tried to mimic your data frame but added some more rows.
You start with the data > group_by Genre > order > choose the top 5 per group.
library(tidyverse)
set.seed(1)
# Simulated data: random "H:MM" listening times and a random Genre per row
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
    as.character(sample(1:5)),
    ':',
    as.character(sample(0:59))), format = '%H:%M'), format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB'), 120, replace = TRUE)
)
# Five longest listening times within each Genre
Data %>%
  group_by(Genre) %>%
  arrange(desc(listen)) %>%
  select(listen) %>%
  top_n(5) %>%
  arrange(Genre)
#> Adding missing grouping variables: `Genre`
#> Selecting by listen
#> # A tibble: 15 x 2
#> # Groups: Genre [3]
#> Genre listen
#> <chr> <chr>
#> 1 Pop 05:47
#> 2 Pop 05:47
#> 3 Pop 05:43
#> 4 Pop 05:41
#> 5 Pop 05:28
#> 6 RnB 05:54
#> 7 RnB 05:44
#> 8 RnB 05:43
#> 9 RnB 05:29
#> 10 RnB 05:28
#> 11 Rock 05:54
#> 12 Rock 05:44
#> 13 Rock 05:41
#> 14 Rock 05:29
#> 15 Rock 05:26
Sorry if I have misunderstood what you wanted. If you assign the result of the code to a new data.frame, anti_join it against the original data frame and then set Genre to 'others', it should be what you want, I guess.
df <- Data %>%
  group_by(Genre) %>%
  arrange(desc(listen)) %>%
  select(listen) %>%
  top_n(5) %>%
  arrange(Genre)
# make an anti_join and assign 'others' to Genre
anti_join(Data, df) %>%
  mutate(Genre = 'others')
Next Edit
Hopefully I have now understood your question. You just want to count how often each Genre occurs in your data and give those that do not belong to the top 15 the name Others. Maybe I was misled by the data frame you offered, which shows only 3 genres. So I looked on Wikipedia and added a few, invented some genres of my own, and used LETTERS to build a data frame with a sufficient number of genres.
With count(Genre) the occurrences of each Genre are counted and then arranged in descending order. I have then introduced a new column with the row numbers. You can delete it if you want, as it is only there for the next step, which is introducing another column named Top15 (I have chosen to make a new column instead of renaming the values in Genre). Every Genre in row 16 or later gets the name Others in that column, and the rest are kept unchanged. head(20) just prints the first 20 rows of this data frame.
library(tidyverse)
set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
    as.character(sample(1:5)),
    ':',
    as.character(sample(0:59))), format = '%H:%M'), format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB', 'Opera',
                   'Birthday Songs', 'HipHop',
                   'Chinese Songs', 'Napoli Lovesongs',
                   'Benga', 'Bongo', 'Kawito', 'Noise',
                   'County Blues', 'Mambo', 'Reggae',
                   LETTERS[1:24]), 300, replace = TRUE)  # R indexes from 1; index 0 selects nothing
)
Data %>% count(Genre) %>%
  arrange(desc(n)) %>%                                     # most frequent Genre first
  mutate(place = row_number()) %>%                         # rank by frequency
  mutate(Top15 = ifelse(place > 15, 'Others', Genre)) %>%  # relabel rank 16 and below
  head(20)
#> # A tibble: 20 x 4
#> Genre n place Top15
#> <chr> <int> <int> <chr>
#> 1 N 15 1 N
#> 2 T 13 2 T
#> 3 V 13 3 V
#> 4 K 12 4 K
#> 5 Rock 11 5 Rock
#> 6 X 11 6 X
#> 7 E 10 7 E
#> 8 W 10 8 W
#> 9 Benga 9 9 Benga
#> 10 County Blues 9 10 County Blues
#> 11 G 9 11 G
#> 12 J 9 12 J
#> 13 M 9 13 M
#> 14 Reggae 9 14 Reggae
#> 15 B 8 15 B
#> 16 D 8 16 Others
#> 17 I 8 17 Others
#> 18 P 8 18 Others
#> 19 R 8 19 Others
#> 20 S 8 20 Others
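The table above has one row per genre, not one row per song. To carry the labels back to the original rows, one possible sketch follows. The toy `songs` data frame and the cutoff `n_keep` are made up for illustration (the question would use 15), and ties at the cutoff are broken by whatever order `count(sort = TRUE)` happens to return.

```r
library(dplyr)

# Hypothetical toy data: 4 genres, of which only the 2 most frequent are kept
songs <- data.frame(
  Genre = c(rep("Rock", 5), rep("Pop", 4), rep("RnB", 2), "Opera"),
  stringsAsFactors = FALSE
)
n_keep <- 2  # assumption for the toy data; 15 in the question

# Names of the n_keep most frequent genres
top_genres <- songs %>%
  count(Genre, sort = TRUE) %>%
  slice_head(n = n_keep) %>%
  pull(Genre)

# Relabel every row whose genre is not among them
result <- songs %>%
  mutate(Genre = if_else(Genre %in% top_genres, Genre, "Others"))

table(result$Genre)
```

If Genre is a factor in your data, converting it with as.character() first keeps if_else() from complaining about mixed types.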
I hope this was what you were looking for.
Try replacing df with your data.frame to check if you get the desired output:
df <- data.frame(Genre=sample(letters, 1000, replace=TRUE),
                 ListeningTime=runif(1000, 3, 5))
> head(df)
  Genre ListeningTime
1     j      3.437013
2     n      4.151121
3     p      3.109044
4     z      4.529619
5     h      4.043982
6     i      3.590463
freq <- table(df$Genre)
sorted <- sort(freq, decreasing=TRUE) # Sorted by frequency of df$Genre
> sorted

 d  x  o  q  r  u  g  i  j  f  a  p  b  e  v  n  w  c  k  m  z  l  h  t  y  s
53 50 46 45 45 42 41 41 40 39 38 38 37 37 37 36 36 35 35 35 35 34 33 33 30 29
not_top_15 <- names(sorted[-(1:15)]) # The Genres not in the top 15
pos <- which(df$Genre %in% not_top_15) # Their position in df
> head(df[pos, ]) # The rows whose Genre is not in the top 15
   Genre ListeningTime
2      n      4.151121
4      z      4.529619
5      h      4.043982
7      s      3.521054
16     w      3.528091
18     h      4.588815
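The snippet above only locates the rows outside the top 15; it stops short of renaming them. A minimal sketch of that last step in base R follows, using freshly simulated data (the seed is arbitrary, so the genres and counts will differ from the output shown above).

```r
set.seed(42)  # arbitrary seed, chosen only to make the sketch reproducible
df <- data.frame(Genre = sample(letters, 1000, replace = TRUE),
                 ListeningTime = runif(1000, 3, 5),
                 stringsAsFactors = FALSE)

sorted <- sort(table(df$Genre), decreasing = TRUE)
not_top_15 <- names(sorted)[-(1:15)]

# as.character() guards against Genre being a factor: assigning a label
# that is not an existing factor level would produce NA with a warning
df$Genre <- as.character(df$Genre)
df$Genre[df$Genre %in% not_top_15] <- "Other"

length(unique(df$Genre))  # the 15 kept genres plus "Other"
```

If you need a factor again afterwards, df$Genre <- factor(df$Genre) rebuilds it with "Other" as a regular level.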
library(dplyr)
set.seed(123)
# Build random "M:SS" listening times
compute_listen_time <- function(n.songs) {
  min <- sample(1:15, n.songs, replace = TRUE)
  sec <- sample(0:59, n.songs, replace = TRUE)
  sec <- sprintf("%02d", sec)  # zero-pad seconds below 10 (a `sec > 10` test would turn 10 into "010")
  paste0(min, ":", sec)
}
df <- data.frame(
  Genre = sample(c("Rock", "Pop", "RnB", "Rock", "Pop"), 100, replace = TRUE),
  Listen_Time = compute_listen_time(100)
)
# Attach each genre's count to every row, rank genres by count, relabel the rest
df <- add_count(df, Genre, name = "count") %>%
  mutate(
    rank = dense_rank(desc(count)),
    group = ifelse(rank <= 15, Genre, "other")
  )
df
I can think of a data.table solution. Let's assume your data.frame is called music, then:
library(data.table)
setDT(music)
other_genres <- music[, .N, by = genre][order(-N)][16:.N, genre]
music[genre %chin% other_genres, genre := "other"]
The first line of effective code counts the appearances by genre, sorts them from largest to smallest and selects from the 16th down to the last one, assigning the result to a variable called other_genres. The second line checks which genres are in that list and updates their name to "other".
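One edge case worth noting: if the table holds 15 or fewer distinct genres, 16:.N counts downward (for example 16:3) and selects unintended rows. A hedged variant that avoids this is sketched below. The toy `music` table and the cutoff `n_keep` are assumptions for illustration, and `genre` is assumed to be a character column, which %chin% requires.

```r
library(data.table)

# Hypothetical toy data: 4 genres, of which only the 2 most frequent are kept
music <- data.table(
  genre = c(rep("Rock", 5), rep("Pop", 4), rep("RnB", 2), "Opera")
)
n_keep <- 2  # assumption for the toy data; 15 in the question

# Count rows per genre, most frequent first
counts <- music[, .N, by = genre][order(-N)]

# Dropping the first n_keep names with a negative index yields an empty
# vector (instead of a reversed range) when there are n_keep or fewer genres
other_genres <- counts$genre[-seq_len(n_keep)]

# Update by reference: rename every genre outside the top n_keep
music[genre %chin% other_genres, genre := "other"]

music[, .N, by = genre]
```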
There is a pretty neat solution with the forcats package, applied here to the diamonds dataset to keep only the top 5 clarity values and bundle the rest as "Other":
library(dplyr)
library(forcats)
library(ggplot2)  # provides the diamonds dataset
diamonds %>%
  mutate(clarity2 = fct_lump(fct_infreq(clarity), n = 5))
Result:
# A tibble: 53,940 x 11
carat cut color clarity depth table price x y z clarity2
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 SI2
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 SI1
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 VS1
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 VS2
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 SI2
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 VVS2
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Other
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 SI1
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 VS2
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 VS1
# … with 53,930 more rows
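Applied to the Genre column from the question, the same idea might look like the sketch below. It uses fct_lump_n(), which is available in forcats 0.5.0 and later. The toy `genre` vector and the cutoff of 2 are made up for illustration (the question would use n = 15). When several levels tie at the cutoff, fct_lump_n() may keep more than n of them, depending on its ties.method argument.

```r
library(forcats)

# Hypothetical toy data: 4 genres, of which only the 2 most frequent are kept
genre <- c(rep("Rock", 5), rep("Pop", 4), rep("RnB", 2), "Opera")

# Keep the 2 most frequent levels and lump everything else into "Other"
lumped <- fct_lump_n(genre, n = 2, other_level = "Other")

table(lumped)
```

The result is a factor, so it can be assigned straight back to a factor column such as Genre.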