

Collapsing factor levels for all factor variables in a data frame based on count

I would like to keep only the top two factor levels by frequency and group all other levels into "Others". I tried the following, but it doesn't work:


myfun <- function(x) {
  levels(x)[!levels(x) %in% names(sort(table(x), decreasing = TRUE)[1:2])] <- 'Others'
}

df <- as.data.frame(lapply(df, myfun))

Expected output:

       a b      c
       D A      A
       D A      A
       D A      A
       B A      B
       B A      B
       B B      B
       B B      B
       B B      B
  others B others
  others B others

This might get a bit messy, but here is one approach using base R:

fun1 <- function(x) {
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}

However, the function above renames levels by position, so it only works if the factor levels are already ordered by decreasing frequency. As the OP states in a comment, the correct version re-orders the levels first:

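fun1 <- function(x) {
  # Re-order the levels by decreasing frequency
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  # Keep the two most frequent level names; rename every remaining level to 'others'
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}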
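To check the idea end to end, here is a minimal base R sketch. The sample data frame is hypothetical (the question does not show its input) and was chosen only so that the result matches the expected output above. The helper name `collapse_top_n` and its `n` and `other` parameters are likewise assumptions for illustration. Instead of re-ordering the levels, it renames every level outside the top `n` by name, which also copes with factors that have fewer than `n` levels.

```r
# Hypothetical sample data; the original question does not show its input.
df <- data.frame(
  a = factor(c("D", "D", "D", "B", "B", "B", "B", "B", "C", "A")),
  b = factor(c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")),
  c = factor(c("A", "A", "A", "B", "B", "B", "B", "B", "C", "D"))
)

# Illustrative helper (not from the original answers): keep the n most
# frequent levels of a factor and merge all remaining levels into `other`.
collapse_top_n <- function(x, n = 2, other = "others") {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  lv <- levels(x)
  lv[!lv %in% keep] <- other  # duplicate level names are merged on assignment
  levels(x) <- lv
  x
}

# Apply the helper to factor columns only, leaving other columns untouched.
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], collapse_top_n)

print(df)
```

Assigning into `df[is_fac]` rather than rebuilding the data frame with `as.data.frame()` keeps any non-factor columns as they were.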

This is now easy thanks to fct_lump() from the forcats package:

library(forcats)

fct_lump(df$a, n = 2)

# [1] D     D     D     B     B     B     B     B     Other Other
# Levels: B D Other

The argument n controls how many of the most common levels are preserved; all remaining levels are lumped together into "Other".
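The call above handles a single column. To cover every factor column in the data frame, as the question asks, something like the following sketch could work. It assumes forcats 0.5.0 or later, where `fct_lump_n()` is the count-based variant of `fct_lump()`; with an older version, `fct_lump(x, n = 2, other_level = "others")` should behave the same way. The sample data frame is hypothetical, since the question does not show its input.

```r
library(forcats)

# Hypothetical sample data; the original question does not show its input.
df <- data.frame(
  a = factor(c("D", "D", "D", "B", "B", "B", "B", "B", "C", "A")),
  b = factor(c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")),
  c = factor(c("A", "A", "A", "B", "B", "B", "B", "B", "C", "D"))
)

# Lump every factor column down to its two most frequent levels,
# labelling the merged level "others" instead of the default "Other".
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], fct_lump_n, n = 2, other_level = "others")

print(df)
```

Columns that already have two or fewer levels, such as `b` here, are returned unchanged.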

