Collapse factor levels for all factor variables in a data frame based on count

Question

I want to keep only the top 2 factor levels by frequency and group all other levels into "Others". I tried the following, but it did not help.

df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))), 
              b=as.factor(c(rep('A',5),rep('B',5))), 
              c=as.factor(c(rep('A',3),rep('B',5),rep('C',2)))) 

myfun=function(x){
    if(is.factor(x)){
        levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'  
    }
}

df=as.data.frame(lapply(df, myfun))

Expected output

       a b      c
       D A      A
       D A      A
       D A      A
       B A      B
       B A      B
       B B      B
       B B      B
       B B      B
  others B others
  others B others

Answer 1

This may be a bit messy, but here is one way to do it in base R,

fun1 <- function(x){levels(x) <- 
                    c(names(sort(table(x), decreasing = TRUE)[1:2]), 
                    rep('others', length(levels(x))-2)); 
                    return(x)}

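However, the function above only works if the levels are first reordered by frequency. As the OP pointed out in the comments, the correct function should be,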

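fun1 <- function(x) {
  # Reorder the levels by decreasing frequency so the top two come first
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  # Keep the two most frequent level names; rename every remaining level to 'others'
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}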
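
To cover the "all factor variables" part of the question, the corrected function can be applied column by column. The following is a minimal sketch, assuming the sample df from the question and that every factor column has at least two levels. The helper name is_fac is introduced here for illustration only.

```r
# Sample data from the question
df <- data.frame(
  a = as.factor(c(rep('D', 3), rep('B', 5), rep('C', 2))),
  b = as.factor(c(rep('A', 5), rep('B', 5))),
  c = as.factor(c(rep('A', 3), rep('B', 5), rep('C', 2)))
)

# Corrected function from the answer, repeated so the example is self-contained
fun1 <- function(x) {
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  x
}

# is_fac (hypothetical name) flags which columns are factors
is_fac <- vapply(df, is.factor, logical(1))

# Rewrite only the factor columns; any other columns are left untouched
df[is_fac] <- lapply(df[is_fac], fun1)

print(df)
```

Assigning into df[is_fac] keeps the object a data frame, so there is no need to wrap the result in as.data.frame().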

Answer 2

This is now easy thanks to fct_lump() from the forcats package.

library(forcats)  # provides fct_lump()
fct_lump(df$a, n = 2)

# [1] D     D     D     B     B     B     B     B     Other Other
# Levels: B D Other

The argument n controls how many of the most common levels are kept; all other levels are lumped together.
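
The same call can be extended to every factor column of the data frame. This is a sketch only, assuming the forcats package is installed and using the sample df from the question. The other_level argument of fct_lump() is used to get the "others" label from the expected output, and is_fac is a hypothetical helper name.

```r
library(forcats)

# Sample data from the question
df <- data.frame(
  a = as.factor(c(rep('D', 3), rep('B', 5), rep('C', 2))),
  b = as.factor(c(rep('A', 5), rep('B', 5))),
  c = as.factor(c(rep('A', 3), rep('B', 5), rep('C', 2)))
)

# is_fac (hypothetical name) flags which columns are factors
is_fac <- vapply(df, is.factor, logical(1))

# Keep the 2 most frequent levels per column and lump the rest into "others"
df[is_fac] <- lapply(df[is_fac], fct_lump, n = 2, other_level = "others")

print(df)
```

Newer forcats releases also offer fct_lump_n(), which covers this case with the same n and other_level arguments.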

Solution 1
2 Accepted 2016-08-05 12:31:05

Solution 2
2 2016-10-09 19:27:01
