I would like to keep only the top 2 factor levels based on the frequency and group all other factors into Other. I tried this but it doesnt help.
df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))),
b=as.factor(c(rep('A',5),rep('B',5))),
c=as.factor(c(rep('A',3),rep('B',5),rep('C',2))))
myfun=function(x){
if(is.factor(x)){
levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'
}
}
df=as.data.frame(lapply(df, myfun))
Expected Output
a b c
D A A
D A A
D A A
B A B
B A B
B B B
B B B
B B B
others B others
others B others
This might get a bit messy, but here is one approach via base R,
fun1 <- function(x){levels(x) <-
c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x)}
However the above function will need to first be re-ordered and as OP states in comment, the correct one will be,
fun1 <- function(x){ x=factor(x,
levels = names(sort(table(x), decreasing = TRUE)));
levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x) }
This is now easy thanks to fct_lump()
from the forcats
package.
fct_lump(df$a, n = 2)
# [1] D D D B B B B B Other Other
# Levels: B D Other
The argument n
controls the number of most common levels to be preserved, lumping together the others.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.