
Change factor levels conditional on the level frequency using dplyr

I have got the following data set:

x = c(rep(0,600),rep(1,200), rep(2,100), rep(3,50), rep(4,20), rep(5,10), rep(6,10), rep(7,5), rep(8,5))
y = rbinom(1000,10,.5)
DATA = cbind(x, y)

Using

t_x = table(x)

I obtain:

x
  0   1   2   3   4   5   6   7   8 
600 200 100  50  20  10  10   5   5 

As some of the levels are very rare, I want to aggregate them so that each remaining level accounts for at least 10% of the sample. The desired outcome after calling table on x should be:

x
  0   1   2  "higher" 
600 200 100      100

I have tried to use the following code:

DATA %>% mutate(x = if_else(t_x <= length(x) * .1, factor("higher", levels = c("higher", levels(x))),
            factor(x)
            ))

but if_else does not accept t_x.

I would use cut along the following lines:

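library(dplyr)

# Break points: the values whose share of the sample is at least 10%
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))
DATA %>%
    as.data.frame() %>%   # DATA is a matrix (from cbind), so convert it first
    mutate(x.new = cut(x, breaks = c(-1, brks, max(x)))) %>%
    pull(x.new) %>%
    table()
#(-1,0]  (0,1]  (1,2]  (2,8]
#   600    200    100    100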

The resulting table gives the number of entries per interval, e.g. 600 entries in group (-1, 0], which corresponds to entries with value 0, 200 entries in group (0, 1], corresponding to entries with value 1, and so on.
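If the interval names are not wanted, cut also takes a labels argument, so the groups can be named directly. The following is only a sketch on the example data from the question: it assumes the frequent values are the lowest ones and are consecutive, as they are here, and the name x_new is just illustrative.

```r
# Example data from the question
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
t_x <- table(x)

# Values whose share of the sample is at least 10%
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))

# One label per interval: the frequent values keep their own name,
# everything above the last break point becomes "higher"
x_new <- cut(x, breaks = c(-1, brks, max(x)), labels = c(brks, "higher"))

table(x_new)
```

On this data the counts should be 600, 200, 100 and 100 for the levels 0, 1, 2 and higher.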

Note that the intervals are right-inclusive, i.e. for (x, y] the value y is included while x is not; see ?cut for details.
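The cut approach relies on the frequent values sitting next to each other at the low end of the range. If that cannot be assumed, a recoding by membership might be more robust. The sketch below is one possible way to do it with if_else, not the only one; keep, recoded and x.new are illustrative names, and x is compared as character so that both branches of if_else have the same type.

```r
library(dplyr)

# Example data from the question
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
t_x <- table(x)

# Names of the values that make up at least 10% of the sample
keep <- names(t_x)[as.vector(prop.table(t_x)) >= 0.10]

recoded <- data.frame(x = x) %>%
  mutate(
    # The condition has one element per row, which is what if_else needs
    x.new = if_else(as.character(x) %in% keep, as.character(x), "higher"),
    # Fix the level order so that "higher" comes last
    x.new = factor(x.new, levels = c(keep, "higher"))
  )

table(recoded$x.new)
```

Here the condition is built per row from x itself rather than from the frequency table, so its length matches the column being recoded.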
