I have the following data set:
x = c(rep(0,600),rep(1,200), rep(2,100), rep(3,50), rep(4,20), rep(5,10), rep(6,10), rep(7,5), rep(8,5))
y = rbinom(1000,10,.5)
DATA = cbind(x, y)
Using
t_x = table(x)
I obtain:
x
0 1 2 3 4 5 6 7 8
600 200 100 50 20 10 10 5 5
As some of the levels are very rare, I want to aggregate them so that each level accounts for at least 10% of the sample. The desired outcome after calling table on x should be:
x
0 1 2 "higher"
600 200 100 100
I have tried to use the following code:
DATA %>% mutate(x = if_else(t_x <= length(x) * .1, factor("higher", levels = c("higher", levels(x))),
factor(x)
))
but if_else does not accept t_x.
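A likely reason for the failure, sketched below under the assumption that the condition is the problem: t_x holds one count per level, while if_else needs one TRUE/FALSE per observation. Indexing the table by level name gives a per-observation value that can serve as the condition. The names share_per_obs, keep and x_new are illustrative only, and the sketch uses base R ifelse rather than dplyr.

```r
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
t_x <- table(x)

length(t_x)  # 9: one count per level
length(x)    # 1000: one entry per observation

# Index the table by level name to get the share of each observation's level
share_per_obs <- as.vector(t_x[as.character(x)]) / length(x)
keep <- share_per_obs >= 0.10

# Keep frequent levels as they are and lump everything else into "higher";
# the kept levels are listed in order of first appearance in x
x_new <- factor(ifelse(keep, as.character(x), "higher"),
                levels = c(unique(as.character(x[keep])), "higher"))
table(x_new)
```

This variant does not require the frequent levels to be the smallest values of x, since each observation is looked up individually.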
I would use cut along the following lines:
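library(dplyr)

# Levels that account for at least 10% of the sample
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))
DATA %>%
  as.data.frame() %>%
  # Use the frequent levels as break points; all larger values share one bin
  mutate(x.new = cut(x, breaks = c(-1, brks, max(x)))) %>%
  pull(x.new) %>%
  table()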
#(-1,0] (0,1] (1,2] (2,8]
# 600 200 100 100
The resulting table gives the number of entries per interval, e.g. 600 entries in group (-1, 0], which corresponds to entries with value 0, 200 entries in group (0, 1], corresponding to entries with value 1, and so on.
Note that intervals are right-inclusive, i.e. for (x, y] the value y is included while x is not; see ?cut for details.
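To get the level names asked for in the question instead of interval labels, the labels argument of cut can be used. The following is a minimal sketch in base R; it assumes, as in this data set, that the frequent levels are the smallest integer values of x, so that each interval below the last contains exactly one value. The name x_new is illustrative only.

```r
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
t_x <- table(x)
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))

# One label per interval: the frequent values themselves, then "higher"
x_new <- cut(x, breaks = c(-1, brks, max(x)), labels = c(brks, "higher"))
table(x_new)
#   0      1      2 higher
# 600    200    100    100
```

If the frequent levels were not consecutive, some intervals would contain several distinct values and these labels would be misleading.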