Change factor levels conditional on the level frequency using dplyr
I have the following data set:
x = c(rep(0,600),rep(1,200), rep(2,100), rep(3,50), rep(4,20), rep(5,10), rep(6,10), rep(7,5), rep(8,5))
y = rbinom(1000,10,.5)
DATA = cbind(x, y)
Using
t_x = table(x)
I obtain:
x
  0   1   2   3   4   5   6   7   8
600 200 100  50  20  10  10   5   5
As some of the levels are very rare, I want to aggregate them so that each level is represented by at least 10% of the sample. The desired outcome after calling table on x should be:
x
  0   1   2 "higher"
600 200 100      100
I have tried to use the following code:
DATA %>% mutate(x = if_else(t_x <= length(x) * .1, factor("higher", levels = c("higher", levels(x))),
factor(x)
))
but if_else does not accept t_x.
I would use cut along the following lines:
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))
DATA %>%
  as.data.frame() %>%
  mutate(x.new = cut(x, breaks = c(-1, brks, max(x)))) %>%
  pull(x.new) %>%
  table()
#(-1,0]  (0,1]  (1,2]  (2,8]
#   600    200    100    100
The resulting table gives the number of entries per interval, e.g. 600 entries in group (-1, 0], which corresponds to entries with value 0, 200 entries in group (0, 1], corresponding to entries with value 1, and so on.
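If the interval names are not wanted, the labels argument of cut can carry the level names from the question instead. The following is only a sketch: it assumes, as in this data set, that the frequent levels are the smallest values and that max(x) is larger than the largest frequent level (otherwise the breaks would not be unique). The name x_new is made up for illustration.

```r
# Same x as in the question
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
t_x <- table(x)

# Levels that make up at least 10% of the sample (here 0, 1 and 2)
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))

# One label per interval: the kept levels, then "higher" for the last one
x_new <- cut(x, breaks = c(-1, brks, max(x)),
             labels = c(brks, "higher"))

table(x_new)
#   0      1      2 higher
# 600    200    100    100
```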
Note that intervals are right-inclusive, i.e. for (x, y] the value y is included while x is not; see ?cut for details.
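For comparison, a variant that stays closer to the if_else attempt from the question might look like the sketch below. One likely reason the original call fails is that t_x has one entry per level (9) while x has one entry per row (1000), so the condition cannot be matched to the rows; here the rare levels are looked up by name for every row instead. This assumes DATA is built as a data frame rather than with cbind, and the names rare and res are invented for the example.

```r
library(dplyr)

x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
y <- rbinom(1000, 10, .5)
DATA <- data.frame(x = x, y = y)  # data frame instead of the cbind() matrix

t_x <- table(DATA$x)

# Names of the levels that cover less than 10% of the sample
rare <- names(t_x[prop.table(t_x) < 0.10])

# Row-wise condition: is this row's level one of the rare ones?
res <- DATA %>%
  mutate(x = factor(if_else(as.character(x) %in% rare,
                            "higher",
                            as.character(x))))

table(res$x)
```

Unlike the cut approach, this does not rely on the rare levels being the largest values, since each row is matched by level name.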