Changing factor levels conditional on level frequency using dplyr

Question

I have the following data set:

x = c(rep(0,600),rep(1,200), rep(2,100), rep(3,50), rep(4,20), rep(5,10), rep(6,10), rep(7,5), rep(8,5))
y = rbinom(1000,10,.5)
DATA = cbind(x, y)

Using

t_x = table(x)

I obtain:

x
  0   1   2   3   4   5   6   7   8 
600 200 100  50  20  10  10   5   5

As some of the levels are very rare, I want to aggregate them so that each level accounts for at least 10% of the sample. The desired outcome after calling table on x should be:

x
  0   1   2  "higher" 
600 200 100      100

I have tried the following code:

DATA %>% mutate(x = if_else(t_x <= length(x) * .1, factor("higher", levels = c("higher", levels(x))),
            factor(x)
            ))

but if_else does not accept t_x.

Answer 1

I would use cut along the following lines:

library(dplyr)
# Values that make up at least 10% of the sample serve as break points
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))
DATA %>%
    as.data.frame() %>%
    mutate(x.new = cut(x, breaks = c(-1, brks, max(x)))) %>%
    pull(x.new) %>%
    table()
#(-1,0]  (0,1]  (1,2]  (2,8]
#   600    200    100    100

The resulting table gives the number of entries per interval, e.g. 600 entries in group (-1, 0], which corresponds to entries with value 0, 200 entries in group (0, 1], corresponding to entries with value 1, and so on. The last interval, (2, 8], collects all the rare values from 3 to 8.
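If the interval names should match the labels asked for in the question, cut also accepts a labels argument. The following is only a sketch: it rebuilds x from the question, and it relies on the rare values all lying above the frequent ones, which happens to be true for this data. The name x_new is chosen here for illustration.

```r
# Rebuild the vector from the question
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
t_x <- table(x)

# Values that make up at least 10% of the sample
brks <- as.numeric(names(t_x)[as.vector(prop.table(t_x)) >= 0.10])

# One label per interval: each frequent value keeps its own name,
# and the last interval (everything above the largest frequent value)
# is called "higher"
x_new <- cut(x, breaks = c(-1, brks, max(x)), labels = c(brks, "higher"))
table(x_new)
```

For this data the table should show the levels 0, 1, 2 and higher with counts 600, 200, 100 and 100.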

Note that intervals are right-inclusive, i.e. for (x, y] the value y is included while x is not; see ?cut for details.
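cut depends on numeric break points and on how the interval boundaries fall. If that is a concern, one possible alternative is to compare each value against the set of frequent levels directly. This is a sketch rather than part of the original answer; the names keep and x_new are made up for illustration. It also shows why the attempt in the question cannot work as written: if_else needs a condition with one entry per row, whereas t_x has one entry per level.

```r
library(dplyr)

# Rebuild x from the question as a data frame column
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
DATA <- data.frame(x = x)

# Names of the levels that make up at least 10% of the sample
t_x <- table(DATA$x)
keep <- names(t_x)[as.vector(prop.table(t_x)) >= 0.10]

# The condition is evaluated per row: keep the value if it is frequent,
# otherwise replace it with "higher"
DATA <- DATA %>%
  mutate(x_new = factor(
    if_else(as.character(x) %in% keep, as.character(x), "higher"),
    levels = c(keep, "higher")
  ))

table(DATA$x_new)
```

Because the comparison uses level names rather than intervals, this variant would also work when the rare values are scattered between the frequent ones.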
