
Change factor levels conditional on the level frequency using dplyr

I have the following data set:

x = c(rep(0,600),rep(1,200), rep(2,100), rep(3,50), rep(4,20), rep(5,10), rep(6,10), rep(7,5), rep(8,5))
y = rbinom(1000,10,.5)
DATA = cbind(x, y)

Using

t_x = table(x)

I obtain:

x
  0   1   2   3   4   5   6   7   8 
600 200 100  50  20  10  10   5   5 

As some of the levels are very rare, I want to aggregate them so that each level represents at least 10% of the sample. The desired outcome after calling table on x should be:

x
  0   1   2  "higher" 
600 200 100      100

I have tried to use the following code:

DATA %>% mutate(x = if_else(t_x <= length(x) * .1, factor("higher", levels = c("higher", levels(x))),
            factor(x)
            ))

but if_else does not accept t_x.

I would use cut along the following lines: 我将按照以下方式使用cut

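library(dplyr)

# Levels that make up at least 10% of the sample become the inner break points
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))
DATA %>%
    as.data.frame() %>%   # DATA is a matrix (built with cbind), so convert it first
    mutate(x.new = cut(x, breaks = c(-1, brks, max(x)))) %>%
    pull(x.new) %>%
    table()
#(-1,0]  (0,1]  (1,2]  (2,8]
#   600    200    100    100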

The resulting table gives the number of entries per interval, e.g. 600 entries in group (-1, 0], which corresponds to entries with value 0, 200 entries in group (0, 1], corresponding to entries with value 1, and so on.
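If the interval names are not wanted, one possible variation is to pass a labels vector to cut so that the groups carry the names from the desired output. The following is only a sketch: it rebuilds x from the question in a data frame, and the names result and counts are introduced purely for illustration. It also assumes that max(x) is not itself one of the frequent levels, since cut would then receive a duplicated break.

```r
library(dplyr)

# Rebuild the example data from the question as a data frame
x <- c(rep(0, 600), rep(1, 200), rep(2, 100), rep(3, 50), rep(4, 20),
       rep(5, 10), rep(6, 10), rep(7, 5), rep(8, 5))
DATA <- data.frame(x = x)

# Levels covering at least 10% of the sample: 0, 1 and 2
t_x  <- table(x)
brks <- as.numeric(names(t_x[prop.table(t_x) >= 0.10]))

# One label per interval: the frequent levels keep their value,
# the last interval (2, 8] is labelled "higher"
result <- DATA %>%
  mutate(x.new = cut(x, breaks = c(-1, brks, max(x)),
                     labels = c(brks, "higher")))

counts <- table(result$x.new)
counts
# counts: 0 = 600, 1 = 200, 2 = 100, higher = 100
```

Because labels must have exactly one entry per interval, c(brks, "higher") works here only because the breaks are c(-1, brks, max(x)), which define length(brks) + 1 intervals.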

Note that intervals are right-inclusive, i.e. for (x, y] the value y is included while x is not; see ?cut for details.
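The effect of the right-inclusive default can be checked on a tiny vector. This is a minimal illustration with made-up values (v, right_closed and left_closed are names chosen here, not part of the question), comparing the default with right = FALSE:

```r
v <- c(0, 1, 2)

# Default (right = TRUE): each interval includes its upper bound
right_closed <- cut(v, breaks = c(-1, 0, 2))
as.character(right_closed)
# "(-1,0]" "(0,2]"  "(0,2]"

# right = FALSE: each interval includes its lower bound instead
left_closed <- cut(v, breaks = c(0, 1, 3), right = FALSE)
as.character(left_closed)
# "[0,1)" "[1,3)" "[1,3)"
```

This is why the answer starts the breaks at -1: with right-inclusive intervals, a first break at 0 would leave the value 0 outside every interval and turn it into NA.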

