How to categorize a continuous variable into 4 groups of the same size in R?
I need to categorize a continuous variable into 4 classes, each with the same number of observations. I have used the function
cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)
My problem is that the number of observations in each category is not exactly the same, because several observations have exactly the same value as one of the quantiles. How can I get groups of exactly equal size?
My variable is waiting:
[1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74
[26] 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59
[51] 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71 62
[76] 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50 78 63 72 84 75 51 82
[101] 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56 88
[126] 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53
[151] 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81
[176] 81 73 50 85 74 55 77 83 83 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78
[201] 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78
[226] 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74
[251] 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74
It comes from the faithful dataset in R and has 272 observations, which divides evenly by 4 to give 68 observations per category.
I have used:
newwait<-cut(waiting, breaks =quantile(waiting,probs=seq(0,1,0.25)),include.lowest=TRUE,right=FALSE)
table(newwait)
newwait
[43,58) [58,76) [76,82) [82,96]
66 68 67 71
As you can see, the number of observations in each group is similar but not exactly the same.
Basically, it sounds like you need to deal with ties. You also need a vector whose length is divisible by 4, but I'll assume you know that.
Here's a solution using the tie-breaking option of rank (ties.method='first'):
set.seed(1)
x <- round(runif(1000,0,1),1)
table(x)
## x
## 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
## 43 106 95 103 112 109 82 102 95 100 53
y <- rank(x, ties.method='first') # <- this forces tie breaks
cuts <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE,
            right = FALSE)
# check that cuts are all the same length:
lapply(split(x,cuts), length)
$`[1,251)`
[1] 250
$`[251,500)`
[1] 250
$`[500,750)`
[1] 250
$`[750,1e+03]`
[1] 250
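The same idea should carry over to the waiting variable from the question. The following is a minimal sketch, assuming the data is taken from faithful$waiting; the names r and grp and the labels Q1 to Q4 are illustrative choices, not part of the original question. Because the ranks 1 to 272 are all distinct, cutting them at their quartiles should put 68 observations in each group.

```r
# faithful ships with base R (datasets package); waiting has 272 values
waiting <- faithful$waiting

# ties.method = "first" gives every observation a distinct rank from 1 to n,
# breaking ties by order of appearance in the vector
r <- rank(waiting, ties.method = "first")

# cut the ranks, not the raw values, at their quartiles
# (labels Q1..Q4 are an illustrative assumption)
grp <- cut(r, breaks = quantile(r, probs = seq(0, 1, 0.25)),
           include.lowest = TRUE, right = FALSE,
           labels = paste0("Q", 1:4))

# each of the four groups should now hold 68 observations
table(grp)
```

One consequence of breaking ties this way is that observations with the same waiting value can end up in different groups, so the groups are equal in size but their boundaries are no longer clean value ranges.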