
How to categorize a continuous variable into 4 groups of the same size in R?

I need to categorize a continuous variable into 4 classes, each with the same number of observations. I have used the function

cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)

My problem is that the number of observations in each category is not exactly the same, because several observations have exactly the same value as one of the quantiles. How can I get groups of exactly equal size?

My variable is waiting

[1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74
[26] 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59
[51] 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71 62
[76] 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50 78 63 72 84 75 51 82
[101] 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56 88
[126] 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53
[151] 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81
[176] 81 73 50 85 74 55 77 83 83 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78
[201] 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78
[226] 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74
[251] 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74

It comes from the faithful dataset in R. It has 272 observations, which is divisible by 4, so each category should contain 68 observations.

I have used

newwait <- cut(waiting, breaks = quantile(waiting, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)

table(newwait)
newwait
[43,58) [58,76) [76,82) [82,96] 
     66      68      67      71 

As you can see, the number of observations in each group is similar but not exactly the same.

Basically, it sounds like you need to deal with ties. You also need a vector whose length is divisible by 4, but I'll assume you know that.
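As a quick diagnostic, both conditions can be checked directly on the data from the question. This is only a sketch: the names n_ok, q and ties_at_breaks are illustrative and not from the original post, and it assumes the variable is faithful$waiting from base R.

```r
# faithful ships with base R (datasets package)
waiting <- faithful$waiting

# Condition 1: the length must be divisible by the number of groups
n_ok <- length(waiting) %% 4 == 0

# Condition 2: count how many observations sit exactly on each
# interior quartile (25%, 50%, 75%); any count above 1 means the
# cut point cannot split the data into exactly equal groups
q <- quantile(waiting, probs = seq(0, 1, 0.25))
ties_at_breaks <- sapply(q[2:4], function(b) sum(waiting == b))

n_ok
ties_at_breaks
```

If ties_at_breaks shows several observations on a cut point, cutting the raw values cannot give equal groups, whichever side of the interval is closed.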

Here's a solution using the tie-breaking option (ties.method) of rank:

set.seed(1)
x <- round(runif(1000,0,1),1)
table(x)
## x
##   0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 
##  43 106  95 103 112 109  82 102  95 100  53

y <- rank(x, ties.method='first') # <- this forces tie breaks
cuts <- cut(y, breaks = quantile(y,probs=seq(0,1,0.25)),
               include.lowest=TRUE,
               right=FALSE)
# check that cuts are all the same length:
lapply(split(x,cuts), length)
## $`[1,251)`
## [1] 250
## 
## $`[251,500)`
## [1] 250
## 
## $`[500,750)`
## [1] 250
## 
## $`[750,1e+03]`
## [1] 250
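The same idea can be applied to the waiting variable from the question. The following is a minimal sketch, assuming the data are faithful$waiting; the name r and the labels Q1 to Q4 are illustrative choices, not part of the original post.

```r
# faithful ships with base R (datasets package)
waiting <- faithful$waiting

# Rank the values, breaking ties by order of appearance,
# so that every observation gets a unique rank from 1 to 272
r <- rank(waiting, ties.method = "first")

# Cut the ranks (not the raw values) at their quartiles
newwait <- cut(r,
               breaks = quantile(r, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE,
               right = FALSE,
               labels = paste0("Q", 1:4))

# Each of the four groups now holds 272 / 4 = 68 observations
table(newwait)
```

Note the trade-off: observations tied at a quartile boundary are split between neighbouring groups, here according to their order in the vector. With ties.method = "random" the split is random instead, in which case set.seed should be called first for reproducibility.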
