cut2 splits into unequal buckets

I am doing some data manipulation and have been looking for a way to create deciles with an equal number of observations in each group. I found the cut2 function in the Hmisc package and was under the impression that specifying g=10 would split the data into 10 buckets of equal size. However, the bucket sizes it produces are quite uneven. Am I using cut2 incorrectly?

The code I am using:

library(Hmisc)
testdata <- data.frame(rating= c(8, 8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  6,  8,  8,  8,  8,  6,  8,  6,  8,  4,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  6,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  6,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  6,  8,  8,  6,  4,  8,  8,  8,  8,  8,  6,  8,  8,  8,  4,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  2,  8,  6,  8,  8,  8,  6,  8,  8,  6,  6,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  6,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  6,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  6,  8,  8,  8,  6)
,age=c(0,   0,  0,  0,  3,  4,  4,  4,  4,  6,  6,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  9,  9,  9,  9,  10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 39, 39, 39, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 46, 46, 46, 46, 47, 47, 47, 48, 48, 48, 54, 54, 54, 56, 56, 58, 59, 59, 59, 59, 60, 60, 60, 61, 66, 66, 70, 72))
# Ask cut2 for 10 quantile groups of age, then count observations per group
cutcutcut <- cut2(testdata$age, g = 10)
testtable <- table(cutcutcut)

The output shows an unequal number of observations in each bucket:

testtable

 [ 0,13) [13,15) [15,20) [20,24) [24,26) [26,28) [28,33) [33,40) [40,46) [46,72] 
 46      16      35      28      33      35      26      31      31      28 

The answer to your question lies in looking at the distribution of your data:

table(testdata$age)
#  0  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
#  4  1  4  6  4  3  4  2  2 16  9  7  5 10  6  7  7 13  4  2  9 
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
# 23 10 18 17  8  5  3  2  8  2  2  5  9  5  5  3  2  8  7  3  6 
# 45 46 47 48 54 56 58 59 60 61 66 70 72 
#  5  4  3  3  3  2  1  4  3  1  2  1  1 

We see that some ages are shared by a large number of individuals (e.g., there are 16 individuals with age 12 and 23 individuals with age 24). Since the cutting algorithm must put all individuals with exactly the same age into the same bucket, the buckets can end up unbalanced.
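
The effect of ties can be seen on a toy vector. The sketch below is only an illustration in base R, with made-up data (`x` and `cuts` are hypothetical and not part of the question). It lists the size of the lower bucket for every cut point that falls between distinct values:

```r
# Toy data: five of the eight observations are tied at the value 2
x <- c(1, 2, 2, 2, 2, 2, 3, 4)

# A cut point can only fall between distinct values, never inside a tie
cuts <- c(1.5, 2.5, 3.5)

# Number of observations that land in the lower bucket for each cut point
sizes <- sapply(cuts, function(b) sum(x < b))
sizes
# 1 6 7
```

Two equal buckets would need 4 observations each, but the lower bucket can only hold 1, 6 or 7, because the five 2s always move together.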

Since there are 309 total observations in your data and you want 10 buckets, you would ideally have 31 observations in 9 of the buckets and 30 in the last. Right now the last bucket is defined as [46, 72], which contains 28 elements (too few). If you expanded it to [45, 72], it would contain 33 elements (too many). There is no way to split the data so that this last bucket has exactly 30 or 31 observations, because there are 5 elements with value 45.
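
The arithmetic above can be checked with a short base R sketch. It rebuilds the frequencies from the table(testdata$age) output shown earlier, so it does not need Hmisc. The names `ages`, `counts` and `tail_size` are introduced here for illustration only:

```r
# Distinct ages and their frequencies, copied from table(testdata$age)
ages <- c(0, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
          21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
          37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 54, 56, 58, 59,
          60, 61, 66, 70, 72)
counts <- c(4, 1, 4, 6, 4, 3, 4, 2, 2, 16, 9, 7, 5, 10, 6, 7, 7, 13,
            4, 2, 9, 23, 10, 18, 17, 8, 5, 3, 2, 8, 2, 2, 5, 9,
            5, 5, 3, 2, 8, 7, 3, 6, 5, 4, 3, 3, 3, 2, 1, 4,
            3, 1, 2, 1, 1)

n <- sum(counts)   # 309 observations in total
n / 10             # 30.9, so ideal buckets hold 30 or 31

# Size of the last bucket if it started at each distinct age
tail_size <- rev(cumsum(rev(counts)))
names(tail_size) <- ages

tail_size[c("45", "46")]        # 33 and 28
any(tail_size %in% c(30, 31))   # FALSE: no starting age gives 30 or 31
```

The same cumulative-count check can be applied to any other bucket boundary to see which sizes are reachable.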
