cut2 splits into unequal buckets

I am currently doing some data manipulation and have been looking for a way to create deciles with an equal number of observations in each group. I came across the Hmisc package and its cut2 function, and was under the impression that specifying g=10 should split the data into 10 buckets with an equal number of observations in each. However, the bucket sizes this function returns are far from equal. Am I using cut2 incorrectly?

The code I am using:

library(Hmisc)
testdata <- data.frame(rating= c(8, 8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  6,  8,  8,  8,  8,  6,  8,  6,  8,  4,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  6,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  6,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  6,  8,  8,  6,  4,  8,  8,  8,  8,  8,  6,  8,  8,  8,  4,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  2,  8,  6,  8,  8,  8,  6,  8,  8,  6,  6,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  6,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  4,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  6,  8,  8,  8,  6,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  6,  8,  8,  8,  8,  8,  6,  8,  8,  8,  6)
,age=c(0,   0,  0,  0,  3,  4,  4,  4,  4,  6,  6,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  9,  9,  9,  9,  10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 39, 39, 39, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 46, 46, 46, 46, 47, 47, 47, 48, 48, 48, 54, 54, 54, 56, 56, 58, 59, 59, 59, 59, 60, 60, 60, 61, 66, 66, 70, 72))
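# Ask cut2 for 10 quantile groups (deciles) of age
cutcutcut <- cut2(testdata$age, g = 10)
# Count the observations that fall into each group
testtable <- table(cutcutcut)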

And the output, with unequal numbers of observations in each bucket:

testtable

 [ 0,13) [13,15) [15,20) [20,24) [24,26) [26,28) [28,33) [33,40) [40,46) [46,72] 
 46      16      35      28      33      35      26      31      31      28 

The answer to your question lies in the distribution of your data:

table(testdata$age)
#  0  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
#  4  1  4  6  4  3  4  2  2 16  9  7  5 10  6  7  7 13  4  2  9 
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
# 23 10 18 17  8  5  3  2  8  2  2  5  9  5  5  3  2  8  7  3  6 
# 45 46 47 48 54 56 58 59 60 61 66 70 72 
#  5  4  3  3  3  2  1  4  3  1  2  1  1 

We see that some ages are shared by a large number of individuals (e.g. there are 16 individuals aged 12 and 23 individuals aged 24). Since the cutting algorithm must put all individuals with exactly the same age into the same bucket, this can lead to imbalances between the buckets.
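One way to see the effect of these ties is to compare the running total of observations at each distinct age with the ideal running totals at the nine inner decile boundaries. The sketch below rebuilds the age vector from the frequency table above, using base R only. The names age_values, age_counts, target, below and above are illustrative and do not appear in the original code.

```r
# Rebuild the age vector from the frequency table shown above
# (distinct ages and how many individuals have each age).
age_values <- c(0, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
                35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 54,
                56, 58, 59, 60, 61, 66, 70, 72)
age_counts <- c(4, 1, 4, 6, 4, 3, 4, 2, 2, 16, 9, 7, 5, 10, 6, 7, 7,
                13, 4, 2, 9, 23, 10, 18, 17, 8, 5, 3, 2, 8, 2, 2,
                5, 9, 5, 5, 3, 2, 8, 7, 3, 6, 5, 4, 3, 3, 3,
                2, 1, 4, 3, 1, 2, 1, 1)
age <- rep(age_values, times = age_counts)

n <- length(age)               # total number of observations
cum_n <- cumsum(table(age))    # running total of observations by distinct age
target <- n * (1:9) / 10       # ideal running totals at the 9 inner boundaries

# For each ideal boundary, find the closest running totals that a cut
# between two distinct ages can actually achieve (one below, one above).
below <- sapply(target, function(t) max(c(0, cum_n[cum_n <= t])))
above <- sapply(target, function(t) min(cum_n[cum_n > t]))
data.frame(target, below, above)
```

The first row shows the jump: 30 observations are aged 11 or younger, but 46 are aged 12 or younger, so no cut between distinct ages gives a first bucket of exactly 31.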

Since there are 309 observations in total and you want 10 buckets, you would ideally have 31 observations in 9 of the buckets and 30 in the remaining one. Right now the last bucket is defined as [46, 72], which contains 28 elements (too few). If you expanded it to [45, 72], it would contain 33 elements (too many). There is no way to split the data to get exactly 30 or 31 observations in this last bucket, because there are 5 elements with value 45.
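The counts for the last bucket can be checked directly, and if exactly balanced groups matter more than keeping identical ages together, one possible workaround (not part of the original answer) is to cut on ranks with ties broken arbitrarily instead of on the ages themselves. The following is a minimal base R sketch under that assumption. It rebuilds the age vector from the frequency table above, and the names age_values, age_counts, r and decile are illustrative.

```r
# Rebuild the age vector from the frequency table shown above.
age_values <- c(0, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
                35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 54,
                56, 58, 59, 60, 61, 66, 70, 72)
age_counts <- c(4, 1, 4, 6, 4, 3, 4, 2, 2, 16, 9, 7, 5, 10, 6, 7, 7,
                13, 4, 2, 9, 23, 10, 18, 17, 8, 5, 3, 2, 8, 2, 2,
                5, 9, 5, 5, 3, 2, 8, 7, 3, 6, 5, 4, 3, 3, 3,
                2, 1, 4, 3, 1, 2, 1, 1)
age <- rep(age_values, times = age_counts)

# Check the two candidate definitions of the last bucket.
sum(age >= 46)   # 28 observations in [46, 72]
sum(age >= 45)   # 33 observations in [45, 72]

# Assumption: it is acceptable for individuals of the same age to land in
# different groups. Give every observation a unique rank (ties broken by
# order of appearance), then map ranks 1..n onto groups 1..10.
r <- rank(age, ties.method = "first")
decile <- ceiling(r * 10 / length(age))

table(decile)    # one group of 30 and nine groups of 31
```

The trade-off is that the groups no longer correspond to clean age intervals: for example, some of the individuals aged 12 end up in the first group and the rest in the second.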
