cut2 splits into unequal buckets
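I am currently doing some data manipulation and have been looking for a way to create deciles with an equal number of observations in each group. I came across the Hmisc package and its cut2 function, and was under the impression that specifying g = 10 would split the data into 10 buckets with the same number of observations in each. However, the bucket sizes in the output are far from equal. Am I using cut2 correctly?
The code I am using: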
library(Hmisc)
testdata <- data.frame(rating= c(8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 6, 8, 8, 8, 8, 6, 8, 6, 8, 4, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 6, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 6, 8, 8, 6, 4, 8, 8, 8, 8, 8, 6, 8, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2, 8, 6, 8, 8, 8, 6, 8, 8, 6, 6, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 6, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 6, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 6, 8, 8, 8, 6)
,age=c(0, 0, 0, 0, 3, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 39, 39, 39, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 46, 46, 46, 46, 47, 47, 47, 48, 48, 48, 54, 54, 54, 56, 56, 58, 59, 59, 59, 59, 60, 60, 60, 61, 66, 66, 70, 72))
cutcutcut <- cut2(testdata$age,g=10)
testtable <- table(cutcutcut)
And the output, which shows an unequal number of observations in each bucket:
testtable
[ 0,13) [13,15) [15,20) [20,24) [24,26) [26,28) [28,33) [33,40) [40,46) [46,72]
     46      16      35      28      33      35      26      31      31      28
The answer to your question lies in the distribution of your data:
table(testdata$age)
#  0  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#  4  1  4  6  4  3  4  2  2 16  9  7  5 10  6  7  7 13  4  2  9
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
# 23 10 18 17  8  5  3  2  8  2  2  5  9  5  5  3  2  8  7  3  6
# 45 46 47 48 54 56 58 59 60 61 66 70 72
#  5  4  3  3  3  2  1  4  3  1  2  1  1
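We see that some ages are shared by many individuals (for example, there are 16 individuals aged 12 and 23 aged 24). Because the cutting algorithm has to put all individuals of exactly the same age into the same bucket, this can leave the buckets unbalanced.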
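To make the effect of ties concrete, here is a minimal sketch that rebuilds the sorted ages from the table(testdata$age) output above and looks at where an ideal first decile boundary would fall. The names age_values, age_counts, sorted_age and first_cut are illustrative only, and the vector is reconstructed from the tabulated counts rather than taken from the original data frame.

```r
# Ages and their frequencies, copied from the table(testdata$age) output.
age_values <- c(0, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
                24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
                45, 46, 47, 48, 54, 56, 58, 59, 60, 61, 66, 70, 72)
age_counts <- c(4, 1, 4, 6, 4, 3, 4, 2, 2, 16, 9, 7, 5, 10, 6, 7, 7, 13, 4, 2, 9,
                23, 10, 18, 17, 8, 5, 3, 2, 8, 2, 2, 5, 9, 5, 5, 3, 2, 8, 7, 3, 6,
                5, 4, 3, 3, 3, 2, 1, 4, 3, 1, 2, 1, 1)

# Expanding the counts gives the ages in ascending order.
sorted_age <- rep(age_values, times = age_counts)
n <- length(sorted_age)                      # 309 observations

# A first bucket of 31 observations would end between positions 31 and 32.
first_cut <- 31
sorted_age[first_cut]                        # 12
sorted_age[first_cut + 1]                    # 12

# The run of 12s occupies positions 31 to 46, so no break can fall inside it.
tie_positions <- range(which(sorted_age == 12))
tie_positions                                # 31 46
```

Both sides of the ideal boundary hold the value 12, so a value-based break can only end the first bucket at position 30 or at position 46. The latter matches the 46 observations reported for [ 0,13).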
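Your data has 309 observations in total and you want 10 buckets, so ideally 9 buckets would hold 31 observations each and the last one would hold 30. Right now the last bucket is defined as [46, 72] and contains 28 elements (too few). Extending it to [45, 72] would give it 33 elements (too many). Because 5 elements have the value 45, there is no way to split the data so that the last bucket holds exactly 30 or 31 observations.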
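The counts for the last bucket can be checked directly. The sketch below again rebuilds the ages from the tabulated counts (the variable names are illustrative). It also shows one possible workaround that is not part of cut2: if exactly balanced groups matter more than keeping equal ages together, buckets can be assigned from ranks with ties broken by order of appearance, using only base R.

```r
# Rebuild the age vector from the table(testdata$age) output.
age_values <- c(0, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
                24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
                45, 46, 47, 48, 54, 56, 58, 59, 60, 61, 66, 70, 72)
age_counts <- c(4, 1, 4, 6, 4, 3, 4, 2, 2, 16, 9, 7, 5, 10, 6, 7, 7, 13, 4, 2, 9,
                23, 10, 18, 17, 8, 5, 3, 2, 8, 2, 2, 5, 9, 5, 5, 3, 2, 8, 7, 3, 6,
                5, 4, 3, 3, 3, 2, 1, 4, 3, 1, 2, 1, 1)
age <- rep(age_values, times = age_counts)

# Size of the last bucket under the two candidate lower bounds.
n_from_46 <- sum(age >= 46)                  # 28
n_from_45 <- sum(age >= 45)                  # 33

# Assumed alternative: rank-based buckets. ties.method = "first" gives every
# observation a distinct rank, so equal ages may land in different buckets.
g <- 10
bucket <- ceiling(rank(age, ties.method = "first") * g / length(age))
bucket_sizes <- table(bucket)
bucket_sizes                                 # one bucket of 30, nine of 31
```

The trade-off is that the rank-based buckets no longer correspond to clean age intervals: two people of the same age can end up in neighbouring deciles, which may or may not be acceptable for the analysis.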