简体   繁体   中英

Using cut2 from Hmisc to calculate cuts for different number of groups

I was trying to calculate equal quantile cuts for a vector by using cut2 from Hmisc.

library(Hmisc)
c <- c(-4.18304,-3.18343,-2.93237,-2.82836,-2.13478,-2.01892,-1.88773,
       -1.83124,-1.74953,-1.74858,-0.63265,-0.59626,-0.5681)

cut2(c, g=3, onlycuts=TRUE)

[1] -4.18304 -2.01892 -1.74858 -0.56810

But I was expecting the following result (33%, 33%, 33%):

[1] -4.18304 -2.13478 -1.74858 -0.56810

Should I still use cut2 or try something different? How can I make it work? Thanks for your advice.

You are seeing the cutpoints, but you want the tabular counts, and you want them as fractions of the total, so do this instead:

> prop.table(table(cut2(c, g=3) ) )

[-4.18,-2.019) [-2.02,-1.749) [-1.75,-0.568] 
     0.3846154      0.3076923      0.3076923 

(Obviously you cannot expect cut2 to create an exact split when the count of elements was not evenly divisible by 3.)

It seems that there were accidentally thirteen values in the original data set, instead of twelve. Thirteen values cannot be equally divided into three quantile groups (as mentioned by BondedDust). Here is the original problem, except that one selected data value (-1.74953) is excluded, making it twelve values. This gives the result originally expected:

library(Hmisc)

c<-c(-4.18304,-3.18343,-2.93237,-2.82836,-2.13478,-2.01892,-1.88773,-1.83124,-1.74858,-0.63265,-0.59626,-0.5681)

cut2(c, g=3,onlycuts=TRUE)
#[1] -4.18304 -2.13478 -1.74953 -0.5681


To make it clearer to anyone not familiar with cut2 from the Hmisc package (like me as of this morning), here's a similar problem, except that we'll use the integers 1 through 12 (assigned to the vector dozen_values ).

library(Hmisc)

dozen_values <-1:12

quantile_groups <- cut2(dozen_values,g=3)

levels(quantile_groups)
## [1] "[1, 5)" "[5, 9)" "[9,12]"

cutpoints <- cut2(dozen_values, g=3, onlycuts=TRUE)

cutpoints
## [1]  1  5  9 12

# Show which values belong to which quantile group, using a data frame
quantile_DF <- data.frame(dozen_values, quantile_groups)
names(quantile_DF) <- c("value", "quantile_group")

quantile_DF
##    value quantile_group
## 1      1         [1, 5)
## 2      2         [1, 5)
## 3      3         [1, 5)
## 4      4         [1, 5)
## 5      5         [5, 9)
## 6      6         [5, 9)
## 7      7         [5, 9)
## 8      8         [5, 9)
## 9      9         [9,12]
## 10    10         [9,12]
## 11    11         [9,12]
## 12    12         [9,12]

Notice that, the first quantile group includes everything up to, but not including , 5 (ie 1 thorough 4, in this case). The second quantile group contains 5 up to, but not including , 9 (ie 5 through 8, in this case). The third (last) quantile group contains 9 through 12, which includes the last value 12. Unlike the other quantile groups, the third quantile group includes the last value shown.

Anyway, you can see that the "cutpoints" 1 , 5 , 9 , and 12 describe the start and end points of the quantile groups in the most concise way, but it is obtuse without reading relevant documentation (link to single page Inside-R site, instead of the almost 400 page PDF manual).

See this explanation about the parentheses vs square bracket notation, if it is unfamiliar to you.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM