
Error when binning data using `cut` in R

I am trying to bin a variable with values between 1 and 100,000 into ten groups of width 10,000. I am using the following code and getting an error.

cut(x, breaks = quantile(x, probs=seq(0, 100000, 10000)), include.lowest = TRUE)

What am I doing wrong?

At first I took this for a typo, but after some discussion in the comments I decided to write an answer.

The error comes from quantile: probs must lie between 0 and 1 (see ?quantile), but seq(0, 100000, 10000) runs from 0 to 100000.
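To see this for yourself, here is a minimal sketch that catches the error instead of stopping. The vector x is made-up data standing in for your variable, and the names bad and good are only for illustration.

```r
# Hypothetical data standing in for the asker's variable
set.seed(1)
x <- sample(1:100000, 1000)

# probs outside [0, 1]: quantile() refuses and signals an error
bad <- tryCatch(quantile(x, probs = seq(0, 100000, 10000)),
                error = function(e) e)
inherits(bad, "error")   # TRUE
conditionMessage(bad)    # the message quantile() raised

# probs inside [0, 1]: returns the 11 decile break points
good <- quantile(x, probs = seq(0, 1, 0.1))
length(good)             # 11
```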


It looks like you have mixed up the following two:

cut(x, breaks = seq(0, 100000, 10000), include.lowest = TRUE)                # ten equal-width bins
cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)), include.lowest = TRUE)  # ten equal-count (decile) bins

As I said, they will give different results, especially when your data are not uniformly distributed.
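Applied to a variable on your scale, the two calls might look like the sketch below. The data are simulated, since I do not have your x, and width_bins and decile_bins are names I made up. dig.lab = 6 is only there to keep the labels out of scientific notation.

```r
# Hypothetical stand-in for a variable with values between 1 and 100,000
set.seed(1)
x <- sample(1:100000, 500, replace = TRUE)

# Ten groups of width 10,000: (0,10000], (10000,20000], ...
width_bins <- cut(x, breaks = seq(0, 100000, 10000),
                  include.lowest = TRUE, dig.lab = 6)

# Ten groups holding roughly equal numbers of observations
decile_bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)),
                   include.lowest = TRUE, dig.lab = 6)

nlevels(width_bins)    # 10
table(width_bins)      # counts vary from bin to bin
table(decile_bins)     # counts are (nearly) equal
```

If what you want is "ten groups by 10,000", the first call is the one to use.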

As a representative example, consider non-uniformly distributed data, say Beta distributed:

set.seed(0)
x <- rbeta(10000, 3, 5)

b1 <- seq(0, 1, 0.1)  # equal-width breaks on [0, 1]

b2 <- quantile(x, probs = seq(0, 1, 0.1), names = FALSE)  # breaks at the sample deciles
round(b2, 2)
# [1] 0.01 0.17 0.23 0.28 0.32 0.37 0.41 0.46 0.52 0.60 0.94

Note that b1 and b2 differ substantially. You can inspect this with an (empirical) quantile-quantile plot:

plot(b1, b2); abline(0, 1)

You will see that the dots deviate strongly from the line.

Above, b1 gives equal-width bins, while b2 gives bins of unequal width. Now consider the bin counts:

table(cut(x, breaks = b1, include.lowest = TRUE))
#  [0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] 
#      256      1239      2011      2242      1948      1323       685       245 
#(0.8,0.9]   (0.9,1] 
#       48         3 

table(cut(x, breaks = b2, include.lowest = TRUE))
#[0.0101,0.169]  (0.169,0.228]  (0.228,0.276]  (0.276,0.321]  (0.321,0.365] 
#          1000           1000           1000           1000           1000 
# (0.365,0.412]  (0.412,0.463]  (0.463,0.519]  (0.519,0.598]  (0.598,0.935] 
#          1000           1000           1000           1000           1000 

See the difference? If we place the break points at quantiles, every bin gets the same count; with equal-width breaks, the counts follow the shape of the distribution instead.
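If you would rather check this numerically than by eye, a small sketch along these lines should do it. It rebuilds the same Beta example so that it runs on its own; n1 and n2 are names introduced here for the two count vectors.

```r
set.seed(0)
x <- rbeta(10000, 3, 5)

b1 <- seq(0, 1, 0.1)                                      # equal-width breaks
b2 <- quantile(x, probs = seq(0, 1, 0.1), names = FALSE)  # decile breaks

# Bin widths: constant for b1, varying for b2
range(diff(b1))
range(diff(b2))

# Bin counts as plain integer vectors
n1 <- as.vector(table(cut(x, breaks = b1, include.lowest = TRUE)))
n2 <- as.vector(table(cut(x, breaks = b2, include.lowest = TRUE)))

all(n2 == length(x) / 10)   # TRUE: every decile bin holds 1000 points
range(n1)                   # equal-width bins hold very different counts
```

One caveat: this works cleanly here because x is continuous. With heavily tied data, two quantiles can coincide, and cut will then stop with a complaint that the breaks are not unique.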

