
Split a data frame column into quantiles so that no value falls into more than one quantile in R

I have a data frame defined as below:

  df <- structure(list(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA, 
-10L)) 

I want to split the column 'value' into 'n' quantiles (say n = 3) such that no value falls into two quantiles. For example, every occurrence of the value '2' should get the same quantile.

I tried the 'ntile' function from dplyr, as below:

library(dplyr)
df1 <- mutate(df, R_rank = ntile(value, 3))

Result is:

structure(list(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5), R_rank = c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), class = "data.frame", row.names = c(NA, 
-10L))

Here the value '2' falls into two different quantiles (1 and 2), but I want every value to fall into exactly one quantile.

How can I do this in R?

One possible solution is to set the type argument of quantile() to a value other than the default, type = 7.

V <- df$value  # the column to split
n <- 3

q5 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 5)
q6 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 6)
q8 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 8)
q9 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 9)

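And to split the input vector (rightmost.closed = TRUE keeps the maximum value in the last group instead of giving it a group of its own):

split(V, findInterval(V, q5, rightmost.closed = TRUE))
split(V, findInterval(V, q6, rightmost.closed = TRUE))
split(V, findInterval(V, q8, rightmost.closed = TRUE))
split(V, findInterval(V, q9, rightmost.closed = TRUE))

All four split calls above give the same result; the reason is explained below.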

The types 5, 6, 8 and 9 were found by computing the quantiles for all nine types with the following code:

sapply(1:9, function(i)
  quantile(V, probs = seq(0, 1, length.out = n + 1), type = i)
)
#          [,1] [,2] [,3] [,4]     [,5]     [,6] [,7]     [,8]     [,9]
#0%           1    1    1    1 1.000000 1.000000    1 1.000000 1.000000
#33.33333%    2    2    2    2 2.000000 2.000000    2 2.000000 2.000000
#66.66667%    2    2    2    2 2.166667 2.333333    2 2.222222 2.208333
#100%         5    5    5    5 5.000000 5.000000    5 5.000000 5.000000

In columns 5, 6, 8 and 9 the 2/3 quantile differs from 2, so the four break points are distinct. Any of those types can therefore be used to address the question.

Their 2/3 quantiles all lie strictly between 2 and 3, and no data value lies in that range, which is why the split calls all output the same list.
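To make the reasoning above concrete, here is a minimal sketch that applies the quantile breaks to the data frame from the question and compares the default type 7 with type 5. The helper assign_groups() is a hypothetical name introduced only for illustration, and rightmost.closed = TRUE is passed to findInterval so that the maximum value stays in the last group.

```r
# Data from the question; n is the number of groups wanted.
df <- data.frame(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5))
n <- 3
probs <- seq(0, 1, length.out = n + 1)

# assign_groups() is a hypothetical helper, not part of any package.
assign_groups <- function(x, type) {
  breaks <- quantile(x, probs = probs, type = type, names = FALSE)
  # rightmost.closed = TRUE keeps max(x) in the last group
  findInterval(x, breaks, rightmost.closed = TRUE)
}

g7 <- assign_groups(df$value, type = 7)  # default quantile type
g5 <- assign_groups(df$value, type = 5)  # interpolating type

# Tied values always share a group with findInterval;
# what differs between types is how many groups are non-empty.
length(unique(g7))  # 2: the 1/3 and 2/3 breaks coincide at 2
length(unique(g5))  # 3

df$group <- g5
table(df$value, df$group)
```

With type 5 the groups are {1, 1}, {2, 2, 2, 2, 2} and {3, 4, 5}. On other data even the interpolating types can produce coinciding breaks, so checking the number of non-empty groups, as done here, is a reasonable safeguard.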

You can probably use cut:

cut(df$value, 3, labels = FALSE)
#[1] 1 1 1 1 1 1 1 2 3 3

where

df$value #is
#[1] 1 1 2 2 2 2 2 3 4 5

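So the values 1 and 2 fall into group 1, 3 falls into group 2, and 4 and 5 fall into group 3. Note that when breaks is a single number, cut divides the range of the data into that many equal-width intervals, so these groups are bins of equal width rather than quantiles.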
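If quantile-based groups are wanted from cut instead of equal-width bins, one option is to pass explicit break points. The sketch below assumes the interpolating quantile type 5 (an arbitrary choice among the types that give distinct breaks for this data) and uses right = FALSE with include.lowest = TRUE so that each interval is closed on the left and the maximum is still included.

```r
# Data from the question; n is the number of groups wanted.
df <- data.frame(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5))
n <- 3

# Quantile breaks; type = 5 is an assumption made for this sketch.
breaks <- quantile(df$value, probs = seq(0, 1, length.out = n + 1),
                   type = 5, names = FALSE)

# right = FALSE gives intervals [a, b); with include.lowest = TRUE the
# last interval is closed on the right, so max(value) is not dropped.
df$group <- cut(df$value, breaks = breaks, right = FALSE,
                include.lowest = TRUE, labels = FALSE)

df$group
#[1] 1 1 2 2 2 2 2 3 3 3
```

cut stops with an error when the breaks are not unique, which is what happens with the default type = 7 on this data, so duplicated break points surface immediately rather than silently producing fewer groups.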
