
Quantiles by factor levels in R

I have a data frame and I'm trying to create a new variable in it that holds the quantile group of a continuous variable var1, computed separately for each level of a factor strata.

# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
                  )

# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE, labels = 1:q)
    quantile
}

I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:

qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))

This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable's values up with the correct rows in the data frame).

A second approach was to do this in steps:

tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3

Again, this calculates the quantiles correctly for each factor level but, as with aggregate, the values are not in the same order as the rows of the data frame. We can check this by putting the quantile "bins" into the data frame.

# get quantile bins
qfun2 <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE)
    quantile
}

tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33

Many of the values of var1 fall outside the bins in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.

I think your issue is that you don't really want to aggregate; you want to use ave (or data.table or plyr):

qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))

#using plyr
library(plyr)

qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))

#using data.table (my preference)
library(data.table)
setDT(dat)  # := only works on a data.table, so convert dat by reference first
dat[, qq := qfun(var1), by = strata]

Aggregate usually implies returning an object that is smaller than the original (in this case you were getting a data.frame where x was a list with one element for each stratum).
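
For comparison, the split / lapply approach from the question can probably be made to line up as well by swapping unlist for unsplit, which puts each group's values back into their original row positions. The following is only a minimal sketch, assuming the dat and qfun from the question; the names by_stratum, q_unsplit and q_ave are made up for illustration:

```r
# Rebuild the question's data and helper so the sketch runs on its own
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE)))

qfun <- function(x, q = 5) {
    cut(x, breaks = quantile(x, probs = 0:q/q),
        include.lowest = TRUE, labels = 1:q)
}

# split() groups the values by stratum. unlist() would simply concatenate the
# groups in stratum order, whereas unsplit() restores the original row order.
by_stratum <- lapply(split(dat$var1, dat$strata), qfun)
q_unsplit  <- unsplit(by_stratum, dat$strata)

# ave() does the split / apply / put-back steps in a single call
q_ave <- ave(dat$var1, dat$strata, FUN = qfun)

# Compare the two results row by row
all(as.integer(q_unsplit) == q_ave)
```

Note that unsplit returns a factor here while ave returns the numeric bin codes, hence the as.integer in the comparison.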

Use ave on your dat data frame. Full example with your simulated data and qfun function:

# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
              strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
              )

# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE, labels = 1:q)
    quantile
}

And my addition...

dat$q <- ave(dat$var1,dat$strata,FUN=qfun)
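
One way to check that the bins now sit on the right rows, sketched under the assumption that dat$q was created as above (ordered_ok is a made-up name): within each stratum, sorting by var1 should never make the bin number go down.

```r
# Same data and helper as in the question, repeated so this runs on its own
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE)))

qfun <- function(x, q = 5) {
    cut(x, breaks = quantile(x, probs = 0:q/q),
        include.lowest = TRUE, labels = 1:q)
}

dat$q <- ave(dat$var1, dat$strata, FUN = qfun)

# Within each stratum, a larger var1 should never get a smaller bin number
ordered_ok <- sapply(split(dat, dat$strata),
                     function(d) !is.unsorted(d$q[order(d$var1)]))
ordered_ok

# Number of rows per stratum (rows) and bin (columns)
table(dat$strata, dat$q)
```

If the bins were misaligned, as with the unlist approach, at least one stratum would be expected to fail the ordering check.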
