
Quantiles by factor levels in R

I have a data frame and I'm trying to create a new variable in it that holds the quantiles of a continuous variable var1, computed within each level of a factor strata.

# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
                  )

# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE, labels = 1:q)
    quantile
}

I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:

qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))

This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable values up with the correct rows in the data frame).

A second approach was to do this in steps:

tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3

Again, this calculates the quantiles correctly for each factor level, but, as with aggregate, they aren't in the correct order in the data frame. We can check this by putting the quantile "bins" into the data frame.

# get quantile bins
qfun2 <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE)
    quantile
}

tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33

Many of the values of var1 fall outside the bins in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.

I think your issue is that you don't really want to aggregate; you want ave (or data.table or plyr):

qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))

#using plyr
library(plyr)

qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))

#using data.table (my preference)
library(data.table)
dat <- as.data.table(dat)  # := needs a data.table, not a plain data.frame
dat[, qq := qfun(var1), by = strata]

Aggregate usually implies returning an object that is smaller than the original. (In this case you were getting a data.frame where x was a list with one element for each stratum.)
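To make the difference in shape concrete, here is a minimal sketch on a made-up data set. The names toy and halves are hypothetical and only for illustration; halves is the same idea as qfun with q = 2, so each stratum is cut at its own median. Note that ave hands back the factor codes as plain numbers rather than a factor, which matches the labels here because they are 1:q.

```r
# Hypothetical toy data: 6 rows in stratum A, 4 rows in stratum B
toy <- data.frame(
    var1   = c(5, 1, 9, 3, 7, 2, 8, 4, 6, 10),
    strata = factor(c("A", "B", "A", "B", "A", "A", "B", "A", "B", "A"))
)

# Same construction as qfun, but with two bins per stratum
halves <- function(x, q = 2) {
    cut(x, breaks = quantile(x, probs = 0:q/q),
        include.lowest = TRUE, labels = 1:q)
}

# aggregate(): one row per stratum, with the bins packed into a list column
agg <- with(toy, aggregate(var1, list(strata), FUN = halves))
nrow(agg)

# ave(): one value per original row, already in the original row order
per_row <- with(toy, ave(var1, strata, FUN = halves))
length(per_row)
cbind(toy, bin = per_row)
```

Because ave keeps the original length and order, its result can be assigned straight into a new column without any re-alignment step.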

Use ave on your dat data frame. Full example with your simulated data and qfun function:

# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
              strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
              )

# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE, labels = 1:q)
    quantile
}

And my addition...

dat$q <- ave(dat$var1,dat$strata,FUN=qfun)
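If you would rather keep the split / lapply steps from the question, one option (not part of the answers above, just a sketch) is to swap unlist for base R's unsplit, which puts each piece back at the row positions it came from. The column names q_unsplit and q_ave below are made up for the comparison.

```r
# Data and qfun as in the question
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE)))

qfun <- function(x, q = 5) {
    cut(x, breaks = quantile(x, probs = 0:q/q),
        include.lowest = TRUE, labels = 1:q)
}

# Split by stratum and bin each piece, as in the question's second attempt
pieces <- lapply(split(dat$var1, dat$strata), qfun)

# unlist(pieces) would be ordered stratum by stratum;
# unsplit() restores the original row order instead
dat$q_unsplit <- unsplit(pieces, dat$strata)

# Compare with the ave() result (numeric codes rather than a factor)
dat$q_ave <- ave(dat$var1, dat$strata, FUN = qfun)
all(as.integer(dat$q_unsplit) == dat$q_ave)
```

The unsplit version keeps the result as a factor, while ave returns numbers, so pick whichever type suits the later analysis.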
