Quantiles by factor levels in R
I have a data frame and I'm trying to create a new variable in it that holds the quantiles of a continuous variable, var1, computed within each level of a factor, strata.
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                    include.lowest = TRUE, labels = 1:q)
    quantile
}
I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:
qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))
This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable values up with the correct rows in the data frame).
A second approach was to do this in steps:
tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3
Again, this calculates the quantiles correctly for each factor level but, as with aggregate, they aren't in the correct order in the data frame. We can check this by putting the quantile "bins" into the data frame.
# get quantile bins
qfun2 <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                    include.lowest = TRUE)
    quantile
}
tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33
Many of the values of var1 are outside of the bins in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.
I think your issue is that you don't really want to aggregate, but rather to use ave (or data.table or plyr):
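# using ave (base R)
qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))
# using plyr
library(plyr)
qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))
# using data.table (my preference)
library(data.table)
dat <- as.data.table(dat)  # := only works on a data.table, not a plain data.frame
dat[, qq := qfun(var1), by = strata]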
Aggregate usually implies returning an object that is smaller than the original. (In this case you were getting a data.frame where x was a list with one element for each stratum.)
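To make the difference in shape concrete, here is a minimal sketch that reuses the simulated data and qfun from the question. The names agg and qq are only illustrative. It compares how many results each approach hands back: aggregate gives one row per stratum, while ave gives one value per original row.

```r
# Same simulated data and qfun as in the question
set.seed(472)
dat <- data.frame(
  var1   = rnorm(50, 10, 3)^2,
  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
qfun <- function(x, q = 5) {
  cut(x, breaks = quantile(x, probs = 0:q / q),
      include.lowest = TRUE, labels = 1:q)
}

# aggregate() collapses the data: one row per stratum
agg <- with(dat, aggregate(var1, list(strata), FUN = qfun))
nrow(agg) == nlevels(dat$strata)

# ave() keeps the original length and row order: one value per row
qq <- ave(dat$var1, dat$strata, FUN = qfun)
length(qq) == nrow(dat)
```

Because ave returns a vector that is already aligned with the rows of dat, it can be assigned straight to a new column.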
Use ave on your dat data frame. Full example with your simulated data and qfun function:
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                    include.lowest = TRUE, labels = 1:q)
    quantile
}
And my addition:
dat$q <- ave(dat$var1, dat$strata, FUN = qfun)
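As a rough sanity check, the split/lapply approach from the question can also be repaired with base R's unsplit, which puts each group's results back into their original row positions (unlist simply concatenates the groups in strata order, which is why the rows did not line up). The sketch below assumes the same data and qfun as above. The names pieces and q_unsplit are made up for illustration.

```r
# Same simulated data and qfun as in the question
set.seed(472)
dat <- data.frame(
  var1   = rnorm(50, 10, 3)^2,
  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
qfun <- function(x, q = 5) {
  cut(x, breaks = quantile(x, probs = 0:q / q),
      include.lowest = TRUE, labels = 1:q)
}

# Route 1: ave() returns the bin codes as a numeric vector, row-aligned
dat$q <- ave(dat$var1, dat$strata, FUN = qfun)

# Route 2: split / lapply as in the question, but reassembled with
# unsplit() instead of unlist() so the original row order is restored
pieces <- lapply(split(dat$var1, dat$strata), qfun)
dat$q_unsplit <- unsplit(pieces, dat$strata)

# Both routes should agree row by row; unsplit() yields a factor,
# so compare on the integer codes
all(dat$q == as.integer(dat$q_unsplit))
```

If the two columns agree, either can be used; ave is shorter, while unsplit keeps the result as a factor.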