Cut function in R to account for reoccurring data
Say I have a range of data indicating the ages of individuals in years, such that
ages <- sample(40:80, 30, replace = F)
Now I want to plot (boxplot) the ages against another variable, say weight.
But I want to cut the age samples into the following categories: <50, >50, >60, >70, so that the weight of an individual who is 66 will be used for both the >50 and >60 plots.
My understanding is that I use the cut command:
age.category <- cut(ages, breaks = c(40, 50, 60, 70, 80) ........)
But how do I format this to account for repeated data, when I want
labels = c("x < 50", "x > 50", "x > 60", "x > 70")
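For context, cut() places each value in exactly one interval, which is why a single call cannot put a 66-year-old in both the >50 and >60 groups. A minimal illustration, assuming four made-up ages (not the sampled data above) and the default interval labels:

```r
# Four hypothetical ages, one per interval
ages_demo <- c(45, 55, 66, 75)

# Each age lands in exactly one right-closed interval
age.category <- cut(ages_demo, breaks = c(40, 50, 60, 70, 80))

# 66 appears only under (60,70], never under a ">50" style group as well
as.character(age.category)
```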
Here's an alternative "multicut" function which allows you to specify arbitrary breaks. It repeats values, once for each group in which that value appears. So if you had 64, it would output both 64, "x>50" and 64, "x>60".
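The membership test behind this can be sketched for a single value. The names `ranges`, `age`, `in_range` and `matched` are illustrative only, and the sketch assumes left-closed, right-open ranges:

```r
# Named list of [min, max) ranges, one per (possibly overlapping) category
ranges <- list(
  "x<50" = c(-Inf, 50),
  "x>50" = c(50, Inf),
  "x>60" = c(60, Inf),
  "x>70" = c(70, Inf)
)

age <- 64

# TRUE for every range that contains the value
in_range <- vapply(ranges, function(r) age >= r[1] && age < r[2], logical(1))

# A value can match several categories at once
matched <- names(ranges)[in_range]
matched
```

For 64 this yields "x>50" and "x>60", so the value would be repeated once under each of those two groups.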
# sample data
set.seed(15)
ages <- sample(40:80, 30, replace = F)
weights <- 100 + ages*.2 + rnorm(30, 0, 20)

# custom breaks
# named list: names = categories, values = 2-vector with min/max
breaks <- list(
    "x<50" = c(-Inf, 50),
    "x>50" = c(50, Inf),
    "x>60" = c(60, Inf),
    "x>70" = c(70, Inf)
)
Now we define the main helper function, multicut:
multicut <- function(x, breaks, vals=x, left.closed=TRUE, right.closed=FALSE,
    x.name=if(missing(vals)) deparse(substitute(x)) else deparse(substitute(vals)),
    group.name="group") {
    # drop row names so the stacked result gets clean ones
    unrowname <- function(x) {rownames(x)<-NULL; x}
    # a data.frame of values is split into one single-row data.frame per observation
    if (is.data.frame(vals)) {
        if(missing(x.name)) x.name<-names(vals)
        vals = Map(unrowname, split(vals, 1:nrow(vals)))
    }
    stopifnot(length(vals) == length(x))
    # for each element of x, a logical vector flagging the ranges it falls in
    grp <- lapply(x, function(x) {
        mapply(function(z, br,l,r) {
            left<-if (l) z>=br[1] else z>br[1]
            right<-if (r) z<=br[2] else z<br[2]
            left & right
        }, x, breaks, left.closed, right.closed)
    })
    # one row per (value, matching group) pair; group is NA if nothing matches
    df <- do.call(rbind.data.frame,
        Map(cbind.data.frame,
            g=lapply(grp, function(z) if(any(z)) names(breaks)[z] else NA),
            x=vals))
    df[[1]] <- factor(df[[1]], levels=names(breaks))
    names(df) <- c(group.name, x.name)
    df
}
Now we use it on the sample data:
dd <- multicut(ages, breaks, weights)
boxplot(weights~group, dd)
The three important parameters to multicut are x, which contains the values you wish to use for categorization; breaks, which is the named list of min/max values for each group; and optionally vals, which is either a vector or data.frame you want to split based on x and breaks. Here we want to use ages to split up the weights.
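To make the roles of x, breaks and vals concrete, roughly the same long-format result can be sketched group by group in base R. This is not the multicut implementation, only a simplified stand-in: it assumes left-closed, right-open ranges, drops values that match no group instead of keeping them as NA, and orders rows by group rather than by observation. The name `stacked` is illustrative.

```r
# Same sample data and breaks as in the answer
set.seed(15)
ages <- sample(40:80, 30, replace = FALSE)
weights <- 100 + ages * 0.2 + rnorm(30, 0, 20)
breaks <- list(
  "x<50" = c(-Inf, 50),
  "x>50" = c(50, Inf),
  "x>60" = c(60, Inf),
  "x>70" = c(70, Inf)
)

# Build one data frame per group, then stack them;
# an observation is repeated once for every range it falls in
stacked <- do.call(rbind, lapply(names(breaks), function(g) {
  br <- breaks[[g]]
  keep <- ages >= br[1] & ages < br[2]
  data.frame(group = rep(g, sum(keep)), age = ages[keep], weight = weights[keep])
}))

# Keep the groups in the order they were declared
stacked$group <- factor(stacked$group, levels = names(breaks))

# Number of observations contributing to each box
table(stacked$group)
```

The result can be passed to boxplot(weight ~ group, data = stacked) in the same way as the multicut output.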