
Cut function in R to account for reoccurring data

Say I have a range of data indicating the ages of individuals in years, such that

ages <- sample(40:80, 30, replace = F)

Now I want to draw a boxplot of these against another variable, say weight.

But I want to cut the age samples into the following categories: <50, >50, >60, >70, so that the weight of an individual who is 66 will be used for both the >50 and >60 plots.

My understanding is that I use the cut command:

age.category <- cut(ages, breaks = c(40, 50, 60, 70, 80) ........)

But how do I format it to account for repeated data, when I want

labels = c("x < 50", "x > 50", "x > 60", "x > 70")

Here's an alternative "multicut" function which allows you to specify arbitrary breaks. It repeats values, once for each group in which the value appears. So if you had 64, it would output both 64, "x>50" and 64, "x>60".

#sample data
set.seed(15)
ages <- sample(40:80, 30, replace = F)
weights <- 100 + ages*.2 + rnorm(30, 0 , 20)

#custom breaks
#named list: names = categories, values = 2-vector with min/max
breaks<-list(
   "x<50" =c(-Inf, 50),
   "x>50" =c(50, Inf),
   "x>60" = c(60, Inf),
   "x>70" = c(70, Inf)
)
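Because the groups overlap, a single age can match several entries of breaks. The following is a minimal sketch of that membership test, assuming the same half-open intervals [min, max) that multicut uses by default (left.closed=TRUE, right.closed=FALSE). The helpers in_group and groups_for are hypothetical names introduced only for this illustration; they are not part of the answer's code.

```r
breaks <- list(
  "x<50" = c(-Inf, 50),
  "x>50" = c(50, Inf),
  "x>60" = c(60, Inf),
  "x>70" = c(70, Inf)
)

# hypothetical helper: TRUE if age lies in the half-open interval [min, max)
in_group <- function(age, br) age >= br[1] & age < br[2]

# hypothetical helper: names of all groups that a single age falls into
groups_for <- function(age) {
  hit <- vapply(breaks, function(br) in_group(age, br), logical(1))
  names(breaks)[hit]
}

groups_for(64)  # "x>50" "x>60"
groups_for(50)  # "x>50"
groups_for(45)  # "x<50"
```

Note that with a closed left edge, an age of exactly 50 lands in the group labelled "x>50"; if that matters, the labels or the left.closed/right.closed arguments would need adjusting.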

Now we define the main helper function multicut:

multicut <- function(x, breaks, vals=x, left.closed=TRUE, right.closed=FALSE, 
  x.name=if(missing(vals)) deparse(substitute(x)) else deparse(substitute(vals)),
  group.name="group") {

    # helper: drop row names from a data.frame
    unrowname <- function(x) {rownames(x)<-NULL; x}
    # a data.frame of values is split into a list of one-row data.frames
    if (is.data.frame(vals)) {
        if(missing(x.name)) x.name<-names(vals)
        vals <- Map(unrowname, split(vals, seq_len(nrow(vals))))
    }
    stopifnot(length(vals) == length(x))
    # for each element of x: logical vector, one entry per break,
    # TRUE where the element lies inside that break's min/max
    grp <- lapply(x, function(x) {
        mapply(function(z, br,l,r) {
            left<-if (l) z>=br[1] else z>br[1]
            right<-if (r) z<=br[2] else z<br[2]
            left & right
        }, x, breaks, left.closed, right.closed)
    })
    # one row per (group, value) pair; values matching no group get NA
    df <- do.call(rbind.data.frame, 
        Map(cbind.data.frame,  
        g=lapply(grp, function(z) if(any(z)) names(breaks)[z] else NA),
        x=vals))
    # keep the groups in the order given in breaks
    df[[1]] <- factor(df[[1]], levels=names(breaks))
    names(df) <- c(group.name, x.name)
    df
}

Now we use it on the sample data:

dd <- multicut(ages, breaks, weights)
boxplot(weights~group, dd)

The three important parameters to multicut are x, which contains the values you wish to use for categorization; breaks, which is the named list of min/max values for each group; and optionally vals, which is either a vector or a data.frame that you want to split based on x and breaks. Here we want to use the ages to split up the weights.
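For comparison, here is a more compact sketch of the same idea that loops over the groups rather than over the observations. It is not the multicut function above: it assumes only the default half-open intervals [min, max), orders the rows by group instead of by observation, and silently drops values that match no group instead of giving them an NA group. The names pieces and long are introduced only for this illustration.

```r
# same sample data and breaks as above
set.seed(15)
ages <- sample(40:80, 30, replace = FALSE)
weights <- 100 + ages * 0.2 + rnorm(30, 0, 20)

breaks <- list(
  "x<50" = c(-Inf, 50),
  "x>50" = c(50, Inf),
  "x>60" = c(60, Inf),
  "x>70" = c(70, Inf)
)

# one data.frame per group, holding every observation inside [min, max)
pieces <- lapply(names(breaks), function(g) {
  br <- breaks[[g]]
  keep <- ages >= br[1] & ages < br[2]
  data.frame(group = rep(g, sum(keep)),
             ages = ages[keep],
             weights = weights[keep])
})

# stack the pieces; an observation appears once per group it belongs to
long <- do.call(rbind, pieces)
long$group <- factor(long$group, levels = names(breaks))

table(long$group)                      # number of observations per group
boxplot(weights ~ group, data = long)  # one box per (overlapping) group
```

The per-observation approach in multicut is more general (closed or open edges per break, data.frame values); this version trades that flexibility for brevity.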

