Cut a variable differently based on another grouping variable

Question

Example: I have a dataset of heights by gender. I'd like to split the heights into low and high, where the cut point is defined as the mean minus two standard deviations (mean - 2*sd) within each gender.

Example dataset:

set.seed(8)
df = data.frame(sex = c(rep("M",100), rep("F",100)), 
                ht = c(rnorm(100, mean=1.7, sd=.17), rnorm(100, mean=1.6, sd=.16)))

I'd like to do this in a single line of vectorized code because I'm fairly sure that is possible; however, I do not know how to write it. I imagine there may be a way to use cut(), apply(), and/or dplyr to achieve this.

Answer 1

How about this, using cut() from base R:

sapply(c("F", "M"), function(s){
    dfF <- df[df$sex==s,] # keep only the rows for sex s
    cut(dfF$ht, breaks = c(0, mean(dfF$ht)-2*sd(dfF$ht), Inf), labels = c("low", "high"))
})
# dfF$ht: the heights for one sex
# mean(dfF$ht)-2*sd(dfF$ht): the cut point for that sex
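
The call above returns the labels grouped by sex rather than aligned with the rows of df. If the labels are wanted as a column of df, one possible variation is to compute a cut point per sex first and then look it up row by row. This is only a sketch under the question's setup; the names cutoffs and ht_grp are introduced here for illustration.

```r
set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# One cut point per sex (mean - 2 * sd), returned as a named numeric vector
cutoffs <- sapply(split(df$ht, df$sex), function(x) mean(x) - 2 * sd(x))

# Look up each row's cut point by its sex, then label the row
row_cutoff <- unname(cutoffs[as.character(df$sex)])
df$ht_grp  <- ifelse(df$ht < row_cutoff, "low", "high")

table(df$sex, df$ht_grp)
```

Indexing cutoffs by as.character(df$sex) works whether sex is stored as character or as a factor.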

Answer 2

Just discovered the following solution using base R:

df$ht_grp <- ave(x = df$ht, df$sex, 
                 FUN = function(x) 
                       cut(x, breaks = c(0, (mean(x, na.rm=TRUE) - 2*sd(x, na.rm=TRUE)), Inf)))

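This works because I know that 0 and Inf are reasonable bounds, but I could also use min(x) and max(x) as my lower and upper bounds (with include.lowest = TRUE, so that the minimum itself is not left out of the first interval). The intent is a factor variable that is split into low, high, and NA. Note, however, that ave() assigns the result of FUN back into a copy of the numeric vector x, so the stored column may hold the factor's integer codes (1 for the lower interval, 2 for the upper) rather than a labelled factor; str(df$ht_grp) shows which.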
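
Since ave() is built around functions that return numbers, another option is to let it return each row's group cut point and do the comparison afterwards. The following is a minimal sketch, assuming the df from the question; the cutoff vector and the "low"/"high" labels are choices made here for illustration.

```r
set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# ave() recycles the single value returned by FUN across every row of the group,
# so cutoff has one entry per row of df
cutoff <- ave(df$ht, df$sex,
              FUN = function(x) mean(x, na.rm = TRUE) - 2 * sd(x, na.rm = TRUE))

# Compare each height with its own group's cut point and store a labelled factor
df$ht_grp <- factor(ifelse(df$ht < cutoff, "low", "high"),
                    levels = c("low", "high"))

str(df$ht_grp)
```

Rows with a missing height come out as NA, because the comparison df$ht < cutoff is NA for them.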

My prior solution: I came up with the following two-step process, which is not so bad:

df = merge(df, 
           setNames( aggregate(ht ~ sex, df, FUN = function(x) mean(x)-2*sd(x)), 
                     c("sex", "ht_cutoff")), 
           by = "sex")

df$ht_is_low = ifelse(df$ht <= df$ht_cutoff, 1, 0)

Answer 3

In the code below, I created two new data frames. Both were created by grouping on the sex variable and filtering on different ranges of ht: df_low keeps heights more than 2 sd below the group mean, and df_high keeps heights more than 2 sd above it.

library(dplyr)
df_low  <- df %>% group_by(sex) %>% filter(ht < (mean(ht) - 2*sd(ht)))
df_high <- df %>% group_by(sex) %>% filter(ht > (mean(ht) + 2*sd(ht)))
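
The filters above return subsets of the rows. If a low/high label on every row is wanted instead, as the question asks, a grouped mutate() is one way to do it. This is a sketch only, assuming the df from the question and that dplyr is installed; the names df_grp and ht_grp are introduced here for illustration.

```r
library(dplyr)

set.seed(8)
df <- data.frame(sex = c(rep("M", 100), rep("F", 100)),
                 ht  = c(rnorm(100, mean = 1.7, sd = 0.17),
                         rnorm(100, mean = 1.6, sd = 0.16)))

# Within each sex, label a height "low" if it falls below mean - 2 * sd
df_grp <- df %>%
  group_by(sex) %>%
  mutate(ht_grp = if_else(ht < mean(ht) - 2 * sd(ht), "low", "high")) %>%
  ungroup()

count(df_grp, sex, ht_grp)
```

Inside a grouped mutate(), mean(ht) and sd(ht) are evaluated per group, so each row is compared with the cut point of its own sex.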

3 answers

Answer 1 | 1 | 2016-09-15 16:00:55
Answer 2 | 0 | accepted | 2016-09-15 15:43:57
Answer 3 | 0 | 2016-09-16 02:50:44
