Grouping the dataframe based on one variable

I have a dataframe with 10 variables, all of them numeric. One of the variables is named age, and I want to group the observations based on it. For example, ages 17 to 18 form one group and ages 19 to 22 another, and each row should be assigned to its group. The result should be a dataframe for further manipulation. Model of the dataframe:

A   B   AGE
25  50  17
30  42  22
50  60  19
65  105 17
355 400 21
68  47  20
115 98  18
25  75  19

And I want a result like this:

17-18 
A   B   AGE
25  50  17
65  105 17
115 98  18

19-22
A   B   AGE
30  42  22
50  60  19
355 400 21
68  47  20
25  75  19

I grouped the dataset by the AGE variable using the split function; now my concern is how to manipulate the grouped data. E.g., the result looked like:

$1
  A   B   AGE
  25  50  17
  65  105 17
  115 98  18

$2
  A   B   AGE
  30  42  22
  50  60  19
  355 400 21
  68  47  20
  25  75  19

My question is: how can I access each group for further manipulation? For example, what if I want to do a t-test for each group separately?

The split function works with dataframes. Use either cut with 'breaks' or findInterval with an appropriate set of cutpoints (named 'vec' if you are using named parameters) as the grouping criterion, i.e. the second argument to split. By default, cut uses intervals closed on the right, and findInterval uses intervals closed on the left.

> split(dat, findInterval(dat$AGE, c(17, 19.5, 22.5)))
$`1`
    A   B AGE
1  25  50  17
3  50  60  19
4  65 105  17
7 115  98  18
8  25  75  19

$`2`
    A   B AGE
2  30  42  22
5 355 400  21
6  68  47  20
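
Note that the cutpoints above put ages 17-19 in the first group. To see the difference in interval closure concretely, and to get the 17-18 / 19-22 grouping asked for in the question with findInterval, here is a minimal sketch. The cutpoints c(17, 19, 23) and the name by_age are my own assumptions, not part of the answer above; the data are rebuilt inline so the snippet runs on its own.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
dat <- data.frame(A   = c(25, 30, 50, 65, 355, 68, 115, 25),
                  B   = c(50, 42, 60, 105, 400, 47, 98, 75),
                  AGE = c(17, 22, 19, 17, 21, 20, 18, 19))

# Boundary behaviour at AGE == 18 with identical cutpoints:
# cut() uses (16,18], (18,22]; findInterval() uses [16,18), [18,22)
cut_at_18 <- cut(18, breaks = c(16, 18, 22), labels = FALSE)  # 1
fi_at_18  <- findInterval(18, c(16, 18, 22))                  # 2

# Assumed cutpoints: [17,19) holds ages 17-18, [19,23) holds ages 19-22
by_age <- split(dat, findInterval(dat$AGE, c(17, 19, 23)))
```

With these cutpoints, by_age[["1"]] holds the three rows with ages 17-18 and by_age[["2"]] the five rows with ages 19-22.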

Here is the approach with cut:

lst <- split(df1, cut(df1$AGE, breaks=c(16, 18, 22), labels=FALSE))
lst
# $`1`
#   A   B AGE
#1  25  50  17
#4  65 105  17
#7 115  98  18

#$`2`
#   A   B AGE
#2  30  42  22
#3  50  60  19
#5 355 400  21
#6  68  47  20
#8  25  75  19
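
On the question of accessing each group: every element of the list returned by split is an ordinary data frame, so it can be pulled out by name or processed with lapply. The following is only a sketch. The question does not say which variables the t-test should compare, so the Welch two-sample test of A against B within each group is an assumption, and the names g1, g2, tests and p_values are made up for illustration.

```r
# Data and split as above, rebuilt so the snippet is self-contained
df1 <- data.frame(A   = c(25, 30, 50, 65, 355, 68, 115, 25),
                  B   = c(50, 42, 60, 105, 400, 47, 98, 75),
                  AGE = c(17, 22, 19, 17, 21, 20, 18, 19))
lst <- split(df1, cut(df1$AGE, breaks = c(16, 18, 22), labels = FALSE))

# Access a single group by its name; both forms return a data frame
g1 <- lst[["1"]]   # ages 17-18
g2 <- lst$`2`      # ages 19-22

# Assumed comparison: t-test of A against B, run separately per group
tests <- lapply(lst, function(g) t.test(g$A, g$B))

# Collect one p-value per group into a named vector
p_values <- sapply(tests, function(tt) tt$p.value)
```

Each element of tests is a regular "htest" object, so tests[["1"]] prints the full test result for the first group.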

Update

If you need to find the sum and mean of the columns for each "list" element:

lapply(lst, function(x) rbind(colSums(x[-3]),colMeans(x[-3])))

But if the objective is to find summary statistics by group, it can be done using any of the aggregating functions:

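 library(dplyr)
 # across() replaces the deprecated summarise_each()/funs() (needs dplyr >= 1.0)
 df1 %>% 
     group_by(grp=cut(AGE, breaks=c(16, 18, 22), labels=FALSE)) %>% 
     summarise(across(A:B, list(sum = ~sum(.x, na.rm=TRUE),
                                mean = ~mean(.x, na.rm=TRUE))))
 # Values returned (printed as a tibble, so the display is rounded):
 #   grp A_sum    A_mean B_sum    B_mean
 #1   1   205  68.33333   253  84.33333
 #2   2   528 105.60000   624 124.80000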

Or using aggregate from base R:

 do.call(data.frame,
   aggregate(cbind(A,B)~cbind(grp=cut(AGE, breaks=c(16, 18, 22), 
    labels=FALSE)), df1, function(x) c(sum=sum(x), mean=mean(x))))
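
Since the question asks for a data frame as the result, another option is to keep everything in one data frame and store the group as a column, which the aggregating functions can then use directly. This is a sketch under my own assumptions: the column name grp and the labels "17-18" and "19-22" are illustrative choices, not something from the answers above.

```r
# Data as above, rebuilt so the snippet is self-contained
df1 <- data.frame(A   = c(25, 30, 50, 65, 355, 68, 115, 25),
                  B   = c(50, 42, 60, 105, 400, 47, 98, 75),
                  AGE = c(17, 22, 19, 17, 21, 20, 18, 19))

# Store the age group as a factor column (labels are an assumption)
df1$grp <- cut(df1$AGE, breaks = c(16, 18, 22), labels = c("17-18", "19-22"))

# Per-group means of A and B, returned as a data frame
group_means <- aggregate(cbind(A, B) ~ grp, data = df1, FUN = mean)

# The same idea for a single column with tapply
mean_A <- tapply(df1$A, df1$grp, mean)
```

Keeping the label in a column means the original rows stay together, and a single group can still be extracted with df1[df1$grp == "17-18", ].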

data

df1 <- structure(list(A = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L
), B = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L), AGE = c(17L, 
22L, 19L, 17L, 21L, 20L, 18L, 19L)), .Names = c("A", "B", "AGE"
), class = "data.frame", row.names = c(NA, -8L))
