Grouping the dataframe based on one variable
I have a dataframe with 10 variables, all of them numeric. One of the variables is named age, and I want to group the observations based on age. For example, ages 17 to 18 form one group and ages 19 to 22 another, and each row should be assigned to its group. The result should be a dataframe for further manipulation. Model of the dataframe:
A B AGE
25 50 17
30 42 22
50 60 19
65 105 17
355 400 21
68 47 20
115 98 18
25 75 19
And I want a result like this:
17-18
A B AGE
25 50 17
65 105 17
115 98 18
19-22
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
I grouped the dataset by the AGE variable using the split function; my concern now is how to manipulate the grouped data. For example, the result looked like:
$1
A B AGE
25 50 17
65 105 17
115 98 18
$2
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
My question is: how can I access each group for further manipulation? For example, what if I want to run a t-test on each group separately?
The split function works with dataframes. Use either cut with 'breaks' or findInterval with an appropriate set of cutpoints (named 'vec' if you are using named parameters) as the grouping criterion, which is the second argument to split. By default, cut uses intervals closed on the right, whereas findInterval uses intervals closed on the left.
> split(dat, findInterval(dat$AGE, c(17, 19.5, 22.5)))
$`1`
A B AGE
1 25 50 17
3 50 60 19
4 65 105 17
7 115 98 18
8 25 75 19
$`2`
A B AGE
2 30 42 22
5 355 400 21
6 68 47 20
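With the cutpoints c(17, 19.5, 22.5), age 19 lands in the first group. As a minimal sketch, the cutpoints c(17, 19, 23) below are an assumption (they do not come from the answer) chosen to reproduce the 17-18 and 19-22 groups asked for in the question. The same snippet also checks how a boundary value is treated by findInterval and by cut:

```r
# Ages from the example data
age <- c(17, 22, 19, 17, 21, 20, 18, 19)

# findInterval: intervals are closed on the left, here [17, 19) and [19, 23)
grp_fi <- findInterval(age, c(17, 19, 23))

# cut: intervals are closed on the right, here (16, 18] and (18, 22]
grp_cut <- cut(age, breaks = c(16, 18, 22), labels = FALSE)

# Both put ages 17-18 in group 1 and ages 19-22 in group 2
identical(grp_fi, as.integer(grp_cut))

# With identical cutpoints, the boundary value 18 is treated differently
fi_18  <- findInterval(18, c(16, 18, 22))                   # 18 opens the second interval
cut_18 <- cut(18, breaks = c(16, 18, 22), labels = FALSE)   # 18 closes the first interval
c(findInterval = fi_18, cut = cut_18)
```

Either vector can be passed as the second argument to split.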
Here is the approach with cut:
lst <- split(df1, cut(df1$AGE, breaks=c(16, 18, 22), labels=FALSE))
lst
# $`1`
# A B AGE
#1 25 50 17
#4 65 105 17
#7 115 98 18
#$`2`
# A B AGE
#2 30 42 22
#3 50 60 19
#5 355 400 21
#6 68 47 20
#8 25 75 19
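To access an individual group, the elements of the list returned by split can be extracted by name. The sketch below assumes labelled groups; the labels "17-18" and "19-22" and the variable names young and older are illustrative choices, not part of the answer above:

```r
# Example data, rebuilt here so the snippet runs on its own
df1 <- data.frame(
  A   = c(25, 30, 50, 65, 355, 68, 115, 25),
  B   = c(50, 42, 60, 105, 400, 47, 98, 75),
  AGE = c(17, 22, 19, 17, 21, 20, 18, 19)
)

# Label the groups instead of numbering them (labels are illustrative)
df1$grp <- cut(df1$AGE, breaks = c(16, 18, 22), labels = c("17-18", "19-22"))

# Splitting on the labelled factor gives a named list of data frames
lst <- split(df1, df1$grp)
names(lst)

# Each element is an ordinary data frame and can be pulled out by name
young <- lst[["17-18"]]
older <- lst$`19-22`
nrow(young)
mean(older$A)
```

Because the grp column stays in df1, the ungrouped data frame also remains available for further manipulation.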
If you need to find the sum and mean of the columns for each list element:
lapply(lst, function(x) rbind(colSums(x[-3]),colMeans(x[-3])))
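The same lapply pattern may also cover the t-test mentioned in the question. The question does not say which comparison is wanted, so testing A against B within each group is purely an assumption made for this sketch:

```r
# Example data, rebuilt here so the snippet runs on its own
df1 <- data.frame(
  A   = c(25, 30, 50, 65, 355, 68, 115, 25),
  B   = c(50, 42, 60, 105, 400, 47, 98, 75),
  AGE = c(17, 22, 19, 17, 21, 20, 18, 19)
)
lst <- split(df1, cut(df1$AGE, breaks = c(16, 18, 22), labels = FALSE))

# Welch two-sample t-test of A against B within each group
# (the choice of A versus B is an assumption for illustration)
tests <- lapply(lst, function(g) t.test(g$A, g$B))

# Each result is an "htest" object; collect the p-values into a named vector
p_values <- sapply(tests, function(tt) tt$p.value)
p_values
```

A single result can be inspected with tests[["1"]] or tests[["2"]].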
But if the objective is to find summary statistics by group, it can be done with any of the aggregating functions:
library(dplyr)
df1 %>%
  group_by(grp = cut(AGE, breaks = c(16, 18, 22), labels = FALSE)) %>%
  summarise_each(funs(sum = sum(., na.rm = TRUE),
                      mean = mean(., na.rm = TRUE)), A:B)
# grp A_sum B_sum A_mean B_mean
#1 1 205 253 68.33333 84.33333
#2 2 528 624 105.60000 124.80000
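In recent dplyr releases, summarise_each and funs are deprecated. A roughly equivalent sketch, assuming dplyr 1.0.0 or later, uses across inside summarise; the name res is only illustrative:

```r
library(dplyr)

# Example data, rebuilt here so the snippet runs on its own
df1 <- data.frame(
  A   = c(25, 30, 50, 65, 355, 68, 115, 25),
  B   = c(50, 42, 60, 105, 400, 47, 98, 75),
  AGE = c(17, 22, 19, 17, 21, 20, 18, 19)
)

# Group by the age bin, then apply sum and mean to columns A through B
res <- df1 %>%
  group_by(grp = cut(AGE, breaks = c(16, 18, 22), labels = FALSE)) %>%
  summarise(across(A:B, list(sum  = ~ sum(.x, na.rm = TRUE),
                             mean = ~ mean(.x, na.rm = TRUE))))
res
```

With across, the result columns are ordered by input column (A_sum, A_mean, B_sum, B_mean) instead of by function.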
Or using aggregate from base R:
do.call(data.frame,
        aggregate(cbind(A, B) ~ cbind(grp = cut(AGE, breaks = c(16, 18, 22),
                                                labels = FALSE)),
                  df1, function(x) c(sum = sum(x), mean = mean(x))))
Data used in the examples:
df1 <- structure(list(A = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L),
                      B = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L),
                      AGE = c(17L, 22L, 19L, 17L, 21L, 20L, 18L, 19L)),
                 .Names = c("A", "B", "AGE"),
                 class = "data.frame", row.names = c(NA, -8L))