Summarize by group and across multiple columns in R
I have a data frame containing some basic information about students. I want to get summary statistics for Age and Sex within each Group.
set.seed(500)
testdf <- data.frame(
  ID = paste0("Stu", 1:10),
  Age = sample(18:25, 10, replace = TRUE),
  Sex = sample(c("Boy", "Girl", "NA"), 10, replace = TRUE),
  Name = c("Pwyll", "Flavian", "Leehi", "Zuzana", "Aniya", "Bogomil",
           "Lameez", "Prudencia", "Ikuo", "Grayson"),
  GroupMath = sample(LETTERS[1:2], 10, replace = TRUE),
  GroupEng = sample(LETTERS[1:2], 10, replace = TRUE),
  GroupScie = sample(LETTERS[1:2], 10, replace = TRUE),
  GroupChine = sample(LETTERS[1:2], 10, replace = TRUE)
)
I want it to look like this picture. (Desired output.)
In my code I use three separate steps (group counts, Age statistics, and Sex counts) for GroupMath, and then repeat them for GroupEng, GroupScie, and GroupChine. Does anyone know how I can make this more efficient? I can't thank you enough.
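library(dplyr)  # provides %>%, group_by(), count(), and summarize()

# Step 1: number of students in each math group
N.math <- testdf %>% group_by(GroupMath) %>% count(GroupMath)
# Step 2: Age statistics for each math group
Age.math <- testdf %>% group_by(GroupMath) %>% summarize(
  Mean = mean(Age),
  Max = max(Age),
  Min = min(Age),
  sd = sd(Age))
# Step 3: Sex counts within each math group
Sex.math <- testdf %>% group_by(GroupMath) %>% count(Sex)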
Here is at least one way to simplify your summary stats so you are not aggregating each group one by one. The idea is to pivot your data to long format, so that the class name and the group letter each become a single column, and then summarise the data by those columns. First, the pivot:
#### Load Tidyverse ####
library(tidyverse)

#### Pivot to Long Format ####
groups <- testdf %>%
  pivot_longer(cols = contains("Group"),
               names_to = "Class",
               values_to = "Group")
groups
Which looks like this:
# A tibble: 40 × 6
   ID    Age   Sex   Name    Class      Group
   <chr> <int> <chr> <chr>   <chr>      <chr>
 1 Stu1     24 Girl  Pwyll   GroupMath  B
 2 Stu1     24 Girl  Pwyll   GroupEng   B
 3 Stu1     24 Girl  Pwyll   GroupScie  A
 4 Stu1     24 Girl  Pwyll   GroupChine B
 5 Stu2     20 Girl  Flavian GroupMath  B
 6 Stu2     20 Girl  Flavian GroupEng   A
 7 Stu2     20 Girl  Flavian GroupScie  A
 8 Stu2     20 Girl  Flavian GroupChine B
 9 Stu3     24 NA    Leehi   GroupMath  A
10 Stu3     24 NA    Leehi   GroupEng   B
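As an optional variation on the pivot step (not part of the original answer), `pivot_longer()` also accepts a `names_prefix` argument that strips a leading string from the column names, so the Class column would read "Math" and "Eng" rather than "GroupMath" and "GroupEng". The sketch below uses a small hypothetical data frame called `toy`, invented here purely for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical two-student data frame, made up for this illustration
toy <- data.frame(
  ID = c("Stu1", "Stu2"),
  Age = c(24L, 20L),
  GroupMath = c("B", "B"),
  GroupEng = c("B", "A")
)

# Pivot every column whose name starts with "Group" into Class/Group pairs,
# dropping the "Group" prefix from the values stored in Class
toy_long <- toy %>%
  pivot_longer(cols = starts_with("Group"),
               names_to = "Class",
               names_prefix = "Group",
               values_to = "Group")
toy_long
```

Each student contributes one row per pivoted column, so two students and two Group columns should give four rows.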
Then you can aggregate the data using class, group, and sex:
#### Aggregate Data by Class x Group ####
sums <- groups %>%
  group_by(Class, Group, Sex) %>%
  summarise(
    Mean = mean(Age),
    Max = max(Age),
    Min = min(Age),
    sd = sd(Age)) %>%
  ungroup()
sums
The result is shown below. Notice that some sd values are NA: in some cases a Class/Group/Sex combination contains only one person, and a standard deviation cannot be computed from a single value:
# A tibble: 16 × 7
   Class      Group Sex    Mean   Max   Min    sd
   <chr>      <chr> <chr> <dbl> <int> <int> <dbl>
 1 GroupChine A     Boy    24      24    24 NA
 2 GroupChine A     Girl   22.3    24    21  1.53
 3 GroupChine B     Girl   20      24    18  2.35
 4 GroupChine B     NA     24      24    24 NA
 5 GroupEng   A     Boy    24      24    24 NA
 6 GroupEng   A     Girl   20.5    24    19  2.38
 7 GroupEng   B     Girl   21.2    24    18  2.5
 8 GroupEng   B     NA     24      24    24 NA
 9 GroupMath  A     Girl   20.8    24    19  2.36
10 GroupMath  A     NA     24      24    24 NA
11 GroupMath  B     Boy    24      24    24 NA
12 GroupMath  B     Girl   21      24    18  2.58
13 GroupScie  A     Girl   21.7    24    20  2.08
14 GroupScie  A     NA     24      24    24 NA
15 GroupScie  B     Boy    24      24    24 NA
16 GroupScie  B     Girl   20.4    24    18  2.51
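To see why a one-person group produces NA, here is a minimal base R check. The variable names are made up for this illustration; the only function used is `sd()` from the stats package:

```r
# sd() uses the sample formula with n - 1 in the denominator,
# so it is undefined when there is only one observation
single_sd <- sd(24)          # one value: NA
pair_sd   <- sd(c(24, 24))   # two identical values: 0

single_sd
pair_sd
```

So an NA in the sd column signals a group of size one, not a problem with the Age data.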
Then you can get gender counts like so:
#### Get Grouped Gender Counts ####
sex <- groups %>%
  group_by(Class, Group) %>%
  count(Sex) %>%
  ungroup()
sex
Which looks like this:
# A tibble: 16 × 4
   Class      Group Sex       n
   <chr>      <chr> <chr> <int>
 1 GroupChine A     Boy       1
 2 GroupChine A     Girl      3
 3 GroupChine B     Girl      5
 4 GroupChine B     NA        1
 5 GroupEng   A     Boy       1
 6 GroupEng   A     Girl      4
 7 GroupEng   B     Girl      4
 8 GroupEng   B     NA        1
 9 GroupMath  A     Girl      4
10 GroupMath  A     NA        1
11 GroupMath  B     Boy       1
12 GroupMath  B     Girl      4
13 GroupScie  A     Girl      3
14 GroupScie  A     NA        1
15 GroupScie  B     Boy       1
16 GroupScie  B     Girl      5
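As a side note, the group_by/count/ungroup sequence can likely be shortened, because `count()` accepts several grouping columns directly. The sketch below compares the two forms on a small hypothetical long-format data frame (`toy_long` and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical long-format data, made up for this illustration
toy_long <- data.frame(
  Class = c("GroupMath", "GroupMath", "GroupMath",
            "GroupEng", "GroupEng", "GroupEng"),
  Group = c("A", "A", "B", "A", "B", "B"),
  Sex   = c("Girl", "Girl", "Boy", "Girl", "Boy", "Girl")
)

# Two-step form, as used above
two_step <- toy_long %>%
  group_by(Class, Group) %>%
  count(Sex) %>%
  ungroup()

# One-step form: pass all three columns to count()
one_step <- toy_long %>%
  count(Class, Group, Sex)

# Compare the two results as plain data frames
all.equal(as.data.frame(two_step), as.data.frame(one_step))
```

Both forms count rows per Class, Group, and Sex combination, so which one to use is mostly a matter of taste.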
Finally, you can join these two data frames in this way:
#### Join ####
sums %>%
  right_join(sex)
This gives you the final product. You can now see where the NA values come from: Row 1, for example, includes only 1 boy, which makes the SD impossible to evaluate:
Joining, by = c("Class", "Group", "Sex")
# A tibble: 16 × 8
   Class      Group Sex    Mean   Max   Min    sd     n
   <chr>      <chr> <chr> <dbl> <int> <int> <dbl> <int>
 1 GroupChine A     Boy    24      24    24 NA        1
 2 GroupChine A     Girl   22.3    24    21  1.53     3
 3 GroupChine B     Girl   20      24    18  2.35     5
 4 GroupChine B     NA     24      24    24 NA        1
 5 GroupEng   A     Boy    24      24    24 NA        1
 6 GroupEng   A     Girl   20.5    24    19  2.38     4
 7 GroupEng   B     Girl   21.2    24    18  2.5      4
 8 GroupEng   B     NA     24      24    24 NA        1
 9 GroupMath  A     Girl   20.8    24    19  2.36     4
10 GroupMath  A     NA     24      24    24 NA        1
11 GroupMath  B     Boy    24      24    24 NA        1
12 GroupMath  B     Girl   21      24    18  2.58     4
13 GroupScie  A     Girl   21.7    24    20  2.08     3
14 GroupScie  A     NA     24      24    24 NA        1
15 GroupScie  B     Boy    24      24    24 NA        1
16 GroupScie  B     Girl   20.4    24    18  2.51     5
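If the separate count and join steps feel like extra work, one possible alternative (a sketch, not the method used in the answer above) is to compute the count inside the same `summarise()` call with `n()`. This should give one row per Class/Group/Sex combination with the statistics and the count side by side. The name `combined` is made up here, and the data frame is rebuilt from the question so the block runs on its own:

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
set.seed(500)
testdf <- data.frame(
  ID = paste0("Stu", 1:10),
  Age = sample(18:25, 10, replace = TRUE),
  Sex = sample(c("Boy", "Girl", "NA"), 10, replace = TRUE),
  Name = c("Pwyll", "Flavian", "Leehi", "Zuzana", "Aniya", "Bogomil",
           "Lameez", "Prudencia", "Ikuo", "Grayson"),
  GroupMath = sample(LETTERS[1:2], 10, replace = TRUE),
  GroupEng = sample(LETTERS[1:2], 10, replace = TRUE),
  GroupScie = sample(LETTERS[1:2], 10, replace = TRUE),
  GroupChine = sample(LETTERS[1:2], 10, replace = TRUE)
)

# Pivot, group, and summarise in a single pipeline
combined <- testdf %>%
  pivot_longer(cols = contains("Group"),
               names_to = "Class",
               values_to = "Group") %>%
  group_by(Class, Group, Sex) %>%
  summarise(
    n = n(),            # group size, replacing the separate count + join
    Mean = mean(Age),
    Max = max(Age),
    Min = min(Age),
    sd = sd(Age),       # NA whenever n is 1
    .groups = "drop"    # return an ungrouped tibble
  )
combined
```

Keeping `n` next to `sd` also makes it easy to confirm that every NA in the sd column corresponds to a group with a single student.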