R dplyr: Add column in group_by to count number of males/females
I have this dataframe:
   treatment  hh_id hh_size   sex   yob g2000 g2002 g2004 p2000
   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Civic Duty     1       2     1  1941     1     1     1     0
 2 Civic Duty     1       2     1  1947     1     1     1     0
 3 Hawthorne      2       3     1  1951     1     1     1     0
 4 Hawthorne      2       3     1  1950     1     1     1     0
 5 Hawthorne      2       3     1  1982     1     1     1     0
 6 Control        3       3     1  1981     0     0     1     0
 7 Control        3       3     1  1959     1     1     1     0
 8 Control        3       3     1  1956     1     1     1     0
 9 Control        4       2     1  1968     0     0     1     0
10 Control        4       2     1  1967     1     1     1     0
I want to group it by hh_id and treatment and summarize the rest of the columns by their mean.
In addition, I want two more columns that count the number of males and females in each household, where in the "sex" column female == 1 and male == 0.
Here's what I have so far:
households <- df %>%
  mutate_if(is.character, factor) %>%
  group_by(hh_id, treatment) %>%
  summarise_if(is.numeric, mean)
View(households)
which gives me this dataframe:
   hh_id treatment  hh_size   sex   yob g2000 g2002 g2004 p2000
   <dbl> <fct>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 Civic Duty       2     1  1944     1     1     1     0
 2     2 Hawthorne        3     1  1961     1     1     1     0
 3     3 Control          3     1 1965. 0.667 0.667     1     0
 4     4 Control          2     1 1968.   0.5   0.5     1     0
 5     5 Control          1     1  1941     1     1     1     0
 6     6 Hawthorne        2     1  1947     1     1     1     0
 7     7 Control          1     1  1969     1     0     1     0
 8     8 Control          2     1  1964     1     1     1   0.5
 9     9 Self             2     1  1956   0.5   0.5     1     0
10    10 Control          1     1  1943     1     1     1     0
Instead of summarise_if, use summarise with across, which is much more flexible. Also, the _if/_at/_all scoped variants have been superseded by across (as of dplyr 1.0.0).
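library(dplyr)
df1 %>%
  group_by(hh_id, treatment) %>%
  # Count first: once across() has run, `sex` refers to the group mean,
  # not the per-person values, so the counts would be wrong afterwards.
  summarise(n_female = sum(sex == 1), n_male = sum(sex == 0),
            across(where(is.numeric) & !c(n_female, n_male), mean))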
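The flexibility is that we can pass multiple sets of columns with different functions to across, as well as do computations on a single column without across. Note that summarise evaluates its arguments in order, so any count based on the original sex values has to come before an across call that replaces sex with its group mean.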
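As a minimal sketch of that flexibility, the example below combines single-column counts with two across calls that apply different functions to different sets of columns. The data frame toy and the choice of sum for the g20* columns are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical toy data (not the poster's data): two households
toy <- data.frame(
  hh_id = c(1L, 1L, 2L, 2L, 2L),
  sex   = c(1L, 0L, 1L, 1L, 0L),
  yob   = c(1941L, 1947L, 1951L, 1950L, 1982L),
  g2000 = c(1L, 1L, 0L, 1L, 1L),
  g2002 = c(1L, 0L, 1L, 1L, 1L)
)

res <- toy %>%
  group_by(hh_id) %>%
  summarise(
    # Single-column computations without across(); done first, while
    # `sex` still holds the per-person values
    n_female = sum(sex == 1),
    n_male   = sum(sex == 0),
    # One set of columns summarised with mean
    across(c(sex, yob), mean),
    # Another set of columns summarised with a different function
    across(starts_with("g20"), sum),
    .groups = "drop"
  )

print(res)
```

Selecting the columns explicitly in each across call, rather than with where(is.numeric), also keeps the freshly created n_female and n_male out of the later summaries.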
df1 <- structure(list(treatment = c("Civic Duty", "Civic Duty", "Hawthorne",
"Hawthorne", "Hawthorne", "Control", "Control", "Control", "Control",
"Control"), hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
hh_size = c(2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L), sex = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), yob = c(1941L, 1947L,
1951L, 1950L, 1982L, 1981L, 1959L, 1956L, 1968L, 1967L),
g2000 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L), g2002 = c(1L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L), g2004 = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), p2000 = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))