R dplyr: Add column in group_by to count number of males/females

I have this dataframe:

treatment  hh_id hh_size   sex   yob g2000 g2002 g2004 p2000
   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Civic Duty     1       2     1  1941     1     1     1     0
 2 Civic Duty     1       2     1  1947     1     1     1     0
 3 Hawthorne      2       3     1  1951     1     1     1     0
 4 Hawthorne      2       3     1  1950     1     1     1     0
 5 Hawthorne      2       3     1  1982     1     1     1     0
 6 Control        3       3     1  1981     0     0     1     0
 7 Control        3       3     1  1959     1     1     1     0
 8 Control        3       3     1  1956     1     1     1     0
 9 Control        4       2     1  1968     0     0     1     0
10 Control        4       2     1  1967     1     1     1     0

I want to group it by hh_id and treatment and summarize the remaining columns by their mean.

In addition, I want two more columns that count the number of males and females in each household, where in the "sex" column female == 1 and male == 0.

Here's what I have so far:

households <- df %>%
  mutate_if(is.character, factor) %>%
  group_by(hh_id, treatment) %>%
  summarise_if(is.numeric, mean)
View(households)

which gives me this dataframe:

   hh_id treatment  hh_size   sex   yob g2000 g2002 g2004 p2000
   <dbl> <fct>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 Civic Duty       2     1 1944  1     1         1   0  
 2     2 Hawthorne        3     1 1961  1     1         1   0  
 3     3 Control          3     1 1965. 0.667 0.667     1   0  
 4     4 Control          2     1 1968. 0.5   0.5       1   0  
 5     5 Control          1     1 1941  1     1         1   0  
 6     6 Hawthorne        2     1 1947  1     1         1   0  
 7     7 Control          1     1 1969  1     0         1   0  
 8     8 Control          2     1 1964  1     1         1   0.5
 9     9 Self             2     1 1956  0.5   0.5       1   0  
10    10 Control          1     1 1943  1     1         1   0  

Instead of summarise_if, use summarise with across, which is much more flexible. The scoped _if/_at/_all variants have also been superseded by across since dplyr 1.0.0.

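library(dplyr)

df1 %>%
  group_by(hh_id, treatment) %>%
  # Count before across(): summarise() evaluates in order, so after across()
  # the name `sex` would refer to the group mean, not the original values
  summarise(n_female = sum(sex == 1), n_male = sum(sex == 0),
            across(where(is.numeric), mean))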
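
One thing worth checking in a call like this is the order of the arguments. summarise() evaluates its expressions in sequence, so an expression placed after across() sees the already-summarised columns rather than the original ones. A minimal sketch on a made-up three-row data frame (`toy` and its values are illustrative, not taken from the question) shows the difference:

```r
library(dplyr)

# Hypothetical toy data: one household with two females (1) and one male (0)
toy <- data.frame(hh_id = c(1, 1, 1),
                  sex   = c(1, 1, 0),
                  yob   = c(1950, 1960, 1970))

# Counts computed before across(): they see the original `sex` vector
counts_first <- toy %>%
  group_by(hh_id) %>%
  summarise(n_female = sum(sex == 1), n_male = sum(sex == 0),
            across(where(is.numeric), mean))

# Counts computed after across(): `sex` is already the group mean (2/3),
# so neither comparison matches and both counts come out as 0
counts_last <- toy %>%
  group_by(hh_id) %>%
  summarise(across(where(is.numeric), mean),
            n_female = sum(sex == 1), n_male = sum(sex == 0))

counts_first$n_female  # 2
counts_last$n_female   # 0
```

In the first version across() also picks up the new count columns, but the mean of a single value is that value, so the counts are unchanged.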

The flexibility of across is that we can apply different functions to different sets of columns, and also compute on a single column without across, all within the same summarise call.
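
As a rough sketch of that flexibility, the call below mixes a plain single-column summary with two across() calls that use different functions. The data frame `toy` is a hypothetical cut-down version of the question's columns, with mixed sexes added for illustration, and the choice of min/max for yob is only an example:

```r
library(dplyr)

# Hypothetical subset of the question's columns (values are made up)
toy <- data.frame(
  hh_id = c(1, 1, 2, 2, 2),
  sex   = c(1, 0, 1, 1, 0),
  yob   = c(1941, 1947, 1951, 1950, 1982),
  g2000 = c(1, 1, 1, 0, 1),
  g2002 = c(1, 0, 1, 1, 1)
)

by_household <- toy %>%
  group_by(hh_id) %>%
  summarise(
    n_female = sum(sex == 1),                  # single column, no across()
    across(starts_with("g2"), mean),           # one function for the g20xx columns
    across(yob, list(min = min, max = max)),   # several functions for another column
    .groups = "drop"
  )

by_household
```

With a named list of functions, across() names the outputs `{column}_{function}` by default, so this produces yob_min and yob_max alongside n_female, g2000 and g2002.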

data

df1 <- structure(list(treatment = c("Civic Duty", "Civic Duty", "Hawthorne", 
"Hawthorne", "Hawthorne", "Control", "Control", "Control", "Control", 
"Control"), hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L), 
    hh_size = c(2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L), sex = c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), yob = c(1941L, 1947L, 
    1951L, 1950L, 1982L, 1981L, 1959L, 1956L, 1968L, 1967L), 
    g2000 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L), g2002 = c(1L, 
    1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L), g2004 = c(1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L), p2000 = c(0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
