Calculate the proportion of several binary variables, grouped by another variable

Question

I have data with several binary variables, and I want to calculate the proportion of each one, grouped by another variable.

Example

I survey people and ask them:
Please mark which of the following fruits you like (you can mark more than one):
☐ Banana ☐ Apple ☐ Orange ☐ Strawberry ☐ Peach

Each person who checks a box gets a 1 in the data; a box left blank is recorded as 0. The data looks like this:

library(dplyr)

set.seed(2021)

my_df <-
  matrix(rbinom(n = 100, size = 1, prob = runif(1)), ncol = 5) %>%
  as.data.frame() %>%
  cbind(1:20, ., sample(c("male", "female"), size = 20, replace = TRUE)) %>%
  setNames(c("person_id", "banana", "apple", "orange", "strawberry", "peach", "gender"))

my_df
#>    person_id banana apple orange strawberry peach gender
#> 1          1      1     1      1          0     0 female
#> 2          2      1     0      0          0     1 female
#> 3          3      0     0      1          0     1 female
#> 4          4      1     1      0          1     0 female
#> 5          5      1     1      1          0     0   male
#> 6          6      1     1      1          0     1 female
#> 7          7      0     1      0          1     1   male
#> 8          8      1     1      0          0     0   male
#> 9          9      1     1      1          0     0 female
#> 10        10      0     0      0          0     0   male
#> 11        11      1     1      1          1     1   male
#> 12        12      1     1      0          0     1   male
#> 13        13      1     1      0          1     0   male
#> 14        14      1     1      0          0     0   male
#> 15        15      0     0      0          0     1   male
#> 16        16      0     1      0          0     1   male
#> 17        17      1     0      0          0     1   male
#> 18        18      1     1      1          1     1   male
#> 19        19      0     0      1          1     1 female
#> 20        20      0     0      0          0     0 female

Created on 2021-02-01 by the reprex package (v0.3.0)

I want to get the proportion for each fruit, split by gender. From this answer I learned how to do it for one variable (for example, banana):

my_df %>%
  group_by(gender) %>%
  summarise(n_of_observations = n(), prop = sum(banana == 1)/n())

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   gender n_of_observations  prop
##   <chr>              <int> <dbl>
## 1 female                 8 0.625
## 2 male                  12 0.667
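
Because the columns hold only 0 and 1, the proportion within a group is also the group mean, so `mean(banana)` should give the same value as `sum(banana == 1) / n()`. A minimal sketch of that equivalence, using a made-up four-row data frame (`toy_df` and its values are hypothetical, not the survey data above):

```r
library(dplyr)

# Hypothetical toy data, only to illustrate the equivalence
toy_df <- data.frame(
  gender = c("female", "female", "male", "male"),
  banana = c(1, 0, 1, 1)
)

toy_res <- toy_df %>%
  group_by(gender) %>%
  summarise(
    prop_sum  = sum(banana == 1) / n(),  # count of 1s divided by group size
    prop_mean = mean(banana),            # identical for strictly 0/1 data
    .groups = "drop"
  )

toy_res
```

The two columns agree only when the variable has no missing values; with `NA` present, `mean()` returns `NA` unless `na.rm = TRUE` is passed, and the two forms then use different denominators.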

But how can I get such a table for all fruits?

Desired output:

##    fruit      gender  prop
##    <chr>      <chr>  <dbl>
##  1 banana     female 0.625
##  2 banana     male   0.667
##  3 apple      female 0.5
##  4 apple      male   0.75
##  5 orange     female 0.625
##  6 orange     male   0.25
##  7 strawberry female 0.25
##  8 strawberry male   0.333
##  9 peach      female 0.5
## 10 peach      male   0.583

I'm looking for a dplyr solution, if possible. Thanks a lot!

Answer 1

You can use across() (available in dplyr 1.0.0 and later) to summarize multiple variables at once:

my_df %>%
  group_by(gender) %>%
  summarise(across(banana:peach, list(n = ~length(.x), prop = ~sum(.x == 1) / n())))


# A tibble: 2 x 11
  gender banana_n banana_prop apple_n apple_prop orange_n orange_prop strawberry_n strawberry_prop peach_n peach_prop
  <chr>     <int>       <dbl>   <int>      <dbl>    <int>       <dbl>        <int>           <dbl>   <int>      <dbl>
1 female        8       0.625       8       0.5         8       0.625            8           0.25        8      0.5  
2 male         12       0.667      12       0.75       12       0.25            12           0.333      12      0.583

Note that the first argument of across() specifies the variables you want to summarize. Here, I wrote banana:peach, which selects every column from banana through peach, inclusive.
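
The across() result is wide: one row per gender and one column per fruit. To get the long fruit/gender/prop layout the question asks for, the wide result can be reshaped with tidyr::pivot_longer(). The following is a sketch rather than part of the original answer. It assumes the survey data printed in the question, typed out by hand so it does not depend on the random number generator, and it uses mean() as the proportion, which is valid for 0/1 columns. The name `long_props` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Survey data as printed in the question, entered literally
my_df <- data.frame(
  person_id  = 1:20,
  banana     = c(1,1,0,1,1,1,0,1,1,0,1,1,1,1,0,0,1,1,0,0),
  apple      = c(1,0,0,1,1,1,1,1,1,0,1,1,1,1,0,1,0,1,0,0),
  orange     = c(1,0,1,0,1,1,0,0,1,0,1,0,0,0,0,0,0,1,1,0),
  strawberry = c(0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,1,1,0),
  peach      = c(0,1,1,0,0,1,1,0,0,0,1,1,0,0,1,1,1,1,1,0),
  gender     = c("female", "female", "female", "female", "male",
                 "female", "male", "male", "female", "male",
                 "male", "male", "male", "male", "male",
                 "male", "male", "male", "female", "female")
)

long_props <- my_df %>%
  group_by(gender) %>%
  # one proportion column per fruit (mean of a 0/1 column)
  summarise(across(banana:peach, mean), .groups = "drop") %>%
  # wide -> long: one row per gender/fruit pair
  pivot_longer(-gender, names_to = "fruit", values_to = "prop") %>%
  # keep fruits in their original column order, then sort by gender
  arrange(match(fruit, names(my_df)), gender) %>%
  select(fruit, gender, prop)

long_props
```

Sorting with match(fruit, names(my_df)) keeps the fruits in the order they appear as columns instead of alphabetically.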

Answer 2

You can use tidyr to pivot your data to long format first and then summarize it:

library(tidyr)

tidyr::pivot_longer(my_df, banana:peach,
                    names_to = "fruit") %>% 
  dplyr::group_by(gender, fruit) %>% 
  dplyr::summarize(prop = sum(value) / n())

   gender fruit       prop
   <chr>  <chr>      <dbl>
 1 female apple      0.5  
 2 female banana     0.625
 3 female orange     0.625
 4 female peach      0.5  
 5 female strawberry 0.25 
 6 male   apple      0.75 
 7 male   banana     0.667
 8 male   orange     0.25 
 9 male   peach      0.583
10 male   strawberry 0.333

You can pipe the result to arrange() if you want to sort by fruit. You can also add the number of observations inside summarize() with n = n().
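
A short sketch of both suggestions together, run on a small made-up data frame (`toy_df` and its values are hypothetical, chosen only to keep the example compact):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two fruits
toy_df <- data.frame(
  gender = c("female", "female", "male", "male", "male"),
  banana = c(1, 0, 1, 1, 0),
  apple  = c(0, 0, 1, 0, 1)
)

toy_summary <- toy_df %>%
  pivot_longer(banana:apple, names_to = "fruit") %>%
  group_by(fruit, gender) %>%
  summarise(
    prop = sum(value) / n(),  # proportion of 1s in the group
    n    = n(),               # number of observations in the group
    .groups = "drop"
  ) %>%
  arrange(fruit, gender)      # sort by fruit, then by gender

toy_summary
```

Note that arrange(fruit, gender) sorts the fruit names alphabetically; keeping the original column order would need an explicit ordering, for example a factor with the desired levels.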
