Calculate proportion of several binary variables by another variable
I have data with several binary variables, and I want to calculate the proportion of each one, grouped by another variable.
I survey people and ask them:
Please mark which of the following fruits you like (you can mark more than one):
☐ Banana ☐ Apple ☐ Orange ☐ Strawberry ☐ Peach
Each person who checked a box gets 1 in the data for that fruit; a box left blank is coded as 0. The data looks like this:
library(dplyr)
set.seed(2021)
my_df <-
  matrix(rbinom(n = 100, size = 1, prob = runif(1)), ncol = 5) %>%
  as.data.frame() %>%
  cbind(1:20, ., sample(c("male", "female"), size = 20, replace = TRUE)) %>%
  setNames(c("person_id", "banana", "apple", "orange", "strawberry", "peach", "gender"))
my_df
#>    person_id banana apple orange strawberry peach gender
#> 1          1      1     1      1          0     0 female
#> 2          2      1     0      0          0     1 female
#> 3          3      0     0      1          0     1 female
#> 4          4      1     1      0          1     0 female
#> 5          5      1     1      1          0     0   male
#> 6          6      1     1      1          0     1 female
#> 7          7      0     1      0          1     1   male
#> 8          8      1     1      0          0     0   male
#> 9          9      1     1      1          0     0 female
#> 10        10      0     0      0          0     0   male
#> 11        11      1     1      1          1     1   male
#> 12        12      1     1      0          0     1   male
#> 13        13      1     1      0          1     0   male
#> 14        14      1     1      0          0     0   male
#> 15        15      0     0      0          0     1   male
#> 16        16      0     1      0          0     1   male
#> 17        17      1     0      0          0     1   male
#> 18        18      1     1      1          1     1   male
#> 19        19      0     0      1          1     1 female
#> 20        20      0     0      0          0     0 female
Created on 2021-02-01 by the reprex package (v0.3.0)
I want to get the proportion for each fruit, split by gender. From this answer I learned how to do it for one variable (for example, banana):
my_df %>%
  group_by(gender) %>%
  summarise(n_of_observations = n(), prop = sum(banana == 1) / n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## gender n_of_observations prop
## <chr> <int> <dbl>
## 1 female 8 0.625
## 2 male 12 0.667
But how can I get such a table for all fruits?
Desired output:
## fruit gender prop
## <chr> <chr> <dbl>
## 1 banana female 0.6
## 2 banana male 0.4
## 3 apple female 0.4
## 4 apple male 0.3
## 5 orange female 0.3
## 6 orange male 0.1
## 7 strawberry female 0.4
## 8 strawberry male 0.4
## 9 peach female 0.3
## 10 peach male 0.6
I'm looking for a dplyr solution, if possible. Thanks a lot!
You can use across() to summarize multiple variables at once:
my_df %>%
  group_by(gender) %>%
  summarise(across(banana:peach, list(n = ~length(.x), prop = ~sum(.x == 1) / n())))
# A tibble: 2 x 11
gender banana_n banana_prop apple_n apple_prop orange_n orange_prop strawberry_n strawberry_prop peach_n peach_prop
<chr> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl>
1 female 8 0.625 8 0.5 8 0.625 8 0.25 8 0.5
2 male 12 0.667 12 0.75 12 0.25 12 0.333 12 0.583
Note that the first argument of across() specifies the variables you want to summarize. Here, I wrote banana:peach, meaning all columns from banana to peach (inclusive), in the order they appear in the data frame.
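If you want the long fruit / gender / prop layout from the question rather than one column per fruit, the across() result can be reshaped afterwards. Here is a minimal sketch, assuming dplyr 1.0.0 or later and tidyr are installed; toy_df is a small made-up data set standing in for my_df, and long_props is just a name chosen for this example. It also uses mean() instead of sum(.x == 1) / n(), which gives the same number for 0/1 columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as my_df (values are made up)
toy_df <- data.frame(
  person_id = 1:6,
  banana    = c(1, 0, 1, 1, 0, 1),
  apple     = c(0, 0, 1, 1, 1, 0),
  gender    = c("female", "female", "female", "male", "male", "male")
)

long_props <- toy_df %>%
  group_by(gender) %>%
  # the mean of a 0/1 column is the share of 1s
  summarise(across(banana:apple, mean)) %>%
  # turn one column per fruit into one row per gender/fruit pair
  pivot_longer(-gender, names_to = "fruit", values_to = "prop") %>%
  select(fruit, gender, prop)

long_props
```

With the real data you would replace toy_df with my_df and banana:apple with banana:peach.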
You can use tidyr to pivot your data to long format first and then summarize it:
library(tidyr)
tidyr::pivot_longer(my_df, banana:peach,
                    names_to = "fruit") %>%
  dplyr::group_by(gender, fruit) %>%
  dplyr::summarize(prop = sum(value) / dplyr::n())
gender fruit prop
<chr> <chr> <dbl>
1 female apple 0.5
2 female banana 0.625
3 female orange 0.625
4 female peach 0.5
5 female strawberry 0.25
6 male apple 0.75
7 male banana 0.667
8 male orange 0.25
9 male peach 0.583
10 male strawberry 0.333
You can pipe the result to arrange() if you want to sort by fruit. You can also add the number of observations inside summarize() with n = n().
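Putting those two suggestions together might look like the sketch below. It assumes dplyr 1.0.0 or later and tidyr; toy_df is a small made-up data set standing in for my_df, and result is just a name chosen for this example. The .groups = "drop" argument is my addition to return an ungrouped tibble and silence the regrouping message.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as my_df (values are made up)
toy_df <- data.frame(
  person_id = 1:6,
  banana    = c(1, 0, 1, 1, 0, 1),
  apple     = c(0, 0, 1, 1, 1, 0),
  gender    = c("female", "female", "female", "male", "male", "male")
)

result <- toy_df %>%
  # the 0/1 answers go to the default "value" column
  pivot_longer(banana:apple, names_to = "fruit") %>%
  group_by(fruit, gender) %>%
  summarise(
    n    = n(),          # number of respondents in each fruit/gender cell
    prop = mean(value),  # share of 1s, same as sum(value) / n()
    .groups = "drop"
  ) %>%
  arrange(fruit, gender) # alphabetical by fruit, then by gender

result
```

arrange() sorts character columns alphabetically. If you would rather keep the column order of the original data frame, one option is to convert fruit to a factor with the levels in that order before calling arrange().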