
Summarise but keep length variable (dplyr)

Basic dplyr question... Respondents could select multiple companies that they use. For example:

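library(dplyr)
library(tidyr)  # needed for gather() in the pipelines below

# One row per respondent; 1 = company selected, 0 = not selected
test <- tibble(
  CompanyA = rep(0:1, 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)
test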

If it were a forced-choice question (i.e., respondents could make only one selection), I would do the following for a basic summary table:

test %>% 
  # funs() has been deprecated since dplyr 0.8.0; passing sum directly is equivalent
  summarise_all(sum, na.rm = TRUE) %>% 
  gather(Response, n) %>% 
  arrange(desc(n)) %>% 
  mutate("%" = round(100 * n / sum(n)))

Note, however, that the "%" column is not what I want: it divides by the total number of selections. I'm instead looking for the proportion of total respondents for each individual response option (since they could make multiple selections).

I've tried adding mutate(totalrows = nrow(.)) %>% before the summarise_all command. This would allow me to use that variable as the denominator in a later mutate command. However, summarise_all eliminates the "totalrows" variable.

Also, if there's a better way to do this, I'm open to ideas.
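The "totalrows" idea from the question can be made to work by keeping the denominator outside the pipeline. The following is only a sketch, assuming each row of test is one respondent; the names n_respondents and pct_of_respondents are made up for this illustration.

```r
library(dplyr)
library(tidyr)

test <- tibble(
  CompanyA = rep(0:1, 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)

# Assumption: one row per respondent, so the row count is the denominator.
# Storing it in an ordinary variable means summarise_all() cannot touch it.
n_respondents <- nrow(test)

result <- test %>%
  summarise_all(sum, na.rm = TRUE) %>%   # selections per company
  gather(Response, n) %>%                # one row per company
  arrange(desc(n)) %>%
  mutate(pct_of_respondents = round(100 * n / n_respondents))

result
```

Because the percentages are taken against respondents rather than selections, they are not expected to add up to 100.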

To get the proportion of respondents who chose an option when that variable is binary, you can take the mean. To do this with your test data, you can use sapply:

sapply(test, mean)
CompanyA CompanyB CompanyC 
     0.5      1.0      0.8 
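The mean works here because, for 0/1 data, the sum counts the selections and the length counts the respondents. A possible dplyr equivalent is sketched below, assuming dplyr 1.0.0 or later (for across()); the name props is hypothetical. Note that na.rm = TRUE changes the denominator to the number of non-missing answers, which may or may not be what you want.

```r
library(dplyr)

test <- tibble(
  CompanyA = rep(0:1, 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)

# Column-wise mean of 0/1 indicators = proportion of respondents selecting each company.
# With na.rm = TRUE, missing answers are dropped from the denominator as well.
props <- test %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

props
```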

If your data is not binary encoded (say it is stored as 1 and 2 instead), the mean no longer gives the proportion. In that case you can reshape to long format and count the matching values explicitly:

test %>% 
    gather(key='Company') %>% 
    group_by(Company) %>% 
    summarise(proportion = sum(value == 1) / n())

# A tibble: 3 x 2
  Company  proportion
  <chr>         <dbl>
1 CompanyA        0.5
2 CompanyB        1  
3 CompanyC        0.8
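To make the 1/2 case concrete, here is a sketch on a recoded copy of the data. The table test12 and its coding (1 = selected, 2 = not selected) are assumptions for illustration, not part of the original question.

```r
library(dplyr)
library(tidyr)

# Hypothetical recoding of the example data: 1 = selected, 2 = not selected
test12 <- tibble(
  CompanyA = rep(c(2, 1), 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 2, 2, 1, 1, 1, 1)
)

props12 <- test12 %>%
  gather(key = "Company") %>%          # long format; values land in a column named "value"
  group_by(Company) %>%
  summarise(proportion = sum(value == 1) / n())  # share of rows coded as "selected"

props12
```

Taking mean(value == 1) inside summarise would give the same number, since the comparison yields a logical vector.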

If you put all the functions in a list inside summarise_all, you can compute the row count, the sum, and the proportion in one pass. You'll need to do some quick tidying up afterwards, though.

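test %>% 
  summarise_all(
    list(
      rows = length,                                           # number of respondents
      n    = function(x) sum(x, na.rm = TRUE),                 # number of selections
      perc = function(x) sum(x, na.rm = TRUE) / length(x)      # proportion of respondents
    )) %>%
  # columns are now named like CompanyA_rows, CompanyA_n, CompanyA_perc
  tidyr::gather(Response, n) %>%
  tidyr::separate(Response, c("Company", "Metric"), '_') %>%
  tidyr::spread(Metric, n)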

And you'll get this:

  Company      n  perc  rows
  <chr>    <dbl> <dbl> <dbl>
1 CompanyA     5   0.5    10
2 CompanyB    10   1      10
3 CompanyC     8   0.8    10

Here is a solution using tidyr::gather:

test %>% 
  gather(Company, response) %>% 
  group_by(Company) %>% 
  summarise(`%` = 100 * sum(response) / n())
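Since tidyr 1.0.0, gather() has been superseded by pivot_longer(). A roughly equivalent version of the pipeline above might look like the following sketch; the column names selected and rows and the object summary_tbl are made up for this example.

```r
library(dplyr)
library(tidyr)

test <- tibble(
  CompanyA = rep(0:1, 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)

summary_tbl <- test %>%
  # long format: one row per respondent-company pair
  pivot_longer(everything(), names_to = "Company", values_to = "response") %>%
  group_by(Company) %>%
  summarise(
    rows     = n(),                          # respondents (rows) per company
    selected = sum(response, na.rm = TRUE),  # respondents who picked it
    `%`      = 100 * selected / rows         # percent of respondents
  )

summary_tbl
```

Keeping rows and selected alongside the percentage makes it easy to check the denominator that was used.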
