简体   繁体   中英

How can I aggregate data with categorical responses to get the percentage of each response type in R?

I want to get percentages of categorical answer types for different types of questions (TYPE). I have multiple responses for each type for each individual, with multiple, categorical responses (different levels).

1) each individual should be on a different row, and
2) the columns should be the TYPES+Response Level, with the value being percentage of times that particular response level was given for that question type for that individual.

The DATA looks like this:

SUBJECT TYPE    RESPONSE  
John    a   kappa                       
John    b   gamma  
John    a   delta  
John    a   gamma  
Mary    a   kappa   
Mary    a   delta       
Mary    b   kappa  
Mary    a   gamma  
Bill    b   delta  
Bill    a   gamma  

The result should look like this:

SUBJECT a-kappa     a-gamma   a-delta   b-kappa     b-gamma b-delta
John    0.33        0.33      0.33      1.00        1.00    0.00
Mary    0.66        0.33      0.00      1.00        0.00    0.00
Bill    1.00        0.00      0.00      0.00        0.00    1.00

Based on c1au61o_HH's answer I was able to create something that works for my actual data file, but will still need some post-processing. (It is also not very elegant, but that's a minor concern.)

 Finaldf <- mydata %>%     
 group_by(Subject,Type) %>%     
 mutate(TOT = n()) %>%      
 group_by(Subject, Response, Type) %>%     
 mutate(RESPTOT = n())     

 Finaldf <- distinct(Finaldf)    
 Finaldf$Percentage <- Finaldf$RESPTOT/Finaldf$TOT    

Any help is much appreciated, also please with some explanation.

Probably this is not the most efficient way, but if you want to use tidyverse you can unite the 2 columns and then do 2 different group_by to calculate totals for each subjects and percents.

library(tidyverse)
df %>% 
  unite(TYPE_RESPONSE, c("TYPE", "RESPONSE"), sep = "_") %>% 
  group_by(SUBJECT) %>% 
  mutate(TOT = n()) %>% 
  group_by(SUBJECT, TYPE_RESPONSE) %>% 
  summarize(perc = n()/TOT * 100) %>% 
  spread(TYPE_RESPONSE, perc)

DATA:

df <- tibble( SUBJECT= rep(c("John", "Mary","Bill"), each = 4), 
                 TYPE = rep(c("a","b"), 6),
                 RESPONSE = rep(c("kappa", "gamma", "delta"), 4)
)

EDIT in reply to comment:

I understand that you want to calculate the percentage by SUBJECT and TYPE , so the code would be something like this:

library(tidyverse)
df %>% 
  group_by(SUBJECT, TYPE) %>% 
  mutate(TOT = n()) %>%
  unite(TYPE_RESPONSE, c("TYPE", "RESPONSE"), sep = "_") %>% 
  group_by(SUBJECT, TYPE_RESPONSE) %>% 
  summarize(perc = n()/TOT * 100)%>% 
  spread(TYPE_RESPONSE, perc)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM