
R: group by multiple columns and get the mean value per group based on a different column

The data set contains age, gender, state, income, and group. The group column represents the group that each user belongs to:

     group      gender state age       income
 1       3      Female  CA     33  $75,000 - $99,999
 2       3        Male  MA     41  $50,000 - $74,999
 3       3        Male  KY     32  $35,000 - $49,999
 4       2      Female  CA     23  $35,000 - $49,999
 5       3        Male  KY     25  $50,000 - $74,999
 6       3        Male  MA     21  $75,000 - $99,999
 7       3      Female  CA     33  $75,000 - $99,999
 8       3        Male  MA     41  $50,000 - $74,999
 9       3        Male  KY     32  $35,000 - $49,999
10       2      Female  CA     23  $35,000 - $49,999
11       3        Male  KY     25  $50,000 - $74,999
12       3      Female  MA     21  $75,000 - $99,999

The above is dummy data; the goal is to get the concept right.

The goal is to group by group, gender, and income, get the counts, and for each group get the mean age of the users who belong to that group. The result should then be arranged in the following structure ("Expanded Version"):

    group  male female CA  MA  KY  $35,000 - $49,999  $50,000 - $74,999 $75,000 - $99,999  mean_age
     2      0     2     2   0   0          2                1              0                   23
...

Here is my attempt so far, using dplyr:

> data %>% group_by(group, 
+ gender, 
+ state, 
+ income) %>% 
+ summarize(n()) %>% 
+ mutate(mean_age = mean(age))

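I was also exploring the spread() function.

You can do both the count and the mean in one call to summarize(). (In the attempt above, summarize(n()) drops the age column, so the later mutate() has no age left to average.)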

library(dplyr)    

data %>% group_by(group, 
                  gender, 
                  state, 
                  income) %>% 
  summarize(count = n(), mean_age = mean(age))

For the wide data, the variable names in your sample won't uniquely identify what a given data point means: the unique units are group x gender x state x income, but the sample has only one row per group.

Since you have two summaries, the summary type is an additional layer of the unique identification. So to get everything in one row you would need variable names like [group]_[gender]_[state]_[income]_[summary], for example 2_Female_CA_$35,000 - $49,999_count.

There may be a better wide shape - what type of calculations are you doing on the wide data frame?
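If the "Expanded Version" in the question is read as one row per group, with a count for each level of gender, state, and income plus the group's mean age, a sketch along the following lines might work. This is only one interpretation, not necessarily what the asker intended. It assumes tidyr 1.0 or later (for pivot_longer() and pivot_wider()), and the names mean_age_by_group, counts, and wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, typed out so the block is self-contained
df <- data.frame(
  group  = c(3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state  = c("CA", "MA", "KY", "CA", "KY", "MA",
             "CA", "MA", "KY", "CA", "KY", "MA"),
  age    = c(33, 41, 32, 23, 25, 21, 33, 41, 32, 23, 25, 21),
  income = c("$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999",
             "$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999"),
  stringsAsFactors = FALSE
)

# Mean age per group: one row per group
mean_age_by_group <- df %>%
  group_by(group) %>%
  summarize(mean_age = mean(age))

# Stack gender, state and income into one "level" column, count each level
# within each group, then turn the levels into columns (0 where absent).
# Assumption: level labels do not collide across the three variables.
counts <- df %>%
  pivot_longer(c(gender, state, income),
               names_to = "variable", values_to = "level") %>%
  count(group, level) %>%
  pivot_wider(names_from = level, values_from = n, values_fill = 0)

# One row per group: counts per level plus the mean age
wide <- left_join(counts, mean_age_by_group, by = "group")
as.data.frame(wide)
```

Here the counts are marginal (for example, the number of females in a group regardless of state or income), which is why a single row per group is enough.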

In addition to @treysp's answer, you could use unite() and spread() to create a wide (and unwieldy) table. (I'm using as.data.frame() only to force printing all columns.)

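library(tidyverse)

df %>%
    group_by(group, gender, state, income) %>%
    # count and mean age for each group/gender/state/income combination
    summarize(n = n(), mean_age = mean(age)) %>%
    # combine the three labels into a single key column
    unite(key, gender, state, income) %>%
    # one column of counts per key
    spread(key, n) %>%
    as.data.frame()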
#  group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1     2       23                           2                          NA
#2     3       21                          NA                          NA
#3     3       25                          NA                          NA
#4     3       32                          NA                          NA
#5     3       33                          NA                           2
#6     3       41                          NA                          NA
#  Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1                          NA                        NA
#2                           1                        NA
#3                          NA                        NA
#4                          NA                         2
#5                          NA                        NA
#6                          NA                        NA
#  Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1                        NA                        NA                        NA
#2                        NA                        NA                         1
#3                         2                        NA                        NA
#4                        NA                        NA                        NA
#5                        NA                        NA                        NA
#6                        NA                         2                        NA
#

Sample data

df <- read.table(text =
    "group      gender state age       income
 1       3      Female  CA     33  '$75,000 - $99,999'
 2       3        Male  MA     41  '$50,000 - $74,999'
 3       3        Male  KY     32  '$35,000 - $49,999'
 4       2      Female  CA     23  '$35,000 - $49,999'
 5       3        Male  KY     25  '$50,000 - $74,999'
 6       3        Male  MA     21  '$75,000 - $99,999'
 7       3      Female  CA     33  '$75,000 - $99,999'
 8       3        Male  MA     41  '$50,000 - $74,999'
 9       3        Male  KY     32  '$35,000 - $49,999'
10       2      Female  CA     23  '$35,000 - $49,999'
11       3        Male  KY     25  '$50,000 - $74,999'
12       3      Female  MA     21  '$75,000 - $99,999'", header = T, row.names = 1)
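As a possible variation on the unite()/spread() pipeline above: in tidyr 1.0 and later, spread() is superseded by pivot_wider(), which accepts several columns in names_from and joins their values with "_". The sketch below also keeps mean_age out of the reshaping step and joins a per-group mean afterwards, so each group ends up on a single row. The names wide_counts and result are made up for illustration, and this is not the original answerer's code.

```r
library(dplyr)
library(tidyr)

# Same sample data as above, repeated so the block runs on its own
df <- read.table(text =
    "group      gender state age       income
 1       3      Female  CA     33  '$75,000 - $99,999'
 2       3        Male  MA     41  '$50,000 - $74,999'
 3       3        Male  KY     32  '$35,000 - $49,999'
 4       2      Female  CA     23  '$35,000 - $49,999'
 5       3        Male  KY     25  '$50,000 - $74,999'
 6       3        Male  MA     21  '$75,000 - $99,999'
 7       3      Female  CA     33  '$75,000 - $99,999'
 8       3        Male  MA     41  '$50,000 - $74,999'
 9       3        Male  KY     32  '$35,000 - $49,999'
10       2      Female  CA     23  '$35,000 - $49,999'
11       3        Male  KY     25  '$50,000 - $74,999'
12       3      Female  MA     21  '$75,000 - $99,999'", header = TRUE, row.names = 1)

# Count each gender/state/income combination per group, then widen:
# one column per combination, 0 where a combination does not occur
wide_counts <- df %>%
  count(group, gender, state, income) %>%
  pivot_wider(names_from = c(gender, state, income),
              values_from = n, values_fill = 0)

# Attach the mean age of each group as a separate per-group summary
result <- df %>%
  group_by(group) %>%
  summarize(mean_age = mean(age)) %>%
  left_join(wide_counts, by = "group")

as.data.frame(result)
```

Filling with 0 instead of NA is a design choice: a combination that never occurs in a group is a true zero count here, not a missing value.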
