How to get summary statistics by group

Question

I'm trying to get multiple summary statistics in R/S-PLUS, grouped by a categorical column, in one shot. I found a couple of functions, but all of them compute one statistic per call, such as aggregate().

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)
mg <- aggregate(df$dt, by=list(df$group), FUN=mean)   # 'by' must be a list
mg <- aggregate(df$dt, by=list(df$group), FUN=sum)

What I'm looking for is a way to get multiple statistics for the same group (mean, min, max, standard deviation, etc.) in one call. Is that doable?

Answer 1

1. `tapply`

I'll put in my two cents for tapply().

tapply(df$dt, df$group, summary)

You could write a custom function with the specific statistics you want or format the results:

tapply(df$dt, df$group,
  function(x) format(summary(x), scientific = TRUE))
$A
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"5.900e+01" "5.975e+01" "6.100e+01" "6.100e+01" "6.225e+01" "6.300e+01" 

$B
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"6.300e+01" "6.425e+01" "6.550e+01" "6.600e+01" "6.675e+01" "7.100e+01" 

$C
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"6.600e+01" "6.725e+01" "6.800e+01" "6.800e+01" "6.800e+01" "7.100e+01" 

$D
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"5.600e+01" "5.975e+01" "6.150e+01" "6.100e+01" "6.300e+01" "6.400e+01"
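As a rough sketch of the "custom function" idea mentioned above: the helper name my_stats and the particular statistics it returns are just assumptions for illustration, so swap in whatever you need. It uses only base R and the data from the question.

```r
# Example data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# Hypothetical helper: returns a named vector with the statistics of interest
my_stats <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# tapply() gives one named vector per group; rbind stacks them into a matrix
# with one row per group and one column per statistic
stats_by_group <- do.call(rbind, tapply(df$dt, df$group, my_stats))
stats_by_group
```

Because the result is a plain numeric matrix, you can index it directly, e.g. stats_by_group["A", "mean"].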

2. `data.table`

The data.table package offers a lot of helpful and fast tools for these types of operation:

library(data.table)
setDT(df)
df[, as.list(summary(dt)), by = group]
   group Min. 1st Qu. Median Mean 3rd Qu. Max.
1:     A   59   59.75   61.0   61   62.25   63
2:     B   63   64.25   65.5   66   66.75   71
3:     C   66   67.25   68.0   68   68.00   71
4:     D   56   59.75   61.5   61   63.00   64

Answer 2

The dplyr package could be a nice alternative for this problem:

library(dplyr)

df %>% 
  group_by(group) %>% 
  summarize(mean = mean(dt),
            sum = sum(dt))

To get the first and third quartiles:

df %>% 
  group_by(group) %>% 
  summarize(q1 = quantile(dt, 0.25),
            q3 = quantile(dt, 0.75))
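The two calls above can be merged into a single summarize() so that every statistic comes back in one table. This is only a sketch; the column names (mean_dt, sd_dt, and so on) are my own choice, not anything dplyr requires.

```r
library(dplyr)

# Example data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# One row per group, one column per statistic
res <- df %>%
  group_by(group) %>%
  summarize(n       = n(),
            mean_dt = mean(dt),
            sd_dt   = sd(dt),
            min_dt  = min(dt),
            q1_dt   = quantile(dt, 0.25),
            q3_dt   = quantile(dt, 0.75),
            max_dt  = max(dt))
res
```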

Answer 3

Using Hadley Wickham's purrr package this is quite simple. Use split() to split the data frame into groups, then use map() to apply summary() to each group.

library(purrr)

df %>% split(.$group) %>% map(summary)

Answer 4

There are many different ways to solve this problem, but I prefer describeBy in the psych package:

library(psych)
describeBy(df$dt, df$group, mat = TRUE)

Answer 5

Take a look at the plyr package, specifically ddply():

library(plyr)
ddply(df, .(group), summarise, mean=mean(dt), sum=sum(dt))

Answer 6

After five long years I'm sure this answer won't receive much attention, but to make the set of options complete, here is one with data.table:

library(data.table)
setDT(df)[ , list(mean_gr = mean(dt), sum_gr = sum(dt)) , by = .(group)]
#   group mean_gr sum_gr
#1:     A      61    244
#2:     B      66    396
#3:     C      68    408
#4:     D      61    488

Answer 7

The psych package has a great option for grouped summary stats:

library(psych)
    
describeBy(df$dt, group = df$group)

This produces lots of useful stats, including mean, median, range, sd, and se.

Answer 8

Besides describeBy, the doBy package is another option. It provides much of the functionality of SAS PROC SUMMARY. Details: http://www.statmethods.net/stats/descriptives.html

Answer 9

While some of the other approaches work, this is pretty close to what you were doing and uses only base R. If you know aggregate(), this may be more intuitive.

with( df , aggregate( dt , by=list(group) , FUN=summary)  )
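One thing to watch with this approach: when FUN returns several values, aggregate() stores them in a single matrix column rather than in separate columns. Here is a minimal sketch of passing your own multi-statistic function and then flattening the result; the choice of statistics is just an example.

```r
# Example data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# FUN returns a named vector, so agg has two columns: group and a matrix dt
agg <- aggregate(dt ~ group, data = df,
                 FUN = function(x) c(mean = mean(x), sd = sd(x),
                                     min = min(x), max = max(x)))

# Flatten the matrix column into ordinary columns (dt.mean, dt.sd, ...)
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, each statistic is a regular column, which is easier to sort, filter, or export.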

Answer 10

Not sure why the popular skimr package hasn't been brought up. Their function skim() was meant to replace the base R summary() and supports dplyr grouping:

library(dplyr)
library(skimr)

starwars %>%
  group_by(gender) %>%
  skim()

#> ── Data Summary ────────────────────────
#>                            Values    
#> Name                       Piped data
#> Number of rows             87        
#> Number of columns          14        
#> _______________________              
#> Column type frequency:               
#>   character                7         
#>   list                     3         
#>   numeric                  3         
#> ________________________             
#> Group variables            gender    
#> 
#> ── Variable type: character ──────────────────────────────────────────────────────
#>    skim_variable gender    n_missing complete_rate   min   max empty n_unique
#>  1 name          feminine          0         1         3    18     0       17
#>  2 name          masculine         0         1         3    21     0       66
#>  3 name          <NA>              0         1         8    14     0        4
#>  4 hair_color    feminine          0         1         4     6     0        6
#>  5 hair_color    masculine         5         0.924     4    13     0        9
#>  6 hair_color    <NA>              0         1         4     7     0        4
#> # [...]
#> 
#> ── Variable type: list ───────────────────────────────────────────────────────────
#>   skim_variable gender    n_missing complete_rate n_unique min_length max_length
#> 1 films         feminine          0             1        9          1          5
#> 2 films         masculine         0             1       24          1          7
#> 3 films         <NA>              0             1        3          1          2
#> 4 vehicles      feminine          0             1        3          0          1
#> 5 vehicles      masculine         0             1        9          0          2
#> 6 vehicles      <NA>              0             1        1          0          0
#> # [...]
#> 
#> ── Variable type: numeric ────────────────────────────────────────────────────────
#>   skim_variable gender    n_missing complete_rate  mean     sd    p0   p25   p50
#> 1 height        feminine          1         0.941 165.   23.6     96 162.  166. 
#> 2 height        masculine         4         0.939 177.   37.6     66 171.  183  
#> 3 height        <NA>              1         0.75  181.    2.89   178 180.  183  
#> # [...]

Answer 11

I would also recommend gtsummary (written by Daniel D. Sjoberg et al.). You can generate publication-ready or presentation-ready tables with the package. A gtsummary solution to the example given in the question would be:

library(tidyverse)
library(gtsummary)

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)


tbl_summary(df, 
            by=group,
            type = all_continuous() ~ "continuous2",
            statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}")) %>% 
  add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))

which then gives you the output below:

Characteristic	A, N = 4	B, N = 6	C, N = 6	D, N = 8
dt
Mean (SD)	61.0 (1.8)	66.0 (2.8)	68.0 (1.7)	61.0 (2.6)
Median (IQR)	61.0 (2.5)	65.5 (2.5)	68.0 (0.8)	61.5 (3.2)
Min- Max	59.0 - 63.0	63.0 - 71.0	66.0 - 71.0	56.0 - 64.0

You can also export the table as a Word document by doing the following:

Table1 <-  tbl_summary(df, 
                by=group,
                type = all_continuous() ~ "continuous2",
                statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}")) %>% 
      add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))

tmp1 <- "~path/name.docx"

Table1 %>% 
  as_flex_table() %>% 
  flextable::save_as_docx(path=tmp1)

You can use it for regression outputs as well. See the package reference manual and the package webpage for further insights:

https://cran.r-project.org/web/packages/gtsummary/index.html
https://www.danieldsjoberg.com/gtsummary/index.html

Answer 12

This may also work:

spl <- split(mtcars, mtcars$cyl)
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[,3:6], 2, summary)))
list.of.summaries
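The same split-and-apply idea can be applied to the data from the question. In this sketch I use sapply() instead of lapply() so the per-group summaries are collected into one matrix; that substitution is my own variation, not part of the answer above.

```r
# Example data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# split() yields one numeric vector per group; sapply() simplifies the
# summaries into a matrix (statistics in rows), and t() puts groups in rows
summary_matrix <- t(sapply(split(df$dt, df$group), summary))
summary_matrix
```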

Answer 13

First, it depends on your version of R. If you are on R 2.11 or later, you can use aggregate() with a function that returns multiple results (summary, for instance, or your own function). If not, you can use the answer made by Justin.

Answer 14

With more recent (>1.0) versions of dplyr you can do so with

iris %>% 
  group_by(Species)  %>% 
  summarise(as_tibble(rbind(summary(Sepal.Length))))

14 answers

solution1
129 2012-03-24 10:12:33

1. `tapply`

2. `data.table`

solution2
57 2014-11-10 10:59:06

solution3
39 2016-08-12 14:52:20

solution4
19 2012-03-24 05:46:24

solution5
12 2012-03-23 22:13:41

solution6
10 2017-01-23 16:53:44

solution7
7 2020-03-09 10:50:32

solution8
6 2013-12-26 05:04:51

solution9
5 2019-04-22 12:18:16

solution10
2 2021-05-10 20:43:02

solution11
2 2022-03-21 14:05:50

solution12
1 2021-03-03 08:47:54

solution13
1 2012-03-23 23:40:34

solution14
0 2021-11-28 18:55:32
