
R - how to split summarize tables without for loop?

I'm using R for a project, but I am very new to R and not very familiar with it. I have a single dataset, and I want to split it and display separate summaries using the summarize function. I wrote some code using a for loop, but I understand that for loops are usually avoided in R, due to its functional nature.

Basically, I want to learn how to convert my code into a more functional approach using a map or perhaps a group_split function, or whatever else would work. I've tried a few things and haven't figured it out yet.

I've written an example of what I am trying to do using a built-in R dataset:

library(tidyverse)
data(mtcars)

unique_gears <- unique(mtcars$gear)

for (g in unique_gears) {
  summ <- mtcars %>%
    filter(gear == g) %>%
    group_by(gear, cyl) %>%
    summarize(min = min(mpg), max = max(mpg), mean = mean(mpg))

  print(summ)
}

Using the mtcars dataset, this prints 3 separate summary tables, one per number of gears, with each table showing the minimum, maximum and mean mpg for each number of cylinders.

I tried to look at ways to do that without using the for loop.

For example, I tried this:

mtcars %>% group_by(gear) %>% group_split() %>% group_by(cyl) %>% summarize(min = min(mpg))

I have the second group_by in there because I want the final summarize output to be grouped by another column (and I am using cyl for this example).

We don't need a loop here. Instead, we can group by two columns:

library(dplyr) # 1.0.0
mtcars %>% 
    group_by(gear, cyl) %>% 
    summarise(across(mpg, list(Min = min, Max = max, Mean = mean)))
# A tibble: 8 x 5
# Groups:   gear [3]
#   gear   cyl mpg_Min mpg_Max mpg_Mean
#  <dbl> <dbl>   <dbl>   <dbl>    <dbl>
#1     3     4    21.5    21.5     21.5
#2     3     6    18.1    21.4     19.8
#3     3     8    10.4    19.2     15.0
#4     4     4    21.4    33.9     26.9
#5     4     6    17.8    21       19.8
#6     5     4    26      30.4     28.2
#7     5     6    19.7    19.7     19.7
#8     5     8    15      15.8     15.4

If we want a map solution, first group_split on 'gear' (the for loop iterates over the unique values of the 'gear' column), then map over the resulting list and group by cyl before summarising:

library(purrr)
mtcars %>% 
   group_split(gear) %>%
   map(~ .x %>%
             group_by(cyl) %>%
             summarize(min = min(mpg), max = max(mpg), mean=mean(mpg)))
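
One detail worth noting: group_split() returns a plain list of tibbles, which is why piping it straight into group_by() (as in the question) fails. That list is also unnamed, so the map() result does not say which table belongs to which gear. Here is a minimal sketch of one way to keep the labels, assuming base split() is an acceptable substitute for group_split(). The name summaries_by_gear is only for illustration.

```r
library(dplyr)
library(purrr)

# split() returns a list named by the values of gear ("3", "4", "5"),
# so each summary table stays labelled with the gear count it describes
summaries_by_gear <- mtcars %>%
  split(.$gear) %>%
  map(~ .x %>%
        group_by(cyl) %>%
        summarise(min = min(mpg), max = max(mpg), mean = mean(mpg)))

names(summaries_by_gear)   # the gear values, as character strings
summaries_by_gear[["4"]]   # summary table for the 4-gear cars only
```

Individual tables can then be pulled out by gear, or printed one after another with purrr::walk(summaries_by_gear, print).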

In addition to @akrun's answer, if you are interested in rendering the summarized data as a report, another option is tables::tabular().

library(tables)
tabular((Factor(gear) * Factor(cyl))~mpg*((n=1) + min + mean + max),data = mtcars)

...and the output:

          mpg                
 gear cyl n   min  mean  max 
 3    4    1  21.5 21.50 21.5
      6    2  18.1 19.75 21.4
      8   12  10.4 15.05 19.2
 4    4    8  21.4 26.93 33.9
      6    4  17.8 19.75 21.0
      8    0   Inf   NaN -Inf
 5    4    2  26.0 28.20 30.4
      6    1  19.7 19.70 19.7
      8    2  15.0 15.40 15.8

NOTE: unlike the dplyr solution, tabular() creates a row for every combination of the factor levels, whether or not any observations exist. That is why the output includes a row for 4 gears / 8 cylinders, with n = 0 and Inf / NaN / -Inf for min / mean / max.
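
For comparison, dplyr can also be asked to keep empty combinations. This is a small sketch rather than part of the original answers: it assumes the grouping columns are first converted to factors, and then uses the .drop = FALSE argument of count(). The name combo_counts is only for illustration.

```r
library(dplyr)

# Convert the grouping columns to factors so dplyr knows every possible level,
# then keep combinations with no observations instead of dropping them
combo_counts <- mtcars %>%
  mutate(gear = factor(gear), cyl = factor(cyl)) %>%
  count(gear, cyl, .drop = FALSE)

combo_counts
```

The result has one row for each of the nine gear / cyl combinations, including 4 gears / 8 cylinders with n = 0.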

The output object from tables::tabular() can be printed in a high quality table with knitr::kable() and enhanced with the features in the kableExtra package.

A focus on descriptive statistics

If the desired outcome is simply to print descriptive statistics for a variable, broken down by a set of grouping variables, we can also use psych::describeBy().

library(psych)
describeBy(mtcars$mpg,list(mtcars$gear,mtcars$cyl))

...and the first few rows of output:

> describeBy(mtcars$mpg,list(mtcars$gear,mtcars$cyl))

 Descriptive statistics by group 
: 3
: 4
   vars n mean sd median trimmed mad  min  max range skew kurtosis se
X1    1 1 21.5 NA   21.5    21.5   0 21.5 21.5     0   NA       NA NA
------------------------------------------------------------- 
: 4
: 4
   vars n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
X1    1 8 26.92 4.81  25.85   26.92 5.56 21.4 33.9  12.5 0.25    -1.84 1.7
------------------------------------------------------------- 

Bottom line: there are many ways to accomplish a task in R, and it's important to know how the results will be used in order to determine the "best" solution for a particular situation.
