How to get summary statistics by group

I'm trying to get multiple summary statistics in R/S-PLUS, grouped by a categorical column, in one shot. I found a couple of functions, but all of them compute one statistic per call, like aggregate().

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)
mg <- aggregate(df$dt, by=list(df$group), FUN=mean)  # `by` must be a list
mg <- aggregate(df$dt, by=list(df$group), FUN=sum)

What I'm looking for is a way to get multiple statistics for the same group (mean, min, max, standard deviation, etc.) in one call. Is that doable?

1. tapply

I'll put in my two cents for tapply().

tapply(df$dt, df$group, summary)

You could write a custom function with the specific statistics you want, or format the results:

tapply(df$dt, df$group,
  function(x) format(summary(x), scientific = TRUE))
$A
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"5.900e+01" "5.975e+01" "6.100e+01" "6.100e+01" "6.225e+01" "6.300e+01" 

$B
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"6.300e+01" "6.425e+01" "6.550e+01" "6.600e+01" "6.675e+01" "7.100e+01" 

$C
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"6.600e+01" "6.725e+01" "6.800e+01" "6.800e+01" "6.800e+01" "7.100e+01" 

$D
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"5.600e+01" "5.975e+01" "6.150e+01" "6.100e+01" "6.300e+01" "6.400e+01"
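
As a minimal sketch of the custom-function route mentioned above: the helper name my_stats and the particular statistics chosen are illustrative assumptions, not part of the original answer. The per-group vectors returned by tapply() can be row-bound into a single matrix with one row per group:

```r
# Example data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# Hypothetical helper: a named vector holding the statistics of interest
my_stats <- function(x) c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))

# tapply() yields one named vector per group; rbind them into a matrix
res <- do.call(rbind, tapply(df$dt, df$group, my_stats))
res
```

Swapping other functions into my_stats changes the columns of the result without touching the tapply() call.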

2. data.table

The data.table package offers a lot of helpful and fast tools for these types of operations:

library(data.table)
setDT(df)
> df[, as.list(summary(dt)), by = group]
   group Min. 1st Qu. Median Mean 3rd Qu. Max.
1:     A   59   59.75   61.0   61   62.25   63
2:     B   63   64.25   65.5   66   66.75   71
3:     C   66   67.25   68.0   68   68.00   71
4:     D   56   59.75   61.5   61   63.00   64

The dplyr package could be a nice alternative for this problem:

library(dplyr)

df %>% 
  group_by(group) %>% 
  summarize(mean = mean(dt),
            sum = sum(dt))

To get the 1st and 3rd quartiles:

df %>% 
  group_by(group) %>% 
  summarize(q1 = quantile(dt, 0.25),
            q3 = quantile(dt, 0.75))

Using Hadley Wickham's purrr package this is quite simple. Use split to split the data frame into groups, then use map to apply the summary function to each group.

library(purrr)

df %>% split(.$group) %>% map(summary)

There are many different ways to go about this, but I'm partial to describeBy in the psych package:

library(psych)
describeBy(df$dt, df$group, mat = TRUE)

Take a look at the plyr package, specifically ddply:

library(plyr)
ddply(df, .(group), summarise, mean=mean(dt), sum=sum(dt))

After 5 long years I'm sure this answer won't receive much attention, but to make the set of options complete, here is one with data.table:

library(data.table)
setDT(df)[ , list(mean_gr = mean(dt), sum_gr = sum(dt)) , by = .(group)]
#   group mean_gr sum_gr
#1:     A      61    244
#2:     B      66    396
#3:     C      68    408
#4:     D      61    488 

The psych package has a great option for grouped summary stats:

library(psych)
    
describeBy(df$dt, group = df$group)

This produces lots of useful stats, including mean, median, range, sd, and se.

Besides describeBy, the doBy package is another option. It provides much of the functionality of SAS PROC SUMMARY. Details: http://www.statmethods.net/stats/descriptives.html

While some of the other approaches work, this is pretty close to what you were doing and only uses base R. If you know the aggregate command, this may be more intuitive.

with( df , aggregate( dt , by=list(group) , FUN=summary)  )

Not sure why the popular skimr package hasn't been brought up. Its skim() function was meant to replace the base R summary() and supports dplyr grouping:

library(dplyr)
library(skimr)

starwars %>%
  group_by(gender) %>%
  skim()

#> ── Data Summary ────────────────────────
#>                            Values    
#> Name                       Piped data
#> Number of rows             87        
#> Number of columns          14        
#> _______________________              
#> Column type frequency:               
#>   character                7         
#>   list                     3         
#>   numeric                  3         
#> ________________________             
#> Group variables            gender    
#> 
#> ── Variable type: character ──────────────────────────────────────────────────────
#>    skim_variable gender    n_missing complete_rate   min   max empty n_unique
#>  1 name          feminine          0         1         3    18     0       17
#>  2 name          masculine         0         1         3    21     0       66
#>  3 name          <NA>              0         1         8    14     0        4
#>  4 hair_color    feminine          0         1         4     6     0        6
#>  5 hair_color    masculine         5         0.924     4    13     0        9
#>  6 hair_color    <NA>              0         1         4     7     0        4
#> # [...]
#> 
#> ── Variable type: list ───────────────────────────────────────────────────────────
#>   skim_variable gender    n_missing complete_rate n_unique min_length max_length
#> 1 films         feminine          0             1        9          1          5
#> 2 films         masculine         0             1       24          1          7
#> 3 films         <NA>              0             1        3          1          2
#> 4 vehicles      feminine          0             1        3          0          1
#> 5 vehicles      masculine         0             1        9          0          2
#> 6 vehicles      <NA>              0             1        1          0          0
#> # [...]
#> 
#> ── Variable type: numeric ────────────────────────────────────────────────────────
#>   skim_variable gender    n_missing complete_rate  mean     sd    p0   p25   p50
#> 1 height        feminine          1         0.941 165.   23.6     96 162.  166. 
#> 2 height        masculine         4         0.939 177.   37.6     66 171.  183  
#> 3 height        <NA>              1         0.75  181.    2.89   178 180.  183  
#> # [...]

I would also recommend gtsummary (written by Daniel D. Sjoberg et al). You can generate publication-ready or presentation-ready tables with the package. A gtsummary solution to the example given in the question would be:

library(tidyverse)
library(gtsummary)

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)


tbl_summary(df, 
            by=group,
            type = all_continuous() ~ "continuous2",
            statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}")) %>% 
  add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))

which then gives you the output below:

Characteristic    A, N = 4       B, N = 6       C, N = 6       D, N = 8
dt
  Mean (SD)       61.0 (1.8)     66.0 (2.8)     68.0 (1.7)     61.0 (2.6)
  Median (IQR)    61.0 (2.5)     65.5 (2.5)     68.0 (0.8)     61.5 (3.2)
  Min- Max        59.0 - 63.0    63.0 - 71.0    66.0 - 71.0    56.0 - 64.0

You can also export the table as a Word document by doing the following:

Table1 <-  tbl_summary(df, 
                by=group,
                type = all_continuous() ~ "continuous2",
                statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}")) %>% 
      add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))

tmp1 <- "~path/name.docx"

Table1 %>% 
  as_flex_table() %>% 
  flextable::save_as_docx(path=tmp1)

You can use it for regression outputs as well. See the package reference manual and the package webpage for further insights:

https://cran.r-project.org/web/packages/gtsummary/index.html
https://www.danieldsjoberg.com/gtsummary/index.html

This may also work:

spl <- split(mtcars, mtcars$cyl)
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[,3:6], 2, summary)))
list.of.summaries

First, it depends on your version of R. If you're past 2.11, you can use aggregate with functions that return multiple results (summary, for instance, or your own function). If not, you can use the answer made by Justin.
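
A rough sketch of what that can look like with a user-defined function: the helper name multi_stats and the statistics it returns are assumptions made for illustration. When the function returns several values, aggregate() stores them in a single matrix column, so the sketch also flattens that column into ordinary ones:

```r
# Example data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# Hypothetical helper returning several statistics at once
multi_stats <- function(x) c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))

# One call, several statistics per group (formula interface of aggregate)
agg <- aggregate(dt ~ group, data = df, FUN = multi_stats)

# agg$dt is a matrix column; expand it into dt.mean, dt.sd, dt.min, dt.max
agg_flat <- do.call(data.frame, agg)
agg_flat
```

The flattening step is optional, but without it the result prints as if it had five columns while actually holding only two, which tends to surprise later code that indexes by column name.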

With more recent (>1.0) versions of dplyr you can do so with:

iris %>% 
  group_by(Species)  %>% 
  summarise(as_tibble(rbind(summary(Sepal.Length))))
