Using a numerical variable as a grouping variable to get summary statistics

I have data as follows:

library(data.table)
dat <- fread("total women young
              1       0      0
              1       1      1
              1       0      1
              2       1      1
              2       2      1
              2       2      1
              3       1      2
              3       2      3
              3       2      3
              4       4      2
              4       4      3
              4       3      3
              5       5      2
              5       2      3
              5       5      3
              10       4      2
              10       4      3
              20       5      3
             100      10     20")

I would like to create six categories for the variable total:

1,2,3,4,5 and over 5.

I would like to count the observations per category of total in count. sum_tot_count would simply be these multiplied, that is, the sum of total within the category. And women and young are the average amounts of women and young people in that group: the group sums of women and young divided by sum_tot_count.

Desired output:

            total   count  sum_tot_count  women   young
            1       3      3              0.33    0.66
            2       3      6              5/6     0.5
            3       3      9              5/9     8/9
            4       3      12             11/12   8/12
            5       3      15             12/15   8/15
            over 5  4      140            23/140  28/140

I am having some trouble figuring out where to start.

Could someone point me in the right direction?

Does this work:

library(dplyr)
dat %>% mutate(tot = if_else(total > 5, 'over 5', as.character(total))) %>% 
      group_by(tot) %>% summarise(count = n(), sum_tot_count = sum(total), women = sum(women)/sum(total), young = sum(young)/sum(total))
# A tibble: 6 × 5
  tot    count sum_tot_count women young
  <chr>  <int>         <int> <dbl> <dbl>
1 1          3             3 0.333 0.667
2 2          3             6 0.833 0.5  
3 3          3             9 0.556 0.889
4 4          3            12 0.917 0.667
5 5          3            15 0.8   0.533
6 over 5     4           140 0.164 0.2  

With cut:

dat %>% 
  group_by(cutGroup = cut(total, breaks = c(1:6, Inf), labels = c(1:5, "over 5"), include.lowest = TRUE, right = FALSE)) %>% 
  summarise(count = n(),
            sum_tot_count = sum(total),
            women = sum(women) / sum(total),
            young = sum(young) / sum(total))     
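
To see how these breaks assign values to categories, here is a small base R check. It is only an illustration, not part of the original answer, and x is a hypothetical vector rather than the question's data. With right = FALSE the intervals are [1,2), [2,3), ..., [5,6), [6,Inf], so this binning relies on total being a whole number (a value such as 5.5 would be labelled "5").

```r
# Hypothetical values chosen to hit the edges of the categories
x <- c(1, 2, 5, 6, 10, 100)

# Same breaks and labels as in the answer above
bins <- cut(x, breaks = c(1:6, Inf), labels = c(1:5, "over 5"),
            include.lowest = TRUE, right = FALSE)

# Show which label each value received
as.character(bins)
```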

A data.table solution. The key is using cut(), as in the other answers; after that, basic data.table syntax, as in "Use data.table to count and aggregate / summarize a column", will get you the rest of the way:

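# Bin total into 1..5 and "over 5"; breaks sit at 0.5, 1.5, ..., 5.5, Inf
dat[, cat := cut(total, breaks = 0.5 + c(0:5, Inf), labels = c(1:5, "over 5"))]
# Aggregate per category; .N is the number of rows in each group
dat[, .(count = .N,
        total = sum(total),
        women = sum(women) / sum(total),
        young = sum(young) / sum(total)),
    by = cat]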
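
As a cross-check that needs neither package, the same summary can be sketched in base R. This is a minimal sketch under a few assumptions: the question's data is retyped as a plain data.frame, and grp, sums and res are names chosen here for illustration. It reuses the 0.5-offset breaks from the data.table answer.

```r
# The question's data, rebuilt as a plain data.frame
dat <- data.frame(
  total = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 10, 10, 20, 100),
  women = c(0, 1, 0, 1, 2, 2, 1, 2, 2, 4, 4, 3, 5, 2, 5, 4, 4, 5, 10),
  young = c(0, 1, 1, 1, 1, 1, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 20)
)

# grp is a helper column for this sketch: categories 1..5 and "over 5"
dat$grp <- cut(dat$total, breaks = 0.5 + c(0:5, Inf), labels = c(1:5, "over 5"))

# Sum each column within a category, then derive the ratios from those sums
sums <- aggregate(cbind(total, women, young) ~ grp, data = dat, FUN = sum)
res <- data.frame(
  total = sums$grp,
  count = as.vector(table(dat$grp)),   # observations per category
  sum_tot_count = sums$total,
  women = sums$women / sums$total,
  young = sums$young / sums$total
)
res
```

The ratios should agree with the dplyr output printed earlier, for example 8/12 for young in category 4.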
