How to aggregate data based on values in a column in R

I am working on a project for work and am struggling to summarize my data correctly; I am worried that I am approaching the problem the wrong way. Basically, I have a dataset that looks like this:

Month.Year Code Count
8/2017     1    1 
2/2018     1    1
4/2018     2    1
4/2018     2    1
5/2020     3    1
5/2020     3    1
.
.
.

I need to summarize this data so that I can create grouped bar plots, with the dates as the groups and the codes as the subgroups.

The data set has a date column in Month/Year format, a categorical Code (a value between 1 and 3), and a "Count" column that I created, which is just the value 1 for each observation (I'm hoping this makes it easier to sum the number of observations).

The goal is to summarize this data at the Month and Code level for each year. In other words, I would like a separate dataset for each year that looks something like this:

## Dataset for Year 2018
Month Code Value
1     1    24  
1     2    13  
1     3    0
2     1    0
2     2    5
2     3    22
.
.
.
## Dataset for Year 2019
Month Code Value
1     1    15  
1     2    2  
1     3    54
2     1    0
2     2    0
2     3    21
.
.
.

Split the data set by year with split(), then aggregate() each sub-data.frame in an lapply() loop.

Use sub() to keep only the year, which serves as the grouping value in the split() call.

df1 <- read.table(text = "
Month.Year Code Count
8/2017     1    1 
2/2018     1    1
4/2018     2    1
4/2018     2    1
5/2020     3    1
5/2020     3    1
", header = TRUE)
df1
#>   Month.Year Code Count
#> 1     8/2017    1     1
#> 2     2/2018    1     1
#> 3     4/2018    2     1
#> 4     4/2018    2     1
#> 5     5/2020    3     1
#> 6     5/2020    3     1

sub(".*/", "", df1$Month.Year)
#> [1] "2017" "2018" "2018" "2018" "2020" "2020"

Created on 2022-03-07 by the reprex package (v2.0.1)
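As a side note, the same "month/year" strings can be split into two integer columns, which keeps months in calendar order (as text, "10" sorts before "2"). The following is only a sketch: the data frame name dat and the new columns Month and Year are illustrative names, not part of the original answer.

```r
# Illustrative copy of the sample data (the name `dat` is an assumption).
dat <- data.frame(
  Month.Year = c("8/2017", "2/2018", "4/2018", "4/2018", "5/2020", "5/2020"),
  Code       = c(1, 1, 2, 2, 3, 3),
  Count      = 1
)

# Split each "month/year" string at the literal slash.
parts <- strsplit(as.character(dat$Month.Year), "/", fixed = TRUE)

# First piece is the month, second piece is the year; store both as integers.
dat$Month <- as.integer(sapply(parts, `[`, 1))
dat$Year  <- as.integer(sapply(parts, `[`, 2))

dat
```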

Now save the split result and loop over it to compute the sums.

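# one data.frame per year, named by the year
df1_year <- split(df1, sub(".*/", "", df1$Month.Year))
df1_year <- lapply(df1_year, \(x) {
  # drop the "/year" suffix so that only the month remains
  x$Month.Year <- sub("/\\d+$", "", x$Month.Year)
  names(x)[1] <- "Month"
  # sum Count for every Month/Code combination present in this year
  aggregate(Count ~ ., data = x, sum)
})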

df1_year
#> $`2017`
#>   Month Code Count
#> 1     8    1     1
#> 
#> $`2018`
#>   Month Code Count
#> 1     2    1     1
#> 2     4    2     2
#> 
#> $`2020`
#>   Month Code Count
#> 1     5    3     2

Created on 2022-03-07 by the reprex package (v2.0.1)
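Note that aggregate() only returns the Month/Code combinations that actually occur, while the desired output in the question also lists combinations with a value of 0. One possible way to get those rows, sketched here under the assumption that months run from 1 to 12 and codes from 1 to 3, is to count with table() on factors whose levels are fixed in advance. The names dat, counts_by_year and Value are illustrative, and the Count column is not needed because table() counts rows directly.

```r
# Illustrative copy of the sample data (the name `dat` is an assumption).
dat <- data.frame(
  Month.Year = c("8/2017", "2/2018", "4/2018", "4/2018", "5/2020", "5/2020"),
  Code       = c(1, 1, 2, 2, 3, 3)
)

# Fixing the factor levels makes table() report every month/code pair,
# including the pairs that never occur in a given year.
month <- factor(as.integer(sub("/.*", "", dat$Month.Year)), levels = 1:12)
code  <- factor(dat$Code, levels = 1:3)
year  <- sub(".*/", "", dat$Month.Year)

counts_by_year <- lapply(
  split(data.frame(Month = month, Code = code), year),
  function(x) {
    # Cross-tabulate and turn the table back into a long data.frame.
    out <- as.data.frame(table(Month = x$Month, Code = x$Code),
                         responseName = "Value")
    # Sort by month, then code, as in the question's desired layout.
    out <- out[order(out$Month, out$Code), ]
    row.names(out) <- NULL
    out
  }
)

head(counts_by_year[["2018"]])
```

Each list element then has 36 rows (12 months times 3 codes), with 0 wherever a combination has no observations.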

The members of the result list can be extracted with the standard extraction operators.

df1_year[['2017']]   # by quoted name, '2017'
df1_year[[1]]        # equivalent, 1st member
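Since the stated goal is a grouped bar plot, here is a minimal base-graphics sketch of how one year's summary could be plotted. The data frame year_2018 is a hand-written stand-in shaped like one member of the result list, and the axis labels and title are assumptions chosen for illustration.

```r
# Hypothetical per-year summary, shaped like one member of the result list.
year_2018 <- data.frame(
  Month = c(2, 4),
  Code  = c(1, 2),
  Count = c(1, 2)
)

# xtabs() pivots the long table into a Code-by-Month matrix and
# fills the combinations that are absent with 0.
m <- xtabs(Count ~ Code + Month, data = year_2018)

# beside = TRUE draws one cluster of bars per column (month),
# with one bar per row (code) inside each cluster.
barplot(m, beside = TRUE, legend.text = TRUE,
        xlab = "Month", ylab = "Count", main = "2018",
        args.legend = list(title = "Code"))
```

Putting Code in the rows and Month in the columns is what makes the months the groups and the codes the subgroups; swapping the two terms in the formula would reverse that.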
