How to aggregate data based on values in a column in R
I am working on a project for work and am struggling to summarize data correctly; I am worried that I am approaching the problem the wrong way. Basically, I have a dataset that looks like this:
Month.Year Code Count
8/2017 1 1
2/2018 1 1
4/2018 2 1
4/2018 2 1
5/2020 3 1
5/2020 3 1
.
.
.
I need to summarize this data so that I can create grouped barplots with dates being the groups and the codes being the subgroups.
In this data set we have a date column by Month/Year, a categorical Code (a value between 1 and 3), and a "Count" column that I created which is just the value 1 for each observation (I'm hoping this makes it easier to "sum" the number of observations).
The goal is to summarize this data at a Month and Code level for each year. In other words, I would like to have a different dataset for each year that looks something like this:
## Dataset for Year 2018
Month Code Value
1 1 24
1 2 13
1 3 0
2 1 0
2 2 5
2 3 22
.
.
.
## Dataset for Year 2019
Month Code Value
1 1 15
1 2 2
1 3 54
2 1 0
2 2 0
2 3 21
.
.
.
Split the data set by year with split() and then aggregate() each sub-data.frame in an lapply() loop.
Use sub() to keep only the year to be used in the split() instruction.
df1 <- read.table(text = "
Month.Year Code Count
8/2017 1 1
2/2018 1 1
4/2018 2 1
4/2018 2 1
5/2020 3 1
5/2020 3 1
", header = TRUE)
df1
#> Month.Year Code Count
#> 1 8/2017 1 1
#> 2 2/2018 1 1
#> 3 4/2018 2 1
#> 4 4/2018 2 1
#> 5 5/2020 3 1
#> 6 5/2020 3 1
sub(".*/", "", df1$Month.Year)
#> [1] "2017" "2018" "2018" "2018" "2020" "2020"
Created on 2022-03-07 by the reprex package (v2.0.1)
Now save the split() result and loop over it to compute the sums.
df1_year <- split(df1, sub(".*/", "", df1$Month.Year))
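# \(x) is the R >= 4.1 shorthand for function(x)
df1_year <- lapply(df1_year, \(x) {
  # drop the "/year" suffix so only the month remains
  x$Month.Year <- sub("/\\d+$", "", x$Month.Year)
  names(x)[1] <- "Month"
  # sum Count for every Month/Code combination present in this year
  aggregate(Count ~ ., data = x, sum)
})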
df1_year
#> $`2017`
#> Month Code Count
#> 1 8 1 1
#>
#> $`2018`
#> Month Code Count
#> 1 2 1 1
#> 2 4 2 2
#>
#> $`2020`
#> Month Code Count
#> 1 5 3 2
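Note that aggregate() only returns the Month/Code combinations that actually occur, whereas the desired output in the question also lists combinations with a value of 0. A minimal sketch of one way to get those rows, assuming months run from 1 to 12 and codes from 1 to 3 (the names parts, full_year and the Value column are illustrative and not part of the original answer): convert Month and Code to factors with fixed levels and tabulate with xtabs(), which keeps unused levels by default.

```r
df1 <- read.table(text = "
Month.Year Code Count
8/2017 1 1
2/2018 1 1
4/2018 2 1
4/2018 2 1
5/2020 3 1
5/2020 3 1
", header = TRUE)

# Split "month/year" into its two parts.
parts <- strsplit(as.character(df1$Month.Year), "/", fixed = TRUE)
df1$Year  <- sapply(parts, `[`, 2)

# Assumption: months are 1..12 and codes are 1..3; fixing the factor
# levels makes the unobserved combinations show up with a sum of 0.
df1$Month <- factor(as.integer(sapply(parts, `[`, 1)), levels = 1:12)
df1$Code  <- factor(df1$Code, levels = 1:3)

full_year <- lapply(split(df1, df1$Year), function(x) {
  # xtabs() sums Count over every Month x Code level pair
  res <- as.data.frame(xtabs(Count ~ Month + Code, data = x),
                       responseName = "Value")
  # order by Month, then Code, as in the layout shown in the question
  res[order(res$Month, res$Code), ]
})

head(full_year[["2018"]])
```

Each list member then has 36 rows (12 months times 3 codes), most of them 0 for this small sample.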
The members of the result list can be extracted with the standard extraction operators.
df1_year[['2017']] # by quoted name, '2017'
df1_year[[1]] # equivalent, 1st member
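Since the stated goal is a grouped barplot with dates as groups and codes as subgroups, here is a hedged sketch of how one year could be plotted with base R. It is not part of the original answer; the names year, month and counts_2018 are illustrative. barplot() with beside = TRUE draws one group per column of the matrix and one bar per row, so the table is built as Code by Month.

```r
df1 <- read.table(text = "
Month.Year Code Count
8/2017 1 1
2/2018 1 1
4/2018 2 1
4/2018 2 1
5/2020 3 1
5/2020 3 1
", header = TRUE)

year  <- sub(".*/", "", df1$Month.Year)               # part after the slash
month <- as.integer(sub("/.*", "", df1$Month.Year))   # part before the slash

# Code x Month table of summed counts for a single year
d2018 <- cbind(df1, Month = month)[year == "2018", ]
counts_2018 <- xtabs(Count ~ Code + Month, data = d2018)

# columns (months) become the groups, rows (codes) the bars within a group
barplot(counts_2018, beside = TRUE, legend.text = TRUE,
        xlab = "Month", ylab = "Count", main = "2018")
```

Only the months and codes present in 2018 appear here; fixing factor levels before tabulating would add empty groups for the rest.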