
Calculate daily mean of big data table depending on calendar year

I get a data table from a server that shows price predictions depending on the selected month of a calendar year. Basically, data is downloaded for every month of the year. Here is an example data table:

library(data.table)

set.seed(123)
dt.data <- data.table(Date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 365),
                      'BRN Jan-2021' = rnorm(365, 2, 1), 'BRN Jan-2022' = rnorm(365, 2, 1),
                      'BRN Feb-2021' = rnorm(365, 2, 1), 'BRN Feb-2022' = rnorm(365, 2, 1),
                      'BRN Mar-2021' = rnorm(365, 2, 1), 'BRN Mar-2022' = rnorm(365, 2, 1),
                      'BRN Apr-2021' = rnorm(365, 2, 1), 'BRN Apr-2022' = rnorm(365, 2, 1),
                      'BRN May-2021' = rnorm(365, 2, 1), 'BRN May-2022' = rnorm(365, 2, 1),
                      'BRN Jun-2021' = rnorm(365, 2, 1), 'BRN Jun-2022' = rnorm(365, 2, 1),
                      'BRN Jul-2021' = rnorm(365, 2, 1), 'BRN Jul-2022' = rnorm(365, 2, 1),
                      'BRN Aug-2021' = rnorm(365, 2, 1), 'BRN Aug-2022' = rnorm(365, 2, 1),
                      'BRN Sep-2021' = rnorm(365, 2, 1), 'BRN Sep-2022' = rnorm(365, 2, 1),
                      'BRN Oct-2021' = rnorm(365, 2, 1), 'BRN Oct-2022' = rnorm(365, 2, 1),
                      'BRN Nov-2021' = rnorm(365, 2, 1), 'BRN Nov-2022' = rnorm(365, 2, 1),
                      'BRN Dec-2021' = rnorm(365, 2, 1), 'BRN Dec-2022' = rnorm(365, 2, 1),
                      check.names = FALSE)

This data table is quite small, as I only created data for the years 2021 and 2022. But there can be several calendar years, or just one calendar year.

Now I would like to calculate daily mean values (based on the Date column) for the year 2021 (i.e., the sum of all 12 values per date divided by 12, the number of months per calendar year) and save them as a column in a new data table. And then, of course, the same for 2022.

In this case, the new data table should have the following columns:

| Date | BRN Cal-2021 | BRN Cal-2022 |

where the Date column remains unchanged.

The calculation and the column names for the new data table should always be variable (depending on how many calendar years appear in dt.data). Basically, it might make sense to organize dt.data by calendar year at the beginning. But I don't really know how to keep the (daily) average calculation variable and general. Or maybe one should create an extra data table for each calendar year, then calculate the mean values, and then merge the columns with the daily mean values back into a common data table? However, this should always remain automated (depending on how many calendar years there are). Unfortunately, I have no idea how that could be done.

I hope I was able to ask my question accurately enough and that someone can help me with my problem.

Yes, it would be better to get the data into separate columns for each year. We can use pivot_longer for that and create the new columns based on the pattern in the column names. Once we have that, we can just take the mean for each Date.

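library(dplyr)

dt.data %>%
  # Split each column name into a month label and a year;
  # '.value' turns each distinct year into its own column
  tidyr::pivot_longer(cols = -Date,
               names_to = c('month', '.value'),
               names_pattern = '(.*)-(\\d+)') %>%
  group_by(Date) %>%
  # Average the monthly values per date, separately for each year column
  summarise(across(matches('^\\d+$'), ~ mean(.x, na.rm = TRUE)))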
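To see what the pattern passed to names_pattern captures, the same regular expression can be applied to a few column names with base R. This is only an illustrative sketch: the vector `cols` and the helper names `parts`, `month`, and `year` are made up for the example, and tidyr uses its own regex engine internally, although for a simple pattern like this one the groups should come out the same.

```r
# Two column names in the same style as dt.data (illustrative subset)
cols <- c('BRN Jan-2021', 'BRN Feb-2022')

# regexec() returns the full match plus one entry per capture group
parts <- regmatches(cols, regexec('(.*)-(\\d+)', cols))

# Group 1 is everything before the last hyphen, group 2 is the year
month <- vapply(parts, `[`, character(1), 2)
year  <- vapply(parts, `[`, character(1), 3)

print(data.frame(column = cols, month = month, year = year))
```

Because the second group is mapped to '.value', each distinct year ends up as its own column after pivoting, which is why the summarise step selects columns whose names consist only of digits.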

A base R option that avoids reshaping the data into long format is to use split.default. We split the columns based on the year mentioned in the column names and take the row-wise mean within each list element.

result <- cbind(dt.data[, 1], sapply(split.default(dt.data[, -1], 
      sub('.*-', '', names(dt.data)[-1])), rowMeans, na.rm = TRUE))
names(result)[-1] <- paste0('BRN_Cal-', names(result)[-1])

#           Date BRN_Cal-2021 BRN_Cal-2022
#  1: 2020-01-01     1.974847     2.272833
#  2: 2020-01-02     2.241470     2.399902
#  3: 2020-01-03     1.988883     2.372697
#  4: 2020-01-04     2.057867     2.084504
#  5: 2020-01-05     2.012305     2.049808
# ---                                     
#361: 2020-12-26     2.038167     2.161655
#362: 2020-12-27     2.308974     2.215492
#363: 2020-12-28     2.001359     2.552923
#364: 2020-12-29     2.086283     1.773254
#365: 2020-12-30     1.802871     2.107373
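The grouping is easier to check by hand on a tiny table. The following is a minimal sketch of the same split.default idea on a plain data.frame with made-up values (the object names `df`, `years`, `groups`, `means`, and `small` are hypothetical). It uses the 'BRN Cal-' prefix from the question rather than 'BRN_Cal-'.

```r
# Two dates, two months per year, with values chosen so the means are obvious
df <- data.frame(Date = as.Date('2020-01-01') + 0:1,
                 'BRN Jan-2021' = c(1, 2),   'BRN Feb-2021' = c(3, 4),
                 'BRN Jan-2022' = c(10, 20), 'BRN Feb-2022' = c(30, 40),
                 check.names = FALSE)

# Year suffix of every value column: "2021" "2021" "2022" "2022"
years <- sub('.*-', '', names(df)[-1])

# One data.frame of columns per year, then the row-wise mean of each
groups <- split.default(df[-1], years)
means  <- sapply(groups, rowMeans, na.rm = TRUE)

# Re-attach the dates and label the columns per calendar year
small <- cbind(df['Date'], means)
names(small)[-1] <- paste0('BRN Cal-', names(small)[-1])
print(small)
```

For the first date the 2021 mean is (1 + 3) / 2 = 2 and the 2022 mean is (10 + 30) / 2 = 20. Nothing in the code depends on how many years or months are present, since the groups are derived from the column names.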

