Calculate average monthly total by group with data.table in R

Question

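I have a data.table with one row per day over a 30-year period and a number of different variable columns. The reason for using data.table is that the .csv file I am working with is huge (approx. 1.2 million rows), because it holds 30 years' worth of data for each of many groups, which are identified by a column called 'Key'.

An example dataset is shown below: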

Key   Date          Runoff
A     1980-01-01    2
A     1980-01-02    1
A     1981-01-01    0.1
A     1981-01-02    3
A     1982-01-01    2
A     1982-01-02    5
B     1980-01-01    1.5
B     1980-01-02    0.5
B     1981-01-01    0.3
B     1981-01-02    2
B     1982-01-01    1.5
B     1982-01-02    4

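The above is a sample of two 'Keys', with some January data to show what I mean. The actual dataset has hundreds of 'Keys' and 30 years' worth of data for each 'Key'.

What I would like to do is produce an output that has the average monthly total for each month and each Key, like below: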

Key   January  February  March.... etc
A     4.36     ...       ...
B     3.26     ...       ...

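i.e. the average January total for Key A = ((2 + 1) + (0.1 + 3) + (2 + 5)) / 3 = 13.1 / 3 ≈ 4.367

When I have done this analysis on a single 30-year dataset (i.e. only one Key), I have successfully used the following code to achieve this: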

runoff_tot_average <- rowsum(DF$Runoff, format(DF$Date, '%m')) / 30

where DF is the data frame for a single 30-year dataset.
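To make the single-key approach concrete, here is a minimal sketch that applies it to the Key A rows of the sample data. The name n_years is introduced here for illustration only; it replaces the hard-coded 30, since the sample spans just three years.

```r
# Key A rows from the sample data (illustrative subset)
DF <- data.frame(
  Date = as.Date(c("1980-01-01", "1980-01-02", "1981-01-01",
                   "1981-01-02", "1982-01-01", "1982-01-02")),
  Runoff = c(2, 1, 0.1, 3, 2, 5)
)

# Number of distinct years in the data (assumption: replaces the fixed 30)
n_years <- length(unique(format(DF$Date, "%Y")))

# Sum Runoff per calendar month across all years, then divide by the number of years
runoff_tot_average <- rowsum(DF$Runoff, format(DF$Date, "%m")) / n_years
runoff_tot_average
```

The result is a one-column matrix whose row names are the two-digit month numbers.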

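Could I please ask for suggestions on how to modify the above code to work with the larger dataset containing many 'Keys', or for a completely new solution?

EDIT

The following code produces the example data above: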

library(data.table)

Key <- c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02", "1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Key, Date, Runoff)

Answer 1

I can only think of doing this in two steps. It is probably not the best way, but here goes:

DT[, c("YM", "Month") := list(substr(Date, 1, 7), substr(Date, 6, 7))]
DT[, Runoff2 := sum(Runoff), by = c("Key", "YM")]
DT[, mean(Runoff2), by = c("Key", "Month")]

##   Key Month       V1
## 1:   A    01 4.366667
## 2:   B    01 3.266667

Just to show another (very similar) way:

DT[, c("year", "month") := list(year(Date), month(Date))]
DT[, Runoff2 := sum(Runoff), by=list(Key, year, month)]
DT[, mean(Runoff2), by=list(Key, month)]

Note that you don't have to create new columns, because by supports expressions as well. That is, you can use them directly in by as follows:

DT[, Runoff2 := sum(Runoff), by=list(Key, year = year(Date), month = month(Date))]

But since you need to aggregate more than once, it is better (for speed) to store them as additional columns, as @David has shown here.
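For a one-off result where the helper columns are not needed afterwards, the two steps can also be chained with expressions in by, which leaves DT unmodified. This is only a sketch under that assumption; the column names total and avg_total are made up for illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  Key = rep(c("A", "B"), each = 6),
  Date = as.Date(rep(c("1980-01-01", "1980-01-02", "1981-01-01",
                       "1981-01-02", "1982-01-01", "1982-01-02"), 2)),
  Runoff = c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
)

# Step 1: total per Key, year and month.
# Step 2: mean of those totals per Key and month.
res <- DT[, .(total = sum(Runoff)),
          by = .(Key, year = year(Date), month = month(Date))
         ][, .(avg_total = mean(total)), by = .(Key, month)]
res
```

Because nothing is assigned with :=, DT keeps its original three columns; the trade-off is that year and month are recomputed on every call.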

Answer 2

If you are not looking for anything more complicated and just want the mean, the following should suffice:

DT[, sum(Runoff) / length(unique(year(Date))), list(Key, month(Date))]
#   Key month       V1
#1:   A     1 4.366667
#2:   B     1 3.266667
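The denominator in the one-liner above is the number of distinct years that have rows in each Key/month group, so it should give the same result as averaging the per-year totals. The sketch below checks this on the sample data; uniqueN() is data.table's shorthand for length(unique(...)), and the names one_step and two_step are illustrative.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  Key = rep(c("A", "B"), each = 6),
  Date = as.Date(rep(c("1980-01-01", "1980-01-02", "1981-01-01",
                       "1981-01-02", "1982-01-01", "1982-01-02"), 2)),
  Runoff = c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
)

# One step: monthly sum divided by the number of distinct years in the group
one_step <- DT[, .(avg_total = sum(Runoff) / uniqueN(year(Date))),
               by = .(Key, month = month(Date))]

# Two steps: per-year monthly totals, then their mean
two_step <- DT[, .(total = sum(Runoff)),
               by = .(Key, year = year(Date), month = month(Date))
              ][, .(avg_total = mean(total)), by = .(Key, month)]

# Compare the two sets of averages
all.equal(one_step$avg_total, two_step$avg_total)
```

Note that neither version divides by a fixed 30: a year with no rows at all for a given Key and month is simply not counted.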

Answer 3

Since you said in your question that you would be open to a completely new solution, you could try the following with dplyr:

df$Date <- as.Date(df$Date, format="%Y-%m-%d")
df$Year.Month <- format(df$Date, '%Y-%m')
df$Month <- format(df$Date, '%m')

require(dplyr)

df %>%
  group_by(Key, Year.Month, Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  ungroup() %>%
  group_by(Key, Month) %>%
  summarize(mean(Runoff))

Edit #1, after a comment by @Henrik: the same can be done with:

df %>%
  group_by(Key, Month, Year.Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  summarize(mean(Runoff))

Edit #2, to make things clearer: here is another way (the second grouping is more explicit), thanks to @Henrik's comment:

df %>%
  group_by(Key, Month, Year.Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  group_by(Key, Month, .add = FALSE) %>%    # now grouping by Key and Month, but not Year.Month (the argument was called add before dplyr 1.0.0)
  summarize(mean(Runoff))

It produces the following result:

#Source: local data frame [2 x 3]
#Groups: Key
#
#  Key Month mean(Runoff)
#1   A    01     4.366667
#2   B    01     3.266667

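Then, you can use e.g. reshape2 to reshape the result so that it matches the desired output. Assuming the output of the operation above is stored in a data.frame df2, you could do: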

require(reshape2)

df2 <- dcast(df2, Key  ~ Month, sum, value.var = "mean(Runoff)")
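If you would rather stay within data.table, its own dcast() can produce the same wide layout. The following is a sketch on the sample data, not part of the original answer; the names monthly, avg_total, month_name and wide are illustrative, and month.name is the built-in vector of English month names.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  Key = rep(c("A", "B"), each = 6),
  Date = as.Date(rep(c("1980-01-01", "1980-01-02", "1981-01-01",
                       "1981-01-02", "1982-01-01", "1982-01-02"), 2)),
  Runoff = c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
)

# Average monthly total per Key (long format)
monthly <- DT[, .(total = sum(Runoff)),
              by = .(Key, year = year(Date), month = month(Date))
             ][, .(avg_total = mean(total)), by = .(Key, month)]

# Label months by name; the factor keeps calendar order for the columns
monthly[, month_name := factor(month.name[month], levels = month.name)]

# One row per Key, one column per month present in the data
wide <- dcast(monthly, Key ~ month_name, value.var = "avg_total")
wide
```

With the full 30-year dataset this should yield one column per calendar month; the sample only contains January.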

3 answers

Answer 1: score 11, accepted, 2014-05-13 09:48:33

Answer 2: score 6, 2014-05-13 15:08:11

Answer 3: score 4, 2014-05-13 09:23:31
