Calculate average monthly total by groups from data.table in R

I have a data.table with a row for each day over a 30-year period and a number of different variable columns. The reason for using data.table is that the .csv file I'm using is huge (approx. 1.2 million rows), as it holds 30 years' worth of data for a number of groups characterised by a column called 'Key'.

An example dataset is shown below:

Key   Date          Runoff
A     1980-01-01    2
A     1980-01-02    1
A     1981-01-01    0.1
A     1981-01-02    3
A     1982-01-01    2
A     1982-01-02    5
B     1980-01-01    1.5
B     1980-01-02    0.5
B     1981-01-01    0.3
B     1981-01-02    2
B     1982-01-01    1.5
B     1982-01-02    4

The above is a sample of two 'keys', with some data for January over three years to show what I mean. The actual dataset has hundreds of 'keys' and 30 years' worth of data for each 'key'.

What I want to do is produce an output that has the average monthly total for each month for each key, as shown below:

Key   January  February  March.... etc
A     4.36     ...       ...
B     3.26     ...       ...

i.e. the total average for January for Key A = ((2 + 1) + (0.1 + 3) + (2 + 5)) / 3
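Written out with the division applied to the whole sum, and using only the sample values above, the January figures for both keys work out as follows (the symbol \bar{T} is just shorthand introduced here for "average monthly total"):

```latex
\bar{T}_{A,\mathrm{Jan}} = \frac{(2 + 1) + (0.1 + 3) + (2 + 5)}{3}
                         = \frac{3 + 3.1 + 7}{3}
                         = \frac{13.1}{3} \approx 4.37

\bar{T}_{B,\mathrm{Jan}} = \frac{(1.5 + 0.5) + (0.3 + 2) + (1.5 + 4)}{3}
                         = \frac{2 + 2.3 + 5.5}{3}
                         = \frac{9.8}{3} \approx 3.27
```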

When I have done this analysis on a single thirty-year dataset (i.e. just one key), I have used the following code successfully:

runoff_tot_average <- rowsum(DF$Runoff, format(DF$Date, '%m')) / 30

where DF is the data frame for one 30-year dataset.

So could I please have suggestions on how to modify my code above to work with the larger dataset with many 'keys', or a completely new solution?

EDIT

The code below produces the example data above:

library(data.table)

Key <- c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02", "1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Key, Date, Runoff)

The only way I could think of doing it was in two steps. Probably not the best way, but here goes:

DT[, c("YM", "Month") := list(substr(Date, 1, 7), substr(Date, 6, 7))]
DT[, Runoff2 := sum(Runoff), by = c("Key", "YM")]
DT[, mean(Runoff2), by = c("Key", "Month")]

##   Key Month       V1
## 1:   A    01 4.366667
## 2:   B    01 3.266667

Just to show another (very similar) way:

DT[, c("year", "month") := list(year(Date), month(Date))]
DT[, Runoff2 := sum(Runoff), by=list(Key, year, month)]
DT[, mean(Runoff2), by=list(Key, month)]

Note that you don't have to create new columns, as by supports expressions as well. That is, you can use them directly in by as follows:

DT[, Runoff2 := sum(Runoff), by=list(Key, year = year(Date), month = month(Date))]

But since you need to aggregate more than once, it's better (for speed) to store them as additional columns, as @David has shown here.
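For completeness, here is a minimal sketch of the two aggregation steps chained through summary tables, with the grouping expressions placed directly in by so that DT itself is not modified. It rebuilds the sample data from the question; the names monthly_totals, monthly_avg, total and avg_total are illustrative choices made here, not part of the answers above.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  Key    = rep(c("A", "B"), each = 6),
  Date   = as.Date(rep(c("1980-01-01", "1980-01-02", "1981-01-01",
                         "1981-01-02", "1982-01-01", "1982-01-02"), 2)),
  Runoff = c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
)

# Step 1: total runoff per Key, year and month
# (year() and month() are evaluated inside by, so no helper columns are added)
monthly_totals <- DT[, .(total = sum(Runoff)),
                     by = .(Key, year = year(Date), month = month(Date))]

# Step 2: average those totals across years, per Key and month
monthly_avg <- monthly_totals[, .(avg_total = mean(total)), by = .(Key, month)]
print(monthly_avg)
```

On the sample data this gives one row per key for month 1. On the full 1.2 million row table the speed remark above still applies, so it may be worth timing this against the stored-column version.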

If you're not looking for complicated functions and just want the mean, then the following should suffice:

DT[, sum(Runoff) / length(unique(year(Date))), list(Key, month(Date))]
#   Key month       V1
#1:   A     1 4.366667
#2:   B     1 3.266667

Since you said in your question that you would be open to a completely new solution, you could try the following with dplyr:

df$Date <- as.Date(df$Date, format="%Y-%m-%d")
df$Year.Month <- format(df$Date, '%Y-%m')
df$Month <- format(df$Date, '%m')

require(dplyr)

df %>%
  group_by(Key, Year.Month, Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  ungroup() %>%
  group_by(Key, Month) %>%
  summarize(mean(Runoff))

EDIT #1 after comment by @Henrik: The same can be done as follows, since each summarize() drops the last grouping variable (here Year.Month):

df %>%
  group_by(Key, Month, Year.Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  summarize(mean(Runoff))

EDIT #2 to round things off: here is another way of doing it (the second grouping is more explicit this way), thanks to @Henrik for his comments:

df %>%
  group_by(Key, Month, Year.Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  group_by(Key, Month) %>%    # regroup by Key and Month only; group_by() replaces the existing grouping by default
  summarize(mean(Runoff))

It produces the following result:

#Source: local data frame [2 x 3]
#Groups: Key
#
#  Key Month mean(Runoff)
#1   A    01     4.366667
#2   B    01     3.266667

You can then reshape the output to match your desired output using e.g. reshape2. Suppose you stored the output of the above operation in a data.frame df2, then you could do:

require(reshape2)

df2 <- dcast(df2, Key  ~ Month, sum, value.var = "mean(Runoff)")
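If the averages were computed with data.table instead, the same wide layout can probably be reached with data.table's own dcast. The sketch below assumes the long-format result sits in a data.table called res with columns Key, month and avg_total; those names, and the month_name helper column, are hypothetical and chosen only for this illustration. The values are the January averages from the sample data.

```r
library(data.table)

# Hypothetical long-format result: one row per Key and month
# (values are the January averages from the sample data)
res <- data.table(
  Key       = c("A", "B"),
  month     = c(1L, 1L),
  avg_total = c(13.1 / 3, 9.8 / 3)
)

# Label months by name; a factor with month.name as levels records calendar order
res[, month_name := factor(month.name[month], levels = month.name)]

# One row per Key, one column per month present in the data
wide <- dcast(res, Key ~ month_name, value.var = "avg_total")
print(wide)
```

With the full dataset, the result would have one column for each month that occurs in the data rather than only January.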
