Splitting irregular time series into regular monthly averages - R
In order to establish seasonal effects on energy use, I need to align the energy use information that I have from a billing database with monthly temperatures.
I'm working with a billing dataset whose bills have varying lengths and start and end dates, and I'd like to obtain the monthly average for each account within each month. For example, I have a billing database with the following characteristics:
acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35
I would like to figure out how to coerce these somewhat irregular time series (for each account) to get the average amount per day within each month spanned by each bill, such that:
acct amount begin end days avgamtpday
1 2242 11349 2009-10-01 2009-10-31 31 X
2 2242 12252 2009-11-01 2009-11-30 30 X
3 2242 21774 2009-12-01 2009-12-31 31 X
4 2242 18293 2010-01-01 2010-01-31 31 X
4 2242 18293 2010-02-01 2010-02-28 28 X
5 2243 27217 2009-10-01 2009-10-31 31 X
6 2243 117 2009-11-01 2009-11-30 30 X
7 2243 14543 2009-12-01 2009-12-31 31 X
7 2243 14543 2010-01-01 2010-01-31 31 X
I'm fairly agnostic as to which tool does this, since I only have to do it once.
An additional wrinkle is that the table is about 150,000 rows long, which is not very big by most standards, but big enough to make a loop solution in R difficult. I've investigated the zoo, xts, and tempdisagg packages in R. I started writing a really ugly loop that would split each bill, create one row for each month within an existing bill, and then tapply() to summarize by account and month, but honestly, I couldn't see how to do it efficiently.
In MySQL, I've tried this:
create or replace view v3 as select 1 n union all select 1 union all select 1;
create or replace view v as select 1 n from v3 a, v3 b union all select 1;
set @n = 0;
drop table if exists calendar;
create table calendar(dt date primary key);
insert into calendar
select cast('2008-1-1' + interval @n:=@n+1 day as date) as dt from v a, v b, v c, v d, v e, v;
select acct, amount, begin, end, billAmtPerDay, sum(billAmtPerDay) MonthAmt, count(*) Days, sum(billAmtPerDay)/count(*) AverageAmtPerDay, year(dt), month(dt)
FROM (
select *, amount/days billAmtPerDay
from bills b
inner join calendar c on dt between begin and end and begin <> dt) x
group by acct, amount, begin, end, billAmtPerDay, year(dt), month(dt);
But for reasons I don't understand, my server doesn't like this table, and gets hung up on the inner join, even when I stage the different calculations. I'm investigating whether there are any temporary memory limits on it.
Thanks!
Here's a start using data.table:
billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)
require(data.table)
DT = as.data.table(billdata)
First, change the type of columns begin and end to dates. Unlike data.frame, this doesn't copy the entire dataset.
DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
Then find the time span, find the prevailing bill for each day, and aggregate.
alldays = DT[,seq(min(begin),max(end),by="day")]
setkey(DT, acct, begin)
DT[CJ(unique(acct),alldays),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429
I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.
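For readers who have not met rolling joins, the "prevailing bill for each day" lookup can be mimicked in base R with findInterval(). This is only an illustrative sketch for a single account (2242), not what data.table does internally; the variable names are mine, and each bill is assumed to cover its begin date but not its end date.

```r
# Bills for account 2242, copied from the question
begin  <- as.Date(c("2009-10-06", "2009-11-04", "2009-12-04", "2010-01-08"))
end    <- as.Date(c("2009-11-04", "2009-12-04", "2010-01-08", "2010-02-05"))
amount <- c(11349, 12252, 21774, 18293)
perday <- amount / as.numeric(end - begin)

# Every day from the first begin date up to (not including) the last end date
days <- seq(min(begin), max(end) - 1, by = "day")

# For each day, findInterval() gives the index of the last begin date that is
# <= that day, i.e. the prevailing bill (begin must be sorted ascending)
bill <- findInterval(as.numeric(days), as.numeric(begin))

# Average the prevailing per-day amount within each calendar month
monthly <- tapply(perday[bill], format(days, "%Y-%m"), mean)
print(round(monthly, 5))
```

Because the day sequence stops at the account's own last bill, this sketch does not produce a spurious trailing month.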
I say it's a hint because it's not quite correct. Notice row 10 is repeated because account 2243 doesn't stretch into 2010-02, unlike account 2242. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.
See if speed is acceptable on the above, and we can go from there.
It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3. I'm using v1.8.3.
"Internal" error message when combining join containing missing groups and group by is fixed, #2162. For example: X[Y,.N,by=NonJoinColumn] where Y contains some rows that don't match to X. This bug could also result in a seg fault.
Let me know and we can either work around it, or upgrade to 1.8.3 from R-Forge.
Btw, nice example data. That made it quicker to answer.
Here's the full answer alluded to above. It's a bit tricky, I have to admit, as it combines several features of data.table. This should work in 1.8.2 as it happens, but I've only tested it in 1.8.3.
DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
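As a sanity check on the table above, the 2009-12 figure for account 2242 can be reproduced by hand. The sketch below assumes each bill covers its begin date but not its end date, so December 1 to 3 fall under the bill that began on 2009-11-04 and December 4 to 31 under the bill that began on 2009-12-04; the variable names are mine.

```r
# Per-day rates of the two bills that overlap December 2009 (account 2242)
rate_nov_bill <- 12252 / 30   # bill from 2009-11-04 to 2009-12-04
rate_dec_bill <- 21774 / 35   # bill from 2009-12-04 to 2010-01-08

# 3 December days at the first rate, 28 at the second, 31 days in total
dec_avg <- (3 * rate_nov_bill + 28 * rate_dec_bill) / 31
cat(sprintf("%.5f\n", dec_avg))   # prints 601.43226
```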
Here is one way to do it:
billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)
# First, convert the "begin" and "end" columns to dates:
billdata$begin <- as.Date(billdata$begin, format="%Y-%m-%d")
billdata$end <- as.Date(billdata$end, format="%Y-%m-%d")
# Then create a column with the amount per day over the billing period:
billdata$avg_on_period <- billdata$amount/billdata$days
# Then expand each bill into one row per day (begin and end dates both included):
temp <- do.call(rbind, lapply(seq_len(nrow(billdata)), function(i) {
  X <- billdata[i,]
  list_day <- seq(X$begin, X$end, by="day")
  data.frame(acct=X$acct, month=format(list_day, "%Y-%m"),
             day=format(list_day, "%d"), avg=X$avg_on_period)
}))
# And finally average the daily values within each account and month:
output <- aggregate(temp$avg, by=list(temp$month, temp$acct), FUN=mean)
colnames(output) <- c("Month","Account","Average per day")
output
Month Account Average per day
1 2009-10 2242 391.34483
2 2009-11 2242 406.69448
3 2009-12 2242 595.40000
4 2010-01 2242 645.51964
5 2010-02 2242 653.32143
6 2009-10 2243 938.51724
7 2009-11 2243 97.36172
8 2009-12 2243 364.06250
9 2010-01 2243 415.51429
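With about 150,000 bills, the per-bill expansion can also be done without any explicit loop. The following is a minimal base R sketch under my own assumptions: monthly_avg() and its include_end flag are hypothetical names, not part of any package. With include_end = FALSE each bill covers its begin date but not its end date; with include_end = TRUE both endpoints are counted, so a boundary date such as 2009-12-04 is attributed to two bills.

```r
# Example bills copied from the question
bills <- data.frame(
  acct   = c(2242, 2242, 2242, 2242, 2243, 2243, 2243),
  amount = c(11349, 12252, 21774, 18293, 27217, 117, 14543),
  begin  = as.Date(c("2009-10-06", "2009-11-04", "2009-12-04", "2010-01-08",
                     "2009-10-06", "2009-11-04", "2009-12-04")),
  end    = as.Date(c("2009-11-04", "2009-12-04", "2010-01-08", "2010-02-05",
                     "2009-11-04", "2009-12-04", "2010-01-08"))
)
bills$days <- as.integer(bills$end - bills$begin)

# Hypothetical helper: expand bills to one row per day, then average by month
monthly_avg <- function(bills, include_end = FALSE) {
  n   <- bills$days + as.integer(include_end)   # days to generate per bill
  idx <- rep(seq_len(nrow(bills)), n)           # bill index for every day
  day <- bills$begin[idx] + (sequence(n) - 1L)  # begin + 0, 1, 2, ...
  daily <- data.frame(acct   = bills$acct[idx],
                      month  = format(day, "%Y-%m"),
                      perday = (bills$amount / bills$days)[idx])
  out <- aggregate(perday ~ acct + month, data = daily, FUN = mean)
  out[order(out$acct, out$month), ]
}

res_excl <- monthly_avg(bills)                      # end date excluded
res_incl <- monthly_avg(bills, include_end = TRUE)  # end date included
print(res_excl)
print(res_incl)
```

For account 2242 in 2009-12 the first call gives about 601.43 and the second about 595.40, which is the difference between the data.table output earlier in the thread and the output of the loop above.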