R: aggregating time series groups of irregular length
I think this is a split-apply-combine problem, but with a time series twist. My data consists of irregular counts and I need to compute some summary statistics on each group of counts.
Here's a snapshot of the data:
And here it is for your console:
library(xts)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)
I have no idea how to construct the max and cumSum columns (maxCount and sumCount above), especially since each count series has an irregular length. Since I won't always know the start and end points of a count series, I'm lost trying to figure out the index of these groups.
Thanks for your help!
UPDATE: here is my for loop attempting to calculate cumSum. It doesn't produce the cumulative sum, just the returns that are needed; I'm still unsure how to apply functions to these ranges!
xtsData <- cbind(xtsData,mySumCount=NA)
# find groups of returns
for(i in 1:nrow(xtsData)){
if(is.na(xtsData[i,"count"]) == FALSE){
xtsData[i,"mySumCount"] <- xtsData[i,"returns"]
}
else{
xtsData[i,"mySumCount"] <- NA
}
}
UPDATE 2: thank you, commenters!
# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]
# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"])
# this output is not accurate because .03 is displayed down the entire column, not just during periods when counts != NA. is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"])
SOLUTION:
# function to pad a vector with a 0
lagpad <- function(x, k) {
c(rep(0, k), x)[1 : length(x)]
}
# group the counts
x1 <- na.omit(transform(xtsData, g = cumsum(c(0, diff(!is.na(count)) == 1))))
# cumulative sum of the count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN =function(x) cumsum(replace(x, 1, 0))))
# max of the count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN =function(x) max(lagpad(x,1))))
merge(xtsData,z1$cumsumRet,z2$maxRet)
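For anyone reading the solution above, here is a quick check of what the lagpad helper does, using a hypothetical three-element vector of returns (the values are illustrative only). It shifts the values one position later, pads the front with 0, and drops the last element:

```r
# Same helper as in the solution above
lagpad <- function(x, k) {
  c(rep(0, k), x)[1:length(x)]
}

# Hypothetical returns for a single group
x <- c(0.027, -0.019, 0.005)

shifted <- lagpad(x, 1)   # front padded with one 0, last value dropped
shifted
max(shifted)              # max over 0 and all but the last return
```

Note that with k = 1 it is the last element, not the first, that falls out of the max.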
The code shown is not consistent with the output in the image and no explanation is provided, so it's not clear what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups, so we will address that.
To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.
The following code computes a vector the same length as count by first setting each NA position to FALSE and each non-NA position to TRUE. It then differences each position of that vector with the prior position; to do that it implicitly converts FALSE to 0 and TRUE to 1. Next we convert this last result to a logical vector which is TRUE for each component equal to 1 and FALSE otherwise. Since the first component of the vector that is differenced has no prior position, we prepend 0 for it. The prepending operation implicitly converts the TRUE and FALSE values just generated to 1 and 0 respectively. Taking the cumsum fills in the first group with 1, the second with 2 and so on. Finally, omit the NA rows:
x <- na.omit(transform(x, g = cumsum(c(0, diff(!is.na(count)) == 1))))
giving:
> x
returns count maxCount sumCount g
2010-11-26 -0.009 1 0.030 0.000 1
2010-12-03 0.030 1 0.030 0.030 1
2010-12-10 0.013 2 0.030 0.042 1
2010-12-17 0.003 2 0.030 0.045 1
2010-12-24 0.010 3 0.030 0.056 1
2010-12-31 0.001 4 0.030 0.056 1
2011-01-07 0.011 5 0.030 0.067 1
2011-01-14 0.017 6 0.030 0.084 1
2011-01-21 -0.008 7 0.030 0.077 1
2011-01-28 -0.005 7 0.030 0.071 1
2011-02-04 0.027 7 0.030 0.098 1
2011-02-11 0.014 7 0.030 0.112 1
2011-02-18 0.010 7 0.030 0.123 1
2011-03-18 0.027 1 0.027 0.000 2
2011-03-25 -0.019 2 0.027 -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27
1 2 16 17 18 21 22
attr(,"class")
[1] "omit"
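The steps described above can be traced one at a time on a short vector. This is only a sketch: the toy count vector and the intermediate names present and starts are made up for illustration and are not part of the original code.

```r
# Hypothetical toy vector standing in for the count column
count <- c(NA, 1, 1, 2, NA, NA, 1, 2, NA)

present <- !is.na(count)              # TRUE where a count exists
starts  <- c(0, diff(present) == 1)   # 1 where a run of counts begins, else 0
g       <- cumsum(starts)             # running group number for every row

g            # NA rows carry the number of the preceding group
g[present]   # labels for the rows that survive na.omit
```

The NA rows that follow a group inherit that group's number, which is why they are dropped afterwards. One caveat under this construction: if the series begins with a non-NA count, the first group is labelled 0 rather than 1, although the labels still distinguish the groups.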
You can now use ave to perform any calculations you like. For example, to take cumulative sums of returns by group:
transform(x, cumsumRet = ave(returns, g, FUN = cumsum))
Replace cumsum with any other function that is suitable for use with ave.
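As a small illustration of swapping the function, here is a sketch on a plain data frame. The toy data frame and the column names cumsumRet and maxRet are assumptions chosen for the example; only the returns and g names come from the code above.

```r
# Hypothetical toy data: two groups already labelled by g
toy <- data.frame(returns = c(-0.009, 0.030, 0.013, 0.027, -0.019),
                  g       = c(1, 1, 1, 2, 2))

# ave() applies FUN within each group and returns a vector as long as its input.
# FUN must be passed by name, because unnamed arguments are treated as grouping variables.
toy$cumsumRet <- ave(toy$returns, toy$g, FUN = cumsum)  # running total per group
toy$maxRet    <- ave(toy$returns, toy$g, FUN = max)     # group max, repeated on each row
toy
```

A function that returns a single value, such as max, is recycled across every row of its group, so the result lines up with the original rows.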
Ah, so "count" defines the groups and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.
library(xts)
library(data.table)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
DT <- data.table(date, returns, count)
DT[!is.na(count),max:=max(returns),by=count]
DT[!is.na(count),cumSum:= cumsum(returns),by=count]
# if you need an xts object at the end:
xtsData <- xts(cbind(DT$returns,DT$count, DT$max,DT$cumSum),DT$date)
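One thing to watch: by=count groups rows by the value of count, so rows with the same count value in different runs fall into the same group. If each consecutive run of non-NA counts should be treated as its own group instead, a run id can be built first. Below is a minimal base R sketch using rle(); the toy vector and the names present and runId are assumptions for illustration, not part of the code above.

```r
# Hypothetical toy vector standing in for the count column
count <- c(NA, NA, 1, 1, 2, NA, NA, 1, 2, NA)

present <- !is.na(count)
r <- rle(present)                 # run-length encoding of TRUE/FALSE stretches

# Number the TRUE runs 1, 2, ...; FALSE (NA) runs get NA
r$values <- ifelse(r$values, cumsum(r$values), NA)

runId <- inverse.rle(r)           # expand back to one id per row
runId
```

A column like runId could then be supplied to by= in place of count.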