
How to calculate summary statistics within specified date/time ranges in a time series, using an input of multiple start and end dates?

I have a (dummy) data frame with time series data:

datetime <- as.POSIXct(seq(ISOdate(2012,12,22), ISOdate(2012,12,23), by="hour"), tz='EST')
data <- rnorm(25, 10, 5)
df <- data.frame(datetime, data)

I also have a separate data frame with start and end times as its two columns:

start <- as.POSIXct(c('2012/12/22 19:53', '2012/12/22 23:05'), tz='gmt')
end <- as.POSIXct(c('2012/12/22 21:06', '2012/12/22 23:58'), tz='gmt')
index <- data.frame(start, end)

What I'd like to do is "feed" the 'index' data frame to the main data frame 'df' and, for each start/end date-time pair, find the average value of 'data' within that range. This would be equivalent to subsetting 'df' manually for each start/end time, but done in one combined step. (FYI, my real data set has years of data, and a hundred date/time ranges I want to feed it.)

The end goal is to have three columns: start time, end time, and the average numeric value of 'data' within those times.
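For a single range, the manual subset described above might look like the following sketch. It rebuilds the dummy data with a fixed seed (the seed is an addition, not part of the original post) and assumes all timestamps are in GMT; the names `start1`, `end1`, `in_range` and `one_mean` are illustrative only.

```r
# Rebuild the dummy data; set.seed() is added here only for reproducibility.
set.seed(1)
datetime <- seq(ISOdate(2012, 12, 22), ISOdate(2012, 12, 23), by = "hour")  # hourly, GMT, starting at 12:00
df <- data.frame(datetime, data = rnorm(25, 10, 5))

# One start/end pair, handled by hand.
start1 <- as.POSIXct("2012-12-22 19:53", tz = "GMT")
end1   <- as.POSIXct("2012-12-22 21:06", tz = "GMT")

# Logical mask of the rows whose timestamp falls inside [start1, end1].
in_range <- df$datetime >= start1 & df$datetime <= end1
one_mean <- mean(df$data[in_range])
```

Here only the 20:00 and 21:00 observations fall inside the range, so `one_mean` is the average of those two values. Repeating this for every row of 'index' is the step the question asks to automate.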

In general, you don't want to grow a data frame one row at a time by calling rbind, because it is very inefficient (see the second circle of The R Inferno for details). In your case, you can use sapply to replicate the same logic:

index$mean <- sapply(1:nrow(index), function(i) mean(df[df$datetime >= index$start[i] &
                                                        df$datetime <= index$end[i],2]))
index
#                 start                 end     mean
# 1 2012-12-22 19:53:00 2012-12-22 21:06:00 9.563336
# 2 2012-12-22 23:05:00 2012-12-22 23:58:00      NaN
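With years of data and many ranges, it may be worth avoiding a full scan of 'df' per range. The sketch below is one possible alternative, not part of the original answer: it uses base R's `findInterval` together with a cumulative sum. The helper name `range_means` is hypothetical, and it assumes the timestamps are sorted in ascending order, the values contain no NA, and ranges are closed on both ends as in the sapply version.

```r
# Hypothetical helper (not from the original post): mean of `values` within
# each closed range [start, end]. Assumes `datetime` is sorted ascending and
# `values` contains no NA.
range_means <- function(datetime, values, start, end) {
  t <- as.numeric(datetime)
  prefix <- c(0, cumsum(values))  # prefix[k + 1] is the sum of the first k values
  lo <- findInterval(as.numeric(start), t, left.open = TRUE)  # number of timestamps < start
  hi <- findInterval(as.numeric(end), t)                      # number of timestamps <= end
  (prefix[hi + 1] - prefix[lo + 1]) / (hi - lo)               # 0 / 0 yields NaN for empty ranges
}

# Usage with deterministic values (1..25) in place of rnorm(), so the result is easy to check.
datetime <- seq(ISOdate(2012, 12, 22), ISOdate(2012, 12, 23), by = "hour")
values <- as.numeric(1:25)
index <- data.frame(
  start = as.POSIXct(c("2012-12-22 19:53", "2012-12-22 23:05"), tz = "GMT"),
  end   = as.POSIXct(c("2012-12-22 21:06", "2012-12-22 23:58"), tz = "GMT")
)
index$mean <- range_means(datetime, values, index$start, index$end)
```

The first range covers the 9th and 10th hourly values, so its mean is 9.5; the second range contains no observations and comes back as NaN, matching the behaviour of the sapply approach.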

I figured out how to do it with a for loop. If anyone has a more efficient solution, that would be great. The for loop solution:

d <- data.frame()
# For each start/end pair, append the mean of 'data' within that range
for (i in 1:nrow(index)) {
    d <- rbind(d, mean(subset(df, datetime >= index[i,1] &
                                  datetime <= index[i,2])[,2]))
}
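If a loop is preferred, the repeated rbind can be avoided by allocating the result vector once and filling it in place. This is a minimal sketch under the same assumptions as the dummy data (GMT timestamps, closed ranges); the seed and the names `means`, `in_range` and `result` are additions for illustration.

```r
# Rebuild the dummy data; set.seed() is added here only for reproducibility.
set.seed(42)
datetime <- seq(ISOdate(2012, 12, 22), ISOdate(2012, 12, 23), by = "hour")
df <- data.frame(datetime, data = rnorm(25, 10, 5))
index <- data.frame(
  start = as.POSIXct(c("2012-12-22 19:53", "2012-12-22 23:05"), tz = "GMT"),
  end   = as.POSIXct(c("2012-12-22 21:06", "2012-12-22 23:58"), tz = "GMT")
)

means <- numeric(nrow(index))  # allocate the result vector once
for (i in seq_len(nrow(index))) {
  in_range <- df$datetime >= index$start[i] & df$datetime <= index$end[i]
  means[i] <- mean(df$data[in_range])  # NaN when no rows fall in the range
}

# Three columns: start, end, and the mean of 'data' within each range.
result <- cbind(index, mean = means)
```

Using `seq_len(nrow(index))` instead of `1:nrow(index)` also keeps the loop from running when 'index' has zero rows.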


 