Convert time series data from seconds to hourly means in R

Question

Note: I have reframed the previous question as suggested in the comments.

I am using three different packages (dplyr, data.table, and xts) to aggregate my seconds-resolution data to hourly means. To my surprise, xts behaves differently from the other two. The issues with xts are:

It produces one more observation than the other two.
The hourly means it calculates are completely different from those of the other two.

Here is condensed code for testing purposes:

library(xts)
library(data.table)
library(dplyr)
t2 <- as.POSIXct(seq(from = 1438367408, to = 1440959383, by = 30), origin = "1970-01-01")
dframe <- data.frame(timestamp=t2, power=rnorm(length(t2)))
#using xts
x <- xts(dframe$power,dframe$timestamp)
h1 <- period.apply(x, endpoints(x, "hours"), mean)
h1 <- data.frame(timestamp=trunc(index(h1),'hours'), power=coredata(h1))
#using data.table
h2 <- setDT(dframe)[, list(power = mean(power)), by = .(timestamp = as.POSIXct(cut(timestamp, 'hours')))]
#using dplyr
h3 <- dframe %>% group_by(timestamp= as.POSIXct(cut(timestamp, 'hour'))) %>% summarise(power=mean(power))

Output sizes:

> dim(h1)
[1] 721   2
> dim(h2)
[1] 720   2
> dim(h3)
[1] 720   2

Hourly means:

> head(h1)
            timestamp       power
1 2015-08-01 00:00:00  0.04485894
2 2015-08-01 01:00:00 -0.02299071
> head(h2) # identical to head(h3)
             timestamp       power
1: 2015-08-01 00:00:00  0.10057538
2: 2015-08-01 01:00:00 -0.07456292

The extra observation in h1:

> tail(h1)
              timestamp        power
719 2015-08-30 22:00:00  0.069544538
720 2015-08-30 23:00:00  0.011673835
721 2015-08-30 23:00:00 -0.053858563

Clearly there are two observations for the last hour of the day. Normally, there should be only one.

My system information:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3      data.table_1.9.7 xts_0.9-7        zoo_1.7-12      

loaded via a namespace (and not attached):
 [1] lazyeval_0.1.10 magrittr_1.5    R6_2.1.1        assertthat_0.1  parallel_3.2.2  DBI_0.3.1       tools_3.2.2    
 [8] Rcpp_0.12.1     grid_3.2.2      chron_2.3-47    lattice_0.20-33

Notes:

The original dataset can be found at the link.
I want a solution to this issue because, in my implementation scenario, xts is nearly 35 times faster than the other two.

Answer 1

This looks like it might be a bug in endpoints, triggered because your local timezone is not a whole-hour offset from UTC. I can replicate the issue if I set my local timezone to yours.

R> Sys.setenv(TZ="Asia/Kolkata")
R> x <- xts(dframe$power,dframe$timestamp)
R> h <- period.apply(x, endpoints(x, "hours"), mean)
R> head(h)
                        [,1]
2015-08-01 00:29:31 124.9055
2015-08-01 01:29:31 129.7197
2015-08-01 02:29:31 139.0899
2015-08-01 03:29:32 145.6592
2015-08-01 04:29:32 153.6840
2015-08-01 05:29:32 114.4809

Note that the endpoints fall on the half hour, rather than at the end of the hour. This is because Asia/Kolkata is UTC+05:30 and endpoints does all its calculations on times represented in UTC.
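To make the half-hour effect concrete, here is a minimal base-R sketch. It does not call xts; it only imitates hourly bucketing on the underlying UTC epoch seconds, which is an assumption based on the explanation above rather than a copy of what endpoints does internally. The variable names are illustrative.

```r
# Base R only; no packages required.
# Two local hours of 30-second timestamps in Asia/Kolkata (UTC+05:30).
zone <- "Asia/Kolkata"
t <- seq(as.POSIXct("2015-08-01 00:00:00", tz = zone),
         by = 30, length.out = 240)

# Hourly buckets computed on UTC epoch seconds
# (assumed stand-in for grouping done in UTC).
utc_bucket <- floor(as.numeric(t) / 3600)

# Hourly buckets computed on local wall-clock time.
local_bucket <- format(t, "%Y-%m-%d %H", tz = zone)

n_utc   <- length(unique(utc_bucket))    # 3 buckets: two partial, one full
n_local <- length(unique(local_bucket))  # 2 buckets: the two local hours

# Local time at which the first UTC-based bucket ends and the next begins.
first_break <- t[which(diff(utc_bucket) != 0)[1] + 1]
format(first_break, "%H:%M:%S", tz = zone)  # "00:30:00"
```

Two local hours split into three UTC-aligned buckets, with the breaks at half past the local hour. The same mechanism would account for both symptoms in the question: one extra group, and means taken over windows shifted by thirty minutes.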

You can avoid this by explicitly setting the timezone for the POSIXct object to UTC.

require(xts)
require(dplyr)
require(data.table)
Sys.setenv(TZ="Asia/Kolkata")

dframe <- read.csv("~/ap601.csv",head=TRUE,sep=",")
# set timezone on POSIXct object
dframe$timestamp <- as.POSIXct(dframe$timestamp, tz="UTC")

#using xts
x <- xts(dframe$power, dframe$timestamp)
h <- period.apply(x, endpoints(x, "hours"), mean)
h1 <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
# using data.table
h2 <- setDT(dframe)[, list(power = mean(power)), by = .(timestamp = cut(timestamp, 'hour'))]
# using dplyr
h3 <- dframe %>% group_by(timestamp= cut(timestamp, 'hour')) %>% summarise(power=mean(power))

all.equal(h1$power, h2$power)  # TRUE
all.equal(h1$power, h3$power)  # TRUE

Here's a work-around to get the same results without setting the timezone for the POSIXct column to UTC. Note that this may not work for timezones with Daylight Saving Time (Asia/Kolkata does not observe DST).

Basically, the idea is to subtract half an hour from the local time when calculating the endpoints, so that the underlying UTC time aligns on the hour.

dframe <- read.csv("~/ap601.csv",head=TRUE,sep=",")
dframe$timestamp <- as.POSIXct(dframe$timestamp)

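# rebuild the xts object so its index uses the local-time timestamps
x <- xts(dframe$power, dframe$timestamp)
# subtract half an hour from the index when calculating endpoints
h <- period.apply(x, endpoints(index(x)-3600*0.5, 'hours'), mean)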
h1 <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
all.equal(h1$power, h2$power)  # TRUE
all.equal(h1$power, h3$power)  # TRUE
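
As a sketch of how the fixed half-hour shift might be generalized, the shift can be derived from the timestamps themselves instead of being hard-coded. The example below uses base R only and illustrative names (zone, wall_as_utc, offset). It again assumes that hourly grouping happens on UTC epoch seconds, and it has not been checked against DST transitions.

```r
# Base R only; no packages required.
zone <- "Asia/Kolkata"
t <- seq(as.POSIXct("2015-08-01 00:00:00", tz = zone),
         by = 30, length.out = 240)

# Re-read each local wall-clock time as if it were UTC; the difference
# from the true instant is the zone's UTC offset in seconds.
wall_as_utc <- as.POSIXct(format(t, "%Y-%m-%d %H:%M:%S", tz = zone), tz = "UTC")
offset <- as.numeric(wall_as_utc) - as.numeric(t)
unique(offset)  # 19800 seconds, i.e. 5.5 hours

# Shifting by the offset makes UTC-based hourly buckets line up
# with local wall-clock hours.
shifted_bucket <- floor((as.numeric(t) + offset) / 3600)
local_bucket   <- format(t, "%Y-%m-%d %H", tz = zone)

# Every local hour maps to exactly one shifted bucket.
aligned <- all(tapply(shifted_bucket, local_bucket,
                      function(b) length(unique(b))) == 1)
aligned
```

For Asia/Kolkata, adding 19800 seconds and subtracting 1800 seconds differ by exactly six hours, so both put the bucket boundaries in the same places. With xts, the shifted index could presumably be passed as endpoints(index(x) + offset, 'hours') in the same way as the fixed subtraction above, but that is untested here.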
