Convert time-series data from seconds to hourly means in R

Note: I have re-framed the previous question as suggested in the comments.

I am using three different packages (dplyr, data.table and xts) to aggregate my seconds-resolution data to hourly means. To my surprise, xts behaves differently from the other two. The issues with xts are:

  • It returns one more observation than the other two
  • The hourly means it calculates are completely different from those of the other two

Here is the condensed code for your testing purposes:

library(xts)
library(data.table)
library(dplyr)
t2 <- as.POSIXct(seq(from = 1438367408, to = 1440959383, by = 30), origin = "1970-01-01")
dframe <- data.frame(timestamp=t2, power=rnorm(length(t2)))
#using xts
x <- xts(dframe$power,dframe$timestamp)
h1 <- period.apply(x, endpoints(x, "hours"), mean)
h1 <- data.frame(timestamp=trunc(index(h1),'hours'), power=coredata(h1))
#using data.table
h2 <- setDT(dframe)[, list(power= mean(power)) ,(timestamp= as.POSIXct(cut(timestamp, 'hours')))]
#using dplyr
h3 <- dframe %>% group_by(timestamp= as.POSIXct(cut(timestamp, 'hour'))) %>% summarise(power=mean(power))

Outputs in regard to size:

> dim(h1)
[1] 721   2
> dim(h2)
[1] 720   2
> dim(h3)
[1] 720   2

Outputs in regard to hourly means:

> head(h1)
            timestamp       power
1 2015-08-01 00:00:00  0.04485894
2 2015-08-01 01:00:00 -0.02299071
> head(h2) # same as head(h3)
             timestamp       power
1: 2015-08-01 00:00:00  0.10057538
2: 2015-08-01 01:00:00 -0.07456292

Extra observation in case of h1:

> tail(h1)
              timestamp        power
719 2015-08-30 22:00:00  0.069544538
720 2015-08-30 23:00:00  0.011673835
721 2015-08-30 23:00:00 -0.053858563

Clearly, there are two observations for the last hour of the day. Normally, there should be only one.

My system information:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3      data.table_1.9.7 xts_0.9-7        zoo_1.7-12      

loaded via a namespace (and not attached):
 [1] lazyeval_0.1.10 magrittr_1.5    R6_2.1.1        assertthat_0.1  parallel_3.2.2  DBI_0.3.1       tools_3.2.2    
 [8] Rcpp_0.12.1     grid_3.2.2      chron_2.3-47    lattice_0.20-33

Note:

  • Original dataset can be found at the link
  • I want a solution to this issue because, in my implementation scenario, xts is nearly 35 times faster than the other two

This looks like it might be a bug in endpoints because your local timezone is not a full hour offset from UTC. I can replicate the issue if I set my local timezone to yours.

R> Sys.setenv(TZ="Asia/Kolkata")
R> x <- xts(dframe$power,dframe$timestamp)
R> h <- period.apply(x, endpoints(x, "hours"), mean)
R> head(h)
                        [,1]
2015-08-01 00:29:31 124.9055
2015-08-01 01:29:31 129.7197
2015-08-01 02:29:31 139.0899
2015-08-01 03:29:32 145.6592
2015-08-01 04:29:32 153.6840
2015-08-01 05:29:32 114.4809

Note that the endpoints fall at half-hour increments, rather than at the end of the hour. This is because Asia/Kolkata is UTC+05:30 and endpoints does all its calculations on times represented in UTC.
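To make the half-hour boundary concrete, here is a minimal base-R sketch. It does not call xts at all; it only assumes that hourly bucketing on UTC amounts to integer-dividing the epoch seconds by 3600, which is an approximation of what endpoints does, not its actual source. The timestamps are made up for illustration.

```r
# Four made-up timestamps inside the same local hour (00:10 to 00:55 IST).
ts <- as.POSIXct("2015-08-01 00:10:00", tz = "Asia/Kolkata") + c(0, 15, 25, 45) * 60

# Bucket by whole hours since the epoch, i.e. on the UTC representation.
utc_bucket <- floor(as.numeric(ts) / 3600)

# Bucket by the local wall-clock hour, similar to cut(timestamp, "hour").
local_bucket <- format(ts, "%Y-%m-%d %H", tz = "Asia/Kolkata")

# Side-by-side view: the UTC bucket changes at local 00:30,
# while the local bucket stays the same for all four rows.
print(data.frame(
  local = format(ts, "%H:%M", tz = "Asia/Kolkata"),
  utc   = format(ts, "%H:%M", tz = "UTC"),
  utc_bucket,
  local_bucket
))
```

The first two rows (18:40 and 18:55 UTC) share one UTC bucket and the last two (19:05 and 19:25 UTC) share the next, so a mean taken per UTC bucket mixes halves of two different local hours.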

You can avoid this by explicitly setting the timezone for the POSIXct object to UTC.

require(xts)
require(dplyr)
require(data.table)
Sys.setenv(TZ="Asia/Kolkata")

dframe <- read.csv("~/ap601.csv",head=TRUE,sep=",")
# set timezone on POSIXct object
dframe$timestamp <- as.POSIXct(dframe$timestamp, tz="UTC")

#using xts
x <- xts(dframe$power, dframe$timestamp)
h <- period.apply(x, endpoints(x, "hours"), mean)
h1 <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
# using data.table
h2 <- setDT(dframe)[, list(power= mean(power)) ,(timestamp= cut(timestamp, 'hour'))]
# using dplyr
h3 <- dframe %>% group_by(timestamp= cut(timestamp, 'hour')) %>% summarise(power=mean(power))

all.equal(h1$power, h2$power)  # TRUE
all.equal(h1$power, h3$power)  # TRUE
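One thing worth keeping in mind about the tz="UTC" approach: it parses the wall-clock strings from the CSV as if they were UTC, so the printed labels stay the same but the stored instants move by the UTC offset. That is usually fine when you only need hourly means labelled by wall-clock time. A small base-R sketch with a made-up timestamp string:

```r
s <- "2015-08-01 00:10:00"  # made-up wall-clock string, as it might appear in a CSV

as_local <- as.POSIXct(s, tz = "Asia/Kolkata")  # the instant 2015-07-31 18:40 UTC
as_utc   <- as.POSIXct(s, tz = "UTC")           # the instant 2015-08-01 00:10 UTC

# Same label, different instants: they are 5.5 hours apart.
offset_hours <- as.numeric(difftime(as_utc, as_local, units = "hours"))

# Seconds past the top of the UTC hour for each parse.
# Parsed as UTC, the wall-clock minute (10) lines up with the UTC hour grid.
secs_past_hour_utc   <- as.numeric(as_utc) %% 3600    # 10 minutes
secs_past_hour_local <- as.numeric(as_local) %% 3600  # 40 minutes
```

If the data later needs to be compared with other series stored as true instants, remember that the UTC-parsed index is shifted relative to them.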

Here's a work-around to get the same results without setting the timezone for the POSIXct column to UTC. Note that this may not work for timezones with Daylight Saving Time (Asia/Kolkata does not observe any DST).

Basically, the idea is to subtract half an hour from the local time when calculating the endpoints, so that the underlying UTC time aligns on the hour.

dframe <- read.csv("~/ap601.csv",head=TRUE,sep=",")
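dframe$timestamp <- as.POSIXct(dframe$timestamp)
# rebuild the xts object so its index uses the local timezone
x <- xts(dframe$power, dframe$timestamp)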

# subtract half an hour from the index when calculating endpoints
h <- period.apply(x, endpoints(index(x)-3600*0.5, 'hours'), mean)
h1 <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
all.equal(h1$power, h2$power)  # TRUE
all.equal(h1$power, h3$power)  # TRUE
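The hard-coded half hour is specific to UTC+05:30. As a rough sketch of how the shift could be derived for another fixed-offset timezone, the base-R helper below (utc_offset_secs is a hypothetical name, not part of xts) computes the zone's offset from UTC and keeps only the sub-hour part. It adds the shift instead of subtracting it; for a 30-minute offset the two differ by exactly one hour, so the hourly grouping is the same. The DST caveat above still applies.

```r
# Hypothetical helper: offset of timezone `tz` from UTC, in seconds, at time `t`.
# It re-reads the wall-clock label as UTC and compares it with the true instant.
utc_offset_secs <- function(t, tz) {
  wall <- format(t, "%Y-%m-%d %H:%M:%S", tz = tz)
  as.numeric(as.POSIXct(wall, tz = "UTC")) - floor(as.numeric(t))
}

# Made-up timestamps inside one local hour (00:10 to 00:55 IST).
ts <- as.POSIXct("2015-08-01 00:10:00", tz = "Asia/Kolkata") + c(0, 15, 25, 45) * 60

off   <- utc_offset_secs(ts[1], "Asia/Kolkata")  # full offset in seconds
shift <- off %% 3600                             # sub-hour part of the offset

# After shifting, whole-hour buckets on epoch seconds follow local hours.
bucket <- floor((as.numeric(ts) + shift) / 3600)
```

With xts, the same shift would presumably be applied to the index passed to endpoints, as in the work-around above, while the unshifted data is still passed to period.apply.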
