
Using R data.table to subset high frequency time series (a replacement of xts functionality with data.table)

I would like to get all the data between certain times each day using data.table.

Is this the most efficient way (in terms of speed and memory) to do this kind of subsetting?

library(data.table)
R.data.table <- data.table(Time = Sys.time() + 1:86400, runif(86400))

R.data.table[Time > as.POSIXct('2016-09-18 08:00:00') & Time < as.POSIXct('2016-09-18 09:00:00')]

I know I can use xts, but I like working with data.table because I might use these subsetted data sets for prediction models, so I don't need to convert.

I have looked at the data.table help on IDate and ITime, but I don't really know how to put it all together. Are they faster and easy to work with interactively?

Take operations like the following (these are only examples; I'm not asking how to do them directly): give me all the data for the last 2 business days of each month, or for all business day hours. Is doing it the way I do above the most efficient way, or are there better ways to manipulate time series with data tables in R?

Is doing it like I do above the most efficient way to do it, or are there better ways to manipulate time series with data tables in R?

The most efficient way to do this kind of subsetting (range subsetting) is to use the between function. Unfortunately it currently suffers from a bug, so it is no faster than the approach you are using. The bug has been fixed; once the fix is merged, the devel package will be published in our CRAN-like repo (including binaries). Another reason to use between is that it is more likely to be internally optimised in the future, giving speed/memory improvements, as there is still room for improvement. There is a third way to get the expected answer, using a non-equi join, but it will be the slowest of the three.

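library(data.table)
# one observation per second for 24 hours, starting just after 06:00:00
d = data.table(Time = as.POSIXct("2016-09-18 06:00:00") + 1:86400, runif(86400))
dn = as.POSIXct('2016-09-18 08:00:00')  # lower bound (excluded)
up = as.POSIXct('2016-09-18 09:00:00')  # upper bound (excluded)
# 1. two comparisons, as in the question
d[Time > dn & Time < up]
# 2. between(), with the bounds excluded
d[between(Time, dn, up, incbounds=FALSE)]
# 3. non-equi join
d[.(dn=dn, up=up), on=.(Time>dn, Time<up)]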
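To convince yourself that the first two approaches select the same rows, and to time them on your own machine, something like the following sketch could be used. It assumes a fixed `tz = "UTC"` and names the value column `V2` explicitly; both are choices made here for reproducibility, not part of the code above. The timings depend on your data.table version, so no numbers are shown.

```r
library(data.table)

# Same toy data as above: one observation per second for 24 hours.
# tz = "UTC" is an assumption made here so results do not depend on the local time zone.
d  <- data.table(Time = as.POSIXct("2016-09-18 06:00:00", tz = "UTC") + 1:86400,
                 V2 = runif(86400))
dn <- as.POSIXct("2016-09-18 08:00:00", tz = "UTC")
up <- as.POSIXct("2016-09-18 09:00:00", tz = "UTC")

r1 <- d[Time > dn & Time < up]                     # two comparisons
r2 <- d[between(Time, dn, up, incbounds = FALSE)]  # between(), bounds excluded

# Both select the seconds strictly between 08:00:00 and 09:00:00.
nrow(r1)
identical(r1$Time, r2$Time)

# Rough timings over 100 repetitions; results vary by machine and version.
system.time(for (i in 1:100) d[Time > dn & Time < up])
system.time(for (i in 1:100) d[between(Time, dn, up, incbounds = FALSE)])
```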

I have looked at the data.table help on IDate and ITime, but I don't really know how to put it all together. Are they faster and easy to work with interactively?

They can be faster, and they are precise. The I prefix stands for Integer. They were introduced because POSIXct is a numeric (double) type, so it suffers from floating point arithmetic problems. Joining or grouping on floating point values might give different answers on different platforms. The integer type is much more portable and can be optimised for operations like sorting or grouping.
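As a rough sketch of how IDate and ITime can be put together for "between certain times each day": split the datetime with `IDateTime()` and filter on the time-of-day column only. The two-day span, the `tz = "UTC"` setting, and the names `lo`, `hi`, `morning` and `V2` are assumptions made for this illustration.

```r
library(data.table)

# Two days of per-second observations, so the "each day" filter has two days to hit.
d <- data.table(Time = as.POSIXct("2016-09-18 06:00:00", tz = "UTC") + 1:(2 * 86400),
                V2 = runif(2 * 86400))

# Split the datetime into a date column and a time-of-day column.
d[, c("idate", "itime") := IDateTime(Time)]

typeof(d$Time)   # POSIXct is stored as double
typeof(d$itime)  # ITime is stored as integer (seconds since midnight)

# Rows strictly between 08:00:00 and 09:00:00, on every day in the table.
lo <- as.ITime("08:00:00")
hi <- as.ITime("09:00:00")
morning <- d[itime > lo & itime < hi]

# Number of selected rows per day.
morning[, .N, by = idate]
```

Because the filter only looks at `itime`, the same two bounds work for any number of days without building a POSIXct range per day.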


There is a pending feature request for a more precise datetime data type: Faster internal date/datetime implementation (with ns resolution..) https://github.com/Rdatatable/data.table/issues/1451


There is also a roadmap for new vignettes, "timeseries - ordered observations": https://github.com/Rdatatable/data.table/issues/3453. You might want to consult that issue for more of the features data.table offers for ordered datasets. Obviously it is just a tiny fraction of what xts offers, but it is usually highly optimised.
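To give a flavour of those ordered-data features, here is a small sketch using two of them: a rolling join (`roll = TRUE`), which carries the last observation forward to arbitrary query times, and `shift()`, which lags a column. The `price` column, the query table `q` and the values in it are made up for illustration.

```r
library(data.table)

# Hypothetical series: one observation every 10 minutes from 06:00 to 07:00.
d <- data.table(Time = as.POSIXct("2016-09-18 06:00:00", tz = "UTC") + seq(0, 3600, by = 600),
                price = c(10, 11, 12, 13, 14, 15, 16))
setkey(d, Time)

# Hypothetical query times that fall between observations.
q <- data.table(Time = as.POSIXct(c("2016-09-18 06:05:00", "2016-09-18 06:59:59"), tz = "UTC"))

# Rolling join: for each query time, take the last observation at or before it.
res <- d[q, on = "Time", roll = TRUE]
res

# Lagged difference between consecutive observations.
d[, change := price - shift(price)]
d
```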
