Performance characteristics of time series operations in R (mainly xts and data.table)

Question

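I am working on a new project involving large time series datasets, from which the relevant calculations are fed into a shiny app. Efficiency is therefore of interest to me. The operations are typically limited to elementary period conversions and subsequent summary statistics for risk metrics.

I am investigating which library/approach to use for building the calculation scripts. At the moment I am comfortable with xts and data.table. Although I could resort to libraries such as quantmod and TTR, I am hesitant to deploy black-box functions in production and would prefer to maintain full traceability.

So far I have run the following benchmark, in which a data.frame of daily prices is converted to monthly returns. The packages used are xts, data.table and quantmod (as a reference). The code is pasted below, but can also be found on GitHub.

Benchmark code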

# Simple return exercise: Daily Prices to Monthly Returns
# Input: Nx2 data.frame with columns (N days, price) 
# Output: Mx2 object with columns (M months, return)
# Three different functions: 1. xts, 2. data.table, 3. quantmod

rm(list = ls()); gc()

library(data.table) 
library(zoo)
library(xts)
library(ggplot2)
library(quantmod)

# Asset params
spot = 100
r = 0.01
sigma = 0.02
N = 1e5

# Input data: Nx2 data.frame (date, price)
pmat = data.frame( 
    date = seq.Date(as.Date('1970-01-01'), by = 1, length.out = N),
    price = spot * exp(cumsum((r - 0.5 * sigma**2) * 1/N + (sigma * (sqrt(1/N)) * rnorm(N, mean = 0, sd = 1))))
)

# Output functions

# 1. xts standalone
xtsfun = function(mat){
  xtsdf = as.xts(mat[, 2], order.by = mat[, 1])
  eom_prices = to.monthly(xtsdf)[, 4]   # column 4 is the monthly close
  mret = eom_prices/lag.xts(eom_prices) - 1
  # First month: return relative to the first daily price. as.numeric() drops
  # the indexes, so the division does not rely on the two indexes aligning.
  mret[1] = as.numeric(eom_prices[1])/as.numeric(xtsdf[1]) - 1
  mret
}

# 2. data.table standalone
dtfun = function(mat){
  dt = setNames(as.data.table(mat), c('V1', 'V2'))
  dt[, .(EOM = last(V2)), .(Month = as.yearmon(V1))][, .(Month, Return = EOM/shift(EOM, fill = first(mat[, 2])) - 1)]
}

# 3. quantmod (black box library)
qmfun = function(mat){
  qmdf = as.xts(mat[, 2], order.by = mat[, 1])
  monthlyReturn(qmdf)
}

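# Check 1 == 2 == 3 (all.equal() compares only two objects, so test pairwise):
ret_dt  = dtfun(pmat[1:1000, ])[, Return]
ret_xts = as.numeric(xtsfun(pmat[1:1000, ]))
ret_qm  = as.numeric(qmfun(pmat[1:1000, ]))
all.equal(ret_dt, ret_xts)
all.equal(ret_xts, ret_qm)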
    
# Benchmark
library(microbenchmark)
gc()

mbm = microbenchmark(
  xts = xtsfun(pmat),
  data.table = dtfun(pmat),
  quantmod = qmfun(pmat),
  times = 50
)

mbm

Results

For N = 1e5, the three approaches perform similarly:

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
        xts 20.62520 22.93372 25.14445 23.84235 27.25468  39.29402    50
 data.table 21.23984 22.29121 27.28266 24.05491 26.25416  98.35812    50
   quantmod 14.21228 16.71663 19.54709 17.19368 19.38106 102.56189    50

However, for N = 1e6 I observe a large performance gap for data.table:

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
        xts  296.8969  380.7494  408.7696  397.4292  431.1306  759.7227    50
 data.table 1562.3613 1637.8787 1669.8513 1651.4729 1688.2312 1969.4942    50
   quantmod  144.1901  244.2427  278.7676  268.4302  331.4777  418.7951    50

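I am curious what drives this result, particularly since data.table usually excels at large N. Of course, dtfun may simply be poorly written (and I would greatly appreciate any code improvements), but I get similar results with other approaches, including a self-join on the EOM dates and a cumprod of daily returns.

Do xts and/or quantmod benefit from internal Rcpp or equivalent calls that improve their performance at scale? Finally, if you know of any other competitive standalone solution (base?, dplyr?) for large time series, I am all ears.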

Answer 1

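The answer lies in how dates are handled in the data.table solution. Essentially, it uses the relatively slow ISOdate format. By contrast, when the grouping is done on an integer-based date key, the results turn in data.table's favor.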
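As an illustration of integer-based date grouping, here is a minimal sketch that replaces the `as.yearmon` key in `dtfun` with an integer `YYYYMM` key. The function name `dtfun_int` and the key layout are assumptions made for this example; this is not the code from the TSBenchmark repository, and it assumes the input is already sorted by date.

```r
library(data.table)

# Hypothetical variant of dtfun: group on an integer YYYYMM key
# instead of a zoo::yearmon key. Assumes rows are sorted by date.
dtfun_int = function(mat) {
  # IDate is data.table's integer-backed date class
  dt = data.table(V1 = as.IDate(mat[[1]]), V2 = mat[[2]])
  # Last price of each month; e.g. January 1970 gets the key 197001
  eom = dt[, .(EOM = V2[.N]), by = .(Month = year(V1) * 100L + month(V1))]
  # Month-over-month return; the first month is measured from the first price
  eom[, .(Month, Return = EOM / shift(EOM, fill = mat[[2]][1]) - 1)]
}
```

Calling `dtfun_int(pmat)` should give the same returns as `dtfun(pmat)`, with `Month` as an integer rather than a `yearmon`. Whether it closes the gap on a given machine is best checked by adding it to the `microbenchmark` call above.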

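I have updated the TSBenchmark repository with revised solutions for both xts and data.table. I am very grateful to Joshua Ulrich and Matt Dowle for the improvements they provided; they deserve full credit.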
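For the xts side, one traceable way to pick end-of-month prices without `to.monthly` is `endpoints()`. The sketch below is only an assumption about what such a variant could look like (the name `xtsfun_ep` is made up here) and is not necessarily what the repository contains.

```r
library(xts)

# Hypothetical variant of xtsfun based on endpoints()
xtsfun_ep = function(mat) {
  x = xts(mat[[2]], order.by = mat[[1]])
  # endpoints() returns 0 followed by the row number of the last
  # observation in each month; drop the leading 0 before subsetting
  ep = endpoints(x, on = "months")
  eom = x[ep[-1]]
  mret = eom / lag.xts(eom) - 1
  # First month: return relative to the first daily price
  mret[1] = as.numeric(eom[1]) / as.numeric(x[1]) - 1
  mret
}
```

Unlike `to.monthly`, the result keeps the actual end-of-month dates as its index rather than a `yearmon` index.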
