R: Decrease frequency of time series data by aggregating values in OHLC series

I have a high-frequency dataset of foreign exchange rates, down to the millisecond, which I would like to transform into lower-frequency, regular time series in R, e.g. minutely or 5-minutely OHLC series (open, high, low, close). The original dataset has four columns: one for the exchange rate, one for the timestamp (which includes both the date and time), and two more for the bid and ask prices. The data have been imported from a .csv file.

head(GBPUSD) and tail(GBPUSD) return the following:

# A tibble: 6 x 4
       X1                  X2      X3      X4
    <chr>              <dttm>   <dbl>   <dbl>  
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763  
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760  
3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759  
4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759  
5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759  
6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759


# A tibble: 6 x 4
       X1                  X2      X3      X4
    <chr>              <dttm>   <dbl>   <dbl>
1 GBP/USD 2017-06-30 20:59:56 1.30093 1.30300  
2 GBP/USD 2017-06-30 20:59:56 1.30121 1.30300  
3 GBP/USD 2017-06-30 20:59:56 1.30100 1.30390  
4 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452  
5 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447  
6 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447  

It seems you want to turn each column (bid, ask) into 4 columns (open, high, low, close), grouped by some time interval like 5 minutes. I appreciate @dmi3kno showing off a few tibbletime features, but I think that this might do more of what you want.

Note that this will change a bit in the next release of tibbletime, but currently under 0.0.2 this works.

For each 5-minute period, the open/high/low/close prices of both the bid and ask columns are taken.


library(tibbletime)
library(dplyr)

df <- create_series("2017-12-20 00:00:00" ~ "2017-12-20 01:00:00", "sec") %>% 
  mutate(bid = runif(nrow(.)),
         ask = bid + .0001)
df
#> # A time tibble: 3,601 x 3
#> # Index: date
#>    date                   bid    ask
#>  * <dttm>               <dbl>  <dbl>
#>  1 2017-12-20 00:00:00 0.208  0.208 
#>  2 2017-12-20 00:00:01 0.0629 0.0630
#>  3 2017-12-20 00:00:02 0.505  0.505 
#>  4 2017-12-20 00:00:03 0.0841 0.0842
#>  5 2017-12-20 00:00:04 0.986  0.987 
#>  6 2017-12-20 00:00:05 0.225  0.225 
#>  7 2017-12-20 00:00:06 0.536  0.536 
#>  8 2017-12-20 00:00:07 0.767  0.767 
#>  9 2017-12-20 00:00:08 0.994  0.994 
#> 10 2017-12-20 00:00:09 0.807  0.808 
#> # ... with 3,591 more rows

df %>%
  mutate(date = collapse_index(date, "5 min")) %>%
  group_by(date) %>%
  summarise_all(
    .funs = funs(
      open  = dplyr::first(.),
      high  = max(.),
      low   = min(.),
      close = dplyr::last(.)
    )
  )
#> # A time tibble: 13 x 9
#> # Index: date
#>    date                bid_o… ask_o… bid_h… ask_h…  bid_low ask_low bid_c…
#>  * <dttm>               <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>
#>  1 2017-12-20 00:04:59  0.208  0.208  1.000  1.000 0.00293  3.03e⁻³ 0.389 
#>  2 2017-12-20 00:09:59  0.772  0.772  0.997  0.997 0.000115 2.15e⁻⁴ 0.676 
#>  3 2017-12-20 00:14:59  0.457  0.457  0.995  0.996 0.00522  5.32e⁻³ 0.363 
#>  4 2017-12-20 00:19:59  0.586  0.586  0.997  0.997 0.00912  9.22e⁻³ 0.0339
#>  5 2017-12-20 00:24:59  0.385  0.385  0.998  0.998 0.0131   1.32e⁻² 0.0907
#>  6 2017-12-20 00:29:59  0.548  0.548  0.996  0.996 0.00126  1.36e⁻³ 0.320 
#>  7 2017-12-20 00:34:59  0.240  0.240  0.995  0.995 0.00466  4.76e⁻³ 0.153 
#>  8 2017-12-20 00:39:59  0.404  0.405  0.999  0.999 0.000481 5.81e⁻⁴ 0.709 
#>  9 2017-12-20 00:44:59  0.468  0.468  0.999  0.999 0.00101  1.11e⁻³ 0.0716
#> 10 2017-12-20 00:49:59  0.580  0.580  0.996  0.996 0.000336 4.36e⁻⁴ 0.395 
#> 11 2017-12-20 00:54:59  0.242  0.242  0.999  0.999 0.00111  1.21e⁻³ 0.762 
#> 12 2017-12-20 00:59:59  0.474  0.474  0.987  0.987 0.000858 9.58e⁻⁴ 0.335 
#> 13 2017-12-20 01:00:00  0.974  0.974  0.974  0.974 0.974    9.74e⁻¹ 0.974 
#> # ... with 1 more variable: ask_close <dbl>

Update: The post has been updated to reflect the changes in tibbletime 0.1.0.
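As an aside, the same 5-minute OHLC table can be built without tibbletime at all; here is a minimal sketch using lubridate::floor_date() to bucket the index (assuming the df with date/bid/ask columns generated above):

library(dplyr)
library(lubridate)

# Snap each timestamp down to the start of its 5-minute bucket, then
# summarise each bucket into open/high/low/close for bid and ask.
df %>%
  mutate(date = floor_date(date, "5 mins")) %>%
  group_by(date) %>%
  summarise(bid_open  = first(bid),  ask_open  = first(ask),
            bid_high  = max(bid),    ask_high  = max(ask),
            bid_low   = min(bid),    ask_low   = min(ask),
            bid_close = last(bid),   ask_close = last(ask))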

I changed the OP's original dataset a little, for the pedagogical/instructional reasons that will become clear below:

df <- data.frame(
  X1 = c("GBP/USD"),
  X2 = c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02", "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54", "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"),
  X3 = c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757, 1.30093, 1.30121, 1.30100, 1.30146, 1.30145, 1.30145),
  X4 = c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760, 1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447),
  stringsAsFactors = FALSE)

df

        X1                  X2      X3      X4
1  GBP/USD 2017-06-01 00:00:00 1.28756 1.28763
2  GBP/USD 2017-06-01 00:00:00 1.28754 1.28760
3  GBP/USD 2017-06-01 00:00:01 1.28754 1.28759
4  GBP/USD 2017-06-01 00:00:01 1.28753 1.28758
5  GBP/USD 2017-06-01 00:00:01 1.28752 1.28755
6  GBP/USD 2017-06-01 00:00:02 1.28757 1.28760
7  GBP/USD 2017-06-30 20:59:52 1.30093 1.30300
8  GBP/USD 2017-06-30 20:59:54 1.30121 1.30300
9  GBP/USD 2017-06-30 20:59:54 1.30100 1.30390
10 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452
11 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
12 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447

Now, when reducing the frequency, consecutive rows sharing the same timestamp form groups. So we must find the indices where each group starts, and where each group ends:

indices <- seq_along(df[,2])[!(duplicated(df[,2]))] # 1  3  6  7  8 10; the beginnings of groups (observations)
indices - 1   # 0  2  5  6  7   9; for finding the endings of groups
numberoflowfreq <- length(indices) # 6: number of groupings (obs.) for Low Freq data

To understand the pattern, write it out explicitly:

mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean(df[indices[2]:((indices -1)[3]),3]) # from 3 to 5
mean(df[indices[3]:((indices -1)[4]),3]) # from 6 to 6
mean(df[indices[4]:((indices -1)[5]),3]) # from 7 to 7
mean(df[indices[5]:((indices -1)[6]),3]) # from 8 to 9
mean(df[indices[6]:nrow(df),3]) # from 10 to 12

Simplify the pattern:

mean3rdColumn_1st <- mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean3rdColumn_Between <- sapply(2:(numberoflowfreq-1), function(i)  mean(df[indices[i]:((indices -1)[i+1]),3]) )
mean3rdColumn_Last <- mean(df[indices[6]:nrow(df),3]) # from 10 to 12
# 3rd column in low frequency data:    
c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last)

Similarly for the 4th column:

mean4thColumn_1st <- mean(df[1:((indices -1)[2]),4]) # from 1 to 2
mean4thColumn_Between <- sapply(2:(numberoflowfreq-1), function(i)  mean(df[indices[i]:((indices -1)[i+1]),4]) )
mean4thColumn_Last <- mean(df[indices[6]:nrow(df),4]) # from 10 to 12
# 4th column in low frequency data: 
c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last)

Collect all the pieces:

LowFrqData <- data.frame(
  X1 = c("GBP/USD"),
  X2 = df[indices, 2],
  X3 = c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last),
  x4 = c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last),
  stringsAsFactors = FALSE)
LowFrqData 

       X1                  X2       X3       x4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487

Now, the column X2 has unique timestamp values, and X3 and x4 hold the means of the relevant cells.

Also note that there may not be values for every minute in a range. One may insert NAs for such cases. On the other hand, one may neglect the effect of the irregularity in such cases, since the spacing of the observations would/may be the same for many observations and is therefore not so highly irregular. Also consider the fact that transforming the data into equally spaced observations using linear interpolation can introduce a number of significant and hard-to-quantify biases (see Scholes and Williams):

M. Scholes and J. Williams, "Estimating betas from nonsynchronous data", Journal of Financial Economics 5: 309-327, 1977.
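If you do want explicit NAs for the missing timestamps, here is a minimal base-R sketch (the grid range and the name fullgrid are illustrative):

fullgrid <- data.frame(
  X2 = format(seq(as.POSIXct("2017-06-01 00:00:00"),
                  as.POSIXct("2017-06-01 00:00:05"), by = "sec")),
  stringsAsFactors = FALSE)
# Left-join onto the complete grid; seconds with no observation get NA.
LowFrqPadded <- merge(fullgrid, LowFrqData, by = "X2", all.x = TRUE)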

Now, the regular 5-minute series part:

as.numeric(as.POSIXct("1970-01-01 03:00:00"))  # 0; starting point for ZERO seconds. "1970-01-01 03:01:00" equals 60.
as.numeric(as.POSIXct("2017-06-01 00:00:00")) # 1496264400
# Passed seconds after the first observation in the dataset
PassedSecs <- as.numeric(as.POSIXct(LowFrqData$X2)) - 1496264400

LowFrq5minuteRaw <- cbind(LowFrqData, PassedSecs, stringsAsFactors=FALSE)
LowFrq5minuteRaw

       X1                  X2       X3       x4 PassedSecs
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615          0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573          1
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600          2
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000    2581192
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450    2581194
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487    2581196

5 minutes means 5*60 = 300 seconds. So grouping by the integer quotient of division by 300 puts the observations into 5-minute bins.

LowFrq5minuteRaw2 <- cbind(LowFrqData, PassedSecs, QbyDto300 = PassedSecs%/%300, stringsAsFactors=FALSE)
LowFrq5minuteRaw2

       X1                  X2       X3       x4 PassedSecs QbyDto300
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615          0         0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573          1         0
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600          2         0
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000    2581192      8603
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450    2581194      8603
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487    2581196      8603

indices2 <- seq_along(LowFrq5minuteRaw2[,6])[!(duplicated(LowFrq5minuteRaw2[,6]))] # 1  4; the beginnings of groups

LowFrq5minute <- data.frame(
  X1 = c("GBP/USD"),
  X2 = LowFrq5minuteRaw2[indices2, 2],
  X3 = aggregate(X3 ~ QbyDto300, LowFrq5minuteRaw2, mean)[, 2],
  X4 = aggregate(x4 ~ QbyDto300, LowFrq5minuteRaw2, mean)[, 2])
LowFrq5minute

       X1                  X2       X3       X4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287596
2 GBP/USD 2017-06-30 20:59:52 1.301163 1.303646

X2 holds the timestamp of the first observation in each 5-minute interval, which represents that interval.

I think all of this would be easier with the aggregate function. Though, depending on the data, you might need to convert the datetime column to character first (in case the original data holds millisecond values). I recommend using lubridate to convert it back to datetime if you need; see the short sketch after the code below.

GBPUSD$X2 <- as.character(GBPUSD$X2) #optional; if the below yields bad results
GBPUSD$X2 <- substr(GBPUSD$X2, 1, 19) #optional; to get only upto minutes after above command
# get High values for both bid and ask prices:
GBPUSD_H <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=max)
# get Low values for both bid and ask prices:
GBPUSD_L <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=min)
# merging the High and low values together
GBPUSD_NEW <- merge(GBPUSD_H, GBPUSD_L, by=c("X1", "X2"), suffixes=c(".HIGH", ".LOW")) # base merge; merge is not an exported object of data.table
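And, as mentioned, converting the character timestamps back to datetimes afterwards is a one-liner with lubridate (ymd_hms() assumes the "YYYY-MM-DD HH:MM:SS" format used here):

library(lubridate)
GBPUSD_NEW$X2 <- ymd_hms(GBPUSD_NEW$X2)  # back to POSIXct for plotting/joins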

To get all high, low, open, and close values in one shot:

library(data.table)
GBPUSD <- data.table(GBPUSD, key=c("X1", "X2"))
GBPUSD_NEW <- GBPUSD[, list(X3.HIGH=max(X3), X3.LOW=min(X3), X3.OPEN=X3[1],
                            X3.CLOSE=X3[length(X3)], X4.HIGH=max(X4), X4.LOW=min(X4),
                            X4.OPEN=X4[1], X4.CLOSE=X4[length(X4)]), by=c("X1", "X2")]

However, for this to work, you first need to sort your data so that, within each second, the first value is the open and the last value is the close.
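A minimal sketch of that sort step (assuming X2 sorts correctly as "YYYY-MM-DD HH:MM:SS" text and the rows arrived in tick order):

# order() is stable here, so rows with equal timestamps keep their original
# tick order: X3[1] is the true open and X3[length(X3)] the true close.
GBPUSD <- GBPUSD[order(GBPUSD$X1, GBPUSD$X2), ]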

Now, if you need to use minutes instead of seconds (or hours), just adjust the substr accordingly. If you want more customization, like a 15-minute interval, I would suggest adding a helper column. Sample code:

GBPUSD$MIN <- floor(as.numeric(substr(GBPUSD$X2, 15, 16))/15) # 0,1,2,3: which quarter of the hour
GBPUSD$X2 <- paste0(substr(GBPUSD$X2, 1, 14), sprintf("%02d", GBPUSD$MIN*15), ":00") # e.g. 00:00 for 00:00-00:14:59

Please do not hesitate to ask if this does not fulfil your requirement.

PS: NAs in the key columns create problems in aggregate. Deal with them first.

GBPUSD$X2[is.na(GBPUSD$X2)] <- "2017-05-05 00:00:00" # example; be careful to use the same class and format for the replacement

This is a perfect example for when you want to try the awesome tibbletime package. I am going to generate my own data to make the point.

library(tibbletime)
library(dplyr)
df <- tibbletime::create_series(2017-12-20 + 01:06:00 ~ 2017-12-20 + 01:20:00, "sec") %>% 
         mutate(open=runif(nrow(.)),
                close=runif(nrow(.)))
df

This is now seconds-resolution data covering 15 minutes:

# A time tibble: 841 x 3
# Index: date
                  date       open       close
 *              <dttm>      <dbl>       <dbl>
 1 2017-12-20 01:06:00 0.63328803 0.357378011
 2 2017-12-20 01:06:01 0.09597444 0.150583962
 3 2017-12-20 01:06:02 0.23601820 0.974341599
 4 2017-12-20 01:06:03 0.71832656 0.092265867
 5 2017-12-20 01:06:04 0.32471587 0.391190310
 6 2017-12-20 01:06:05 0.76378711 0.534765217
 7 2017-12-20 01:06:06 0.92463265 0.694693458
 8 2017-12-20 01:06:07 0.74026638 0.006054806
 9 2017-12-20 01:06:08 0.77064030 0.911641146
10 2017-12-20 01:06:09 0.87130949 0.740816479
# ... with 831 more rows

Changing the periodicity of the data is as easy as one command:

as_period(df, 5~M)

This aggregates the data to 5-minute intervals (by default tibbletime picks the first observation of every period, not an average or sum):

# A time tibble: 3 x 3
# Index: date
                 date      open     close
*              <dttm>     <dbl>     <dbl>
1 2017-12-20 01:06:00 0.6332880 0.3573780
2 2017-12-20 01:11:00 0.9235639 0.7043025
3 2017-12-20 01:16:00 0.6955685 0.1641798
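If you want an average (or a full OHLC summary) per period rather than the first observation, you can combine this with the collapse_index() idiom from the earlier answer; a minimal sketch, assuming tibbletime 0.1.0:

library(dplyr)
# Collapse the index into 5-minute periods, then average within each period.
df %>%
  mutate(date = collapse_index(date, "5 min")) %>%
  group_by(date) %>%
  summarise(open = mean(open), close = mean(close))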

Check out this awesome vignette for more details.
