R: Decrease frequency of time series data by aggregating values in OHLC series
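I have a high-frequency dataset of foreign exchange rates at millisecond resolution that I would like to convert in R into lower-frequency, regular time series data, for example an OHLC (open, high, low, close) series per minute or per 5 minutes. The original dataset has four columns: one for the exchange rate (the currency pair), one for the timestamp, which includes both date and time, and one column each for the bid and the ask price. The data was imported from a .csv file.
head(GBPUSD) and tail(GBPUSD) return the following: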
# A tibble: 6 x 4
X1 X2 X3 X4
<chr> <dttm> <dbl> <dbl>
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760
3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759
4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759
5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759
6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759
# A tibble: 6 x 4
X1 X2 X3 X4
<chr> <dttm> <dbl> <dbl>
1 GBP/USD 2017-06-30 20:59:56 1.30093 1.30300
2 GBP/USD 2017-06-30 20:59:56 1.30121 1.30300
3 GBP/USD 2017-06-30 20:59:56 1.30100 1.30390
4 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452
5 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
6 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
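It seems you want to turn each column (bid, ask) into 4 columns (open, high, low, close), grouped by a time interval such as 5 minutes. I appreciate that @dmi3kno showed off some tibbletime features, but I think this may do more of what you want.
Note that this will change in the next release of tibbletime, but it currently works under 0.0.2.
For each 5-minute period, it takes the open/high/low/close of the bid and the ask column separately.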
library(tibbletime)
library(dplyr)
df <- create_series("2017-12-20 00:00:00" ~ "2017-12-20 01:00:00", "sec") %>%
  mutate(bid = runif(nrow(.)),
         ask = bid + .0001)
df
#> # A time tibble: 3,601 x 3
#> # Index: date
#> date bid ask
#> * <dttm> <dbl> <dbl>
#> 1 2017-12-20 00:00:00 0.208 0.208
#> 2 2017-12-20 00:00:01 0.0629 0.0630
#> 3 2017-12-20 00:00:02 0.505 0.505
#> 4 2017-12-20 00:00:03 0.0841 0.0842
#> 5 2017-12-20 00:00:04 0.986 0.987
#> 6 2017-12-20 00:00:05 0.225 0.225
#> 7 2017-12-20 00:00:06 0.536 0.536
#> 8 2017-12-20 00:00:07 0.767 0.767
#> 9 2017-12-20 00:00:08 0.994 0.994
#> 10 2017-12-20 00:00:09 0.807 0.808
#> # ... with 3,591 more rows
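# note: funs() is deprecated as of dplyr 0.8.0; this call reflects the dplyr version current when the answer was written
df %>%
  mutate(date = collapse_index(date, "5 min")) %>%
  group_by(date) %>%
  summarise_all(
    .funs = funs(
      open  = dplyr::first(.),
      high  = max(.),
      low   = min(.),
      close = dplyr::last(.)
    )
  )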
#> # A time tibble: 13 x 9
#> # Index: date
#> date bid_o… ask_o… bid_h… ask_h… bid_low ask_low bid_c…
#> * <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2017-12-20 00:04:59 0.208 0.208 1.000 1.000 0.00293 3.03e⁻³ 0.389
#> 2 2017-12-20 00:09:59 0.772 0.772 0.997 0.997 0.000115 2.15e⁻⁴ 0.676
#> 3 2017-12-20 00:14:59 0.457 0.457 0.995 0.996 0.00522 5.32e⁻³ 0.363
#> 4 2017-12-20 00:19:59 0.586 0.586 0.997 0.997 0.00912 9.22e⁻³ 0.0339
#> 5 2017-12-20 00:24:59 0.385 0.385 0.998 0.998 0.0131 1.32e⁻² 0.0907
#> 6 2017-12-20 00:29:59 0.548 0.548 0.996 0.996 0.00126 1.36e⁻³ 0.320
#> 7 2017-12-20 00:34:59 0.240 0.240 0.995 0.995 0.00466 4.76e⁻³ 0.153
#> 8 2017-12-20 00:39:59 0.404 0.405 0.999 0.999 0.000481 5.81e⁻⁴ 0.709
#> 9 2017-12-20 00:44:59 0.468 0.468 0.999 0.999 0.00101 1.11e⁻³ 0.0716
#> 10 2017-12-20 00:49:59 0.580 0.580 0.996 0.996 0.000336 4.36e⁻⁴ 0.395
#> 11 2017-12-20 00:54:59 0.242 0.242 0.999 0.999 0.00111 1.21e⁻³ 0.762
#> 12 2017-12-20 00:59:59 0.474 0.474 0.987 0.987 0.000858 9.58e⁻⁴ 0.335
#> 13 2017-12-20 01:00:00 0.974 0.974 0.974 0.974 0.974 9.74e⁻¹ 0.974
#> # ... with 1 more variable: ask_close <dbl>
Update: This post has been updated to reflect the changes in tibbletime 0.1.0.
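For comparison, the per-period open/high/low/close grouping described in this answer can also be sketched in base R, without tibbletime. This is only a rough illustration: the data frame `ticks`, the helper `ohlc_one` and the column names are hypothetical, and the bars are labelled by the start of each 5-minute period (via `cut()`), not by the last timestamp in the period as in the tibbletime output above.

```r
# Base-R sketch: 5-minute OHLC bars for simulated bid/ask ticks.
# `ticks` and `ohlc_one` are hypothetical names used for illustration only.
set.seed(1)
ticks <- data.frame(
  date = seq(as.POSIXct("2017-12-20 00:00:00", tz = "UTC"),
             as.POSIXct("2017-12-20 01:00:00", tz = "UTC"), by = "sec")
)
ticks$bid <- runif(nrow(ticks))
ticks$ask <- ticks$bid + 0.0001

# Assign every row to a 5-minute bucket (labelled by the bucket start)
ticks$bucket <- cut(ticks$date, breaks = "5 min")

# Open/high/low/close of one numeric column within each bucket
ohlc_one <- function(x, bucket) {
  groups <- split(x, bucket)
  data.frame(
    bucket = names(groups),
    open   = vapply(groups, function(v) v[1], numeric(1)),
    high   = vapply(groups, max, numeric(1)),
    low    = vapply(groups, min, numeric(1)),
    close  = vapply(groups, function(v) v[length(v)], numeric(1)),
    row.names = NULL
  )
}

bid_ohlc <- ohlc_one(ticks$bid, ticks$bucket)
ask_ohlc <- ohlc_one(ticks$ask, ticks$bucket)
head(bid_ohlc, 3)
```

The rows must already be in time order for the first and last value of each bucket to be the true open and close.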
For the tutorial purposes below, I made a few changes to the OP's original dataset:
df <- data.frame(
X1=c("GBP/USD"),
X2=c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02", "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54", "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"),
X3=c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757, 1.30093, 1.30121, 1.30100, 1.30146, 1.30145,1.30145),
X4=c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760,1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447),
stringsAsFactors=FALSE)
df
X1 X2 X3 X4
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760
3 GBP/USD 2017-06-01 00:00:01 1.28754 1.28759
4 GBP/USD 2017-06-01 00:00:01 1.28753 1.28758
5 GBP/USD 2017-06-01 00:00:01 1.28752 1.28755
6 GBP/USD 2017-06-01 00:00:02 1.28757 1.28760
7 GBP/USD 2017-06-30 20:59:52 1.30093 1.30300
8 GBP/USD 2017-06-30 20:59:54 1.30121 1.30300
9 GBP/USD 2017-06-30 20:59:54 1.30100 1.30390
10 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452
11 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
12 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
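Now, in the low-frequency data, observations with identical timestamps are grouped together. So we have to find the indices that correspond to the beginning and the end of each unique group: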
indices <- seq_along(df[,2])[!(duplicated(df[,2]))] # 1 3 6 7 8 10; the beginnings of groups (observations)
indices - 1 # 0 2 5 6 7 9; for finding the endings of groups
numberoflowfreq <- length(indices) # 6: number of groupings (obs.) for Low Freq data
Understanding the pattern by writing it out explicitly:
mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean(df[indices[2]:((indices -1)[3]),3]) # from 3 to 5
mean(df[indices[3]:((indices -1)[4]),3]) # from 6 to 6
mean(df[indices[4]:((indices -1)[5]),3]) # from 7 to 7
mean(df[indices[5]:((indices -1)[6]),3]) # from 8 to 9
mean(df[indices[6]:nrow(df),3]) # from 10 to 12
Simplifying the pattern:
mean3rdColumn_1st <- mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean3rdColumn_Between <- sapply(2:(numberoflowfreq-1), function(i) mean(df[indices[i]:((indices -1)[i+1]),3]) )
mean3rdColumn_Last <- mean(df[indices[6]:nrow(df),3]) # from 10 to 12
# 3rd column in low frequency data:
c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last)
Similarly for the 4th column:
mean4thColumn_1st <- mean(df[1:((indices -1)[2]),4]) # from 1 to 2
mean4thColumn_Between <- sapply(2:(numberoflowfreq-1), function(i) mean(df[indices[i]:((indices -1)[i+1]),4]) )
mean4thColumn_Last <- mean(df[indices[6]:nrow(df),4]) # from 10 to 12
# 4th column in low frequency data:
c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last)
Putting it all together:
LowFrqData <- data.frame(X1=c("GBP/USD"), X2=df[indices,2], X3=c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last), x4=c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last), stringsAsFactors=FALSE)
LowFrqData
X1 X2 X3 x4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487
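As a cross-check, the same per-timestamp means can be obtained in one step with base R's `aggregate()`. This is a sketch of an alternative route, not the approach used in this answer; it rebuilds the modified 12-row dataset so that it runs on its own, and the result name `means` is arbitrary.

```r
# Rebuild the modified 12-row dataset from above
df <- data.frame(
  X1 = "GBP/USD",
  X2 = c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01",
         "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02",
         "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54",
         "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"),
  X3 = c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757,
         1.30093, 1.30121, 1.30100, 1.30146, 1.30145, 1.30145),
  X4 = c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760,
         1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447),
  stringsAsFactors = FALSE
)

# Mean bid (X3) and ask (X4) for every unique pair/timestamp combination
means <- aggregate(cbind(X3, X4) ~ X1 + X2, data = df, FUN = mean)
means
```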
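Now column X2 holds unique timestamps (one per second in this example), and X3 and X4 are formed from the means of the corresponding cells.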
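Also note: there may not be a value for every minute in the range. In that case, the gaps can be filled with NA. On the other hand, in such cases one may choose to ignore the effect of the irregularity, since the spacing between observations will be (or is likely to be) the same for many observations, so the series is not that highly irregular. Also consider that transforming the data to equally spaced observations by linear interpolation can introduce a number of significant and hard-to-quantify biases (see Scholes and Williams).
M. Scholes and J. Williams, "Estimating betas from nonsynchronous data", Journal of Financial Economics 5: 309-327, 1977.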
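One way to make such gaps explicit is to merge the observations against a complete time grid, so that missing intervals show up as NA rather than being interpolated. The following is a minimal sketch under assumed data: `obs`, `grid` and `padded` are hypothetical names, and the one-minute spacing is chosen only for illustration.

```r
# Hypothetical per-minute series with no observation at 00:02
obs <- data.frame(
  time  = as.POSIXct(c("2017-06-01 00:00:00", "2017-06-01 00:01:00",
                       "2017-06-01 00:03:00"), tz = "UTC"),
  price = c(1.2875, 1.2876, 1.2874)
)

# Complete grid of one-minute timestamps spanning the observed range
grid <- data.frame(time = seq(min(obs$time), max(obs$time), by = "1 min"))

# Left join: every grid point is kept, missing observations become NA
padded <- merge(grid, obs, by = "time", all.x = TRUE)
padded
```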
Now, the regular 5-minute series part:
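as.numeric(as.POSIXct("1970-01-01 03:00:00")) # 0 in the author's time zone (UTC+3); "1970-01-01 03:01:00" equals 60.
as.numeric(as.POSIXct("2017-06-01 00:00:00")) # 1496264400 in the same time zone (UTC+3)
# Seconds elapsed since the first observation in the dataset
# (subtracting the first timestamp instead of the hard-coded 1496264400 avoids depending on the local time zone)
PassedSecs <- as.numeric(as.POSIXct(LowFrqData$X2)) - as.numeric(as.POSIXct(LowFrqData$X2[1]))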
LowFrq5minuteRaw <- cbind(LowFrqData, PassedSecs, stringsAsFactors=FALSE)
LowFrq5minuteRaw
X1 X2 X3 x4 PassedSecs
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615 0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573 1
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600 2
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000 2581192
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450 2581194
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487 2581196
5 minutes means 5 * 60 = 300 seconds. So observations that give the same quotient on integer division by 300 fall into the same 5-minute interval.
LowFrq5minuteRaw2 <- cbind(LowFrqData, PassedSecs, QbyDto300 = PassedSecs%/%300, stringsAsFactors=FALSE)
LowFrq5minuteRaw2
X1 X2 X3 x4 PassedSecs QbyDto300
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615 0 0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573 1 0
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600 2 0
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000 2581192 8603
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450 2581194 8603
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487 2581196 8603
indices2 <- seq_along(LowFrq5minuteRaw2[,6])[!(duplicated(LowFrq5minuteRaw2[,6]))] # 1 4; the beginnings of groups
LowFrq5minute <- data.frame(X1=c("GBP/USD"), X2=LowFrq5minuteRaw2[indices2,2], X3=aggregate(LowFrqData[,3] ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2], X4=aggregate(LowFrqData[,4] ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2])
LowFrq5minute
X1 X2 X3 X4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287596
2 GBP/USD 2017-06-30 20:59:52 1.301163 1.303646
X2 holds the timestamp of the first observation that occurs in each 5-minute interval.
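If a nominal interval start is preferred over the timestamp of the first observation, the integer-division idea extends naturally: multiply the quotient back by the interval width and add it to the origin. The sketch below assumes a small hypothetical vector `times` and measures elapsed seconds from its first element; it is an illustration, not part of the original answer.

```r
# Hypothetical timestamps; elapsed seconds are measured from the first one
times <- as.POSIXct(c("2017-06-01 00:00:00", "2017-06-01 00:00:02",
                      "2017-06-01 00:04:59", "2017-06-01 00:05:00",
                      "2017-06-30 20:59:56"), tz = "UTC")
width <- 5 * 60  # interval width in seconds

elapsed <- as.numeric(times) - as.numeric(times[1])
bucket  <- elapsed %/% width                 # same quotient = same 5-minute interval
bucket_start <- times[1] + bucket * width    # nominal start of each interval

data.frame(times, bucket, bucket_start)
```

Grouping on `bucket` (or `bucket_start`) then works with `aggregate()` exactly as in the code above.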
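I think using the aggregate function makes all of this easy. However, depending on the data, you may need to convert the datetime column to character (in case the raw data retains millisecond values). If needed, I suggest using lubridate to convert it back to datetime afterwards.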
GBPUSD$X2 <- as.character(GBPUSD$X2) # optional; only if the code below yields bad results
GBPUSD$X2 <- substr(GBPUSD$X2, 1, 19) # optional; keeps "YYYY-MM-DD HH:MM:SS", i.e. truncates to whole seconds
# get High values for both bid and ask prices:
GBPUSD_H <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=max)
# get Low values for both bid and ask prices:
GBPUSD_L <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=min)
# merge the High and Low values together (base merge() suffices, since aggregate() returns data frames)
GBPUSD_NEW <- merge(GBPUSD_H, GBPUSD_L, by=c("X1", "X2"), suffixes=c(".HIGH", ".LOW"))
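The optional truncation step can be tried in isolation. The sketch below uses a hypothetical vector `ts_ms` with millisecond timestamps, cuts the formatted strings to whole seconds, and converts the keys back with base `as.POSIXct()` (used here in place of lubridate purely to keep the example dependency-free).

```r
# Hypothetical timestamps with millisecond precision
ts_ms <- as.POSIXct(c("2017-06-01 00:00:00.125", "2017-06-01 00:00:00.900",
                      "2017-06-01 00:00:01.050"), tz = "UTC")

# Format with fractional seconds, then keep only "YYYY-MM-DD HH:MM:SS"
key <- substr(format(ts_ms, "%Y-%m-%d %H:%M:%OS3"), 1, 19)
key

# Convert the truncated keys back to date-times
back <- as.POSIXct(key, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
back
```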
To get all of the high, low, open and close values at once:
library(data.table)
GBPUSD <- data.table(GBPUSD, key=c("X1", "X2"))
GBPUSD_NEW <- GBPUSD[, list(X3.HIGH=max(X3), X3.LOW=min(X3), X3.OPEN=X3[1],
                            X3.CLOSE=X3[length(X3)], X4.HIGH=max(X4), X4.LOW=min(X4),
                            X4.OPEN=X4[1], X4.CLOSE=X4[length(X4)]), by=c("X1", "X2")]
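However, for this approach to work, the data must first be sorted so that, within each second, the first value is the open and the last value is the close.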
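The effect of sorting on open and close can be seen in a small sketch. It assumes a tie-breaking column that records arrival order within each second; `ticks`, `sec` and `seq` are hypothetical names, since the original data has no such column.

```r
# Hypothetical ticks arriving out of order
ticks <- data.frame(
  sec   = c("00:00:01", "00:00:00", "00:00:00", "00:00:01"),
  seq   = c(2L, 2L, 1L, 1L),  # assumed arrival order within each second
  price = c(1.2875, 1.2876, 1.2874, 1.2873),
  stringsAsFactors = FALSE
)

# Sort by second, then by arrival order, before taking first/last values
ticks <- ticks[order(ticks$sec, ticks$seq), ]

open_px  <- tapply(ticks$price, ticks$sec, function(v) v[1])
close_px <- tapply(ticks$price, ticks$sec, function(v) v[length(v)])
open_px
close_px
```

Without the `order()` call, the "open" for 00:00:00 would be 1.2876 instead of 1.2874.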
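Now, if you need minutes (or hours) instead of seconds, simply adjust substr accordingly. If you want more customization, such as 15-minute intervals, I suggest adding a helper column. Sample code: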
GBPUSD$MIN <- floor(as.numeric(substr(GBPUSD$X2, 15, 16))/15) * 15 # start minute of the 15-minute bucket: 0, 15, 30 or 45
GBPUSD$X2 <- paste0(substr(GBPUSD$X2, 1, 14), sprintf("%02d", GBPUSD$MIN), ":00") # e.g. "2017-06-01 00:00:00" for 00:00-00:14
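The helper-column idea can be wrapped in a small function so that the bucket width is a parameter. This is a sketch with a hypothetical helper `label_bucket`; it assumes character timestamps in exactly the "YYYY-MM-DD HH:MM:SS" layout and a width `k` (in minutes) that divides 60.

```r
# Hypothetical helper: label each timestamp string with the start of its k-minute bucket.
# Assumes "YYYY-MM-DD HH:MM:SS" strings and that k divides 60.
label_bucket <- function(ts, k = 15) {
  minute <- as.numeric(substr(ts, 15, 16))
  start  <- floor(minute / k) * k
  paste0(substr(ts, 1, 14), sprintf("%02d", start), ":00")
}

ts <- c("2017-06-01 00:07:31", "2017-06-01 00:15:00", "2017-06-01 23:59:59")
labels <- label_bucket(ts, 15)
labels

# The labels are complete timestamps, so they can be parsed back to date-times
parsed <- as.POSIXct(labels, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
```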
Please don't hesitate to ask if this does not meet your requirements.
PS: aggregate runs into problems if there are NA values in the key columns. Deal with them first.
GBPUSD$X2[is.na(GBPUSD$X2)] <- "2017-05-05 00:00:00" #example; you need to be careful to use same class and format for the replacement
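To see why NA keys matter: with the formula interface, `aggregate()` uses `na.action = na.omit` by default, so rows with an NA key are dropped without warning. A minimal sketch on a hypothetical data frame `d` (the placeholder value "unknown" is an arbitrary choice):

```r
# Hypothetical data with one missing key
d <- data.frame(
  key   = c("a", "a", NA, "b"),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# The row with the NA key is silently dropped
res_dropped <- aggregate(value ~ key, data = d, FUN = sum)
res_dropped

# Replacing the NA first keeps that row in the result
d$key[is.na(d$key)] <- "unknown"
res_kept <- aggregate(value ~ key, data = d, FUN = sum)
res_kept
```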
This is a perfect example for trying out the awesome tibbletime package. I will generate my own data to illustrate the problem:
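library(tibbletime)
library(dplyr) # for %>% and mutate()
# unquoted date formula, as written for the tibbletime version current at the time of this answer
df <- tibbletime::create_series(2017-12-20 + 01:06:00 ~ 2017-12-20 + 01:20:00, "sec") %>%
  mutate(open=runif(nrow(.)),
         close=runif(nrow(.)))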
df
This gives roughly 15 minutes of data at one-second resolution:
# A time tibble: 841 x 3
# Index: date
date open close
* <dttm> <dbl> <dbl>
1 2017-12-20 01:06:00 0.63328803 0.357378011
2 2017-12-20 01:06:01 0.09597444 0.150583962
3 2017-12-20 01:06:02 0.23601820 0.974341599
4 2017-12-20 01:06:03 0.71832656 0.092265867
5 2017-12-20 01:06:04 0.32471587 0.391190310
6 2017-12-20 01:06:05 0.76378711 0.534765217
7 2017-12-20 01:06:06 0.92463265 0.694693458
8 2017-12-20 01:06:07 0.74026638 0.006054806
9 2017-12-20 01:06:08 0.77064030 0.911641146
10 2017-12-20 01:06:09 0.87130949 0.740816479
# ... with 831 more rows
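Changing the periodicity of the data is as simple as one command:
as_period(df, 5~M) # formula-style period; tibbletime 0.1.0 and later take a string instead, e.g. as_period(df, "5 min")
This reduces the data to 5-minute intervals (by default it keeps the first observation in each period, rather than a mean or a sum).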
# A time tibble: 3 x 3
# Index: date
date open close
* <dttm> <dbl> <dbl>
1 2017-12-20 01:06:00 0.6332880 0.3573780
2 2017-12-20 01:11:00 0.9235639 0.7043025
3 2017-12-20 01:16:00 0.6955685 0.1641798
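The "first observation per period" behaviour can be mimicked in base R to make the distinction from a mean explicit. This sketch uses a hypothetical data frame `ticks` with periods counted from the first timestamp, which matches the 01:06:00 / 01:11:00 / 01:16:00 alignment shown above; it is not how tibbletime is implemented.

```r
# Hypothetical second-resolution data over the same time span
set.seed(42)
ticks <- data.frame(
  date = seq(as.POSIXct("2017-12-20 01:06:00", tz = "UTC"),
             as.POSIXct("2017-12-20 01:20:00", tz = "UTC"), by = "sec")
)
ticks$open  <- runif(nrow(ticks))
ticks$close <- runif(nrow(ticks))

# 5-minute period index, counted from the first timestamp
period <- (as.numeric(ticks$date) - as.numeric(ticks$date[1])) %/% 300

# Keep the first row of each period (a selection, not an aggregation)
first_per_period <- ticks[!duplicated(period), ]
first_per_period

# For contrast: the mean of `open` within each period
mean_open <- tapply(ticks$open, period, mean)
```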
Have a look at this great vignette for more details.