
Loop Auto Arima by Partition

I am collaborating on a project that requires me to use R, which I don't have any experience with to date. I am trying to apply auto.arima to partitions/windows within my dataset, and I don't have the slightest clue how to even begin.

Essentially, I want to train a separate model for each partner_id using the rows where c_id = "none", and then forecast/predict values out to max(date) for each partner_id. The number of months/rows for each partner varies in length. For the example data frame pasted below, partner_id = "1A9" has 12 months/rows with c_id = "none", whereas partner_id = "1B9" has 13 months/rows with c_id = "none". The number of months/rows extending out to max(Date) within each partner_id varies as well. This is tricky, as I assume I need to dynamically determine how many months/rows to train on and how many months/rows to predict for each partner_id.

I've included a sample dataset below.

x <- data.frame("c_id" = c("none","none","none","none","none","none","none","none","none","none","none","none","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","none","none","none","none","none","none","none","none","none","none","none","none","none","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111"),
                "partner_id" = c("1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9"),
                "rev_month" = as.Date(c("2016-01-01","2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01","2016-06-01","2016-07-01","2016-08-01","2016-09-01","2016-10-01","2016-11-01","2016-12-01","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01","2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2017-01-01","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01","2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2019-01-01","2019-02-01","2019-03-01","2019-04-01","2019-05-01","2019-06-01","2019-07-01","2019-08-01","2019-09-01","2019-10-01","2019-11-01","2019-12-01","2020-01-01","2020-02-01","2020-03-01")),
                "rev" = c(101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 124.10, 125.35, 125.45), stringsAsFactors = FALSE)

My apologies for not having any starter code yet, as I am still trying to think about this conceptually while not having much experience with R at all. Ultimately, I'd like to add the columns of predictions and confidence intervals back to my original data frame. I'd be open to any R and/or Python solutions.

My answer is wrong on many levels from a programming point of view concerning R and time series. The main aspects are (there are other issues, but I understand that your concern is making it work asap):

  1. First and foremost, a loop should be avoided - BUT my guess is that a vectorized solution would make it harder for you to understand (see the sketch after this list).

  2. Using arima for a time series that does not have at least two complete cycles (years in this case) is not very promising if you are looking to pick up seasonal patterns.
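
To illustrate point 1: a loop-free structure would split the data by partner, apply one fitting function per piece and bind the results once at the end. Here is a minimal sketch of that idea, with a hypothetical helper fit_and_forecast() standing in for the body of the loop shown further down (this is not the code of the original answer):

# hypothetical, loop-free variant of the approach below
fit_and_forecast <- function(x1) {
  x1_t <- x1[x1$c_id == "none", c("rev_month", "rev")]  # training rows
  x1_f <- x1[x1$c_id != "none", c("rev_month", "rev")]  # rows to forecast
  fc <- forecast::forecast(forecast::auto.arima(x1_t$rev), nrow(x1_f))
  data.frame(c_id = x1$c_id[x1$c_id != "none"],
             partner = unique(x1$partner_id),
             rev_month = as.character(x1_f$rev_month),
             rev = as.double(fc$mean))
}
predictions_vectorized <- do.call(rbind, lapply(split(x, x$partner_id), fit_and_forecast))

This also sidesteps growing a data.frame row by row inside a loop.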

If you are genuinely interested in the topic of time-series prediction in R, then read this book: https://otexts.com/fpp2/

A relevant side problem is your testing data: both partners' series have a repeated date in the first and second position, which does not work for time-series prediction over fixed periods/intervals - I just shifted the first date back to make things work (a quick check for such duplicates is sketched after the data below). Therefore the new training data is this (we do not need stringsAsFactors = FALSE):

 x <- data.frame(c_id = c("none","none","none","none","none","none","none","none","none","none","none","none","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101", "none","none","none","none","none","none","none","none","none","none","none","none","none","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111"), "partner_id" = c("1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9"),
                rev_month = as.Date(c("2015-12-01","2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01","2016-06-01","2016-07-01","2016-08-01", "2016-09-01","2016-10-01","2016-11-01","2016-12-01","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01","2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01", "2016-12-31","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01", "2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2019-01-01","2019-02-01","2019-03-01","2019-04-01","2019-05-01","2019-06-01","2019-07-01","2019-08-01","2019-09-01","2019-10-01","2019-11-01","2019-12-01", "2020-01-01", "2020-02-01", "2020-03-01")),
                rev = c(101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 124.10, 125.35, 125.45))
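
As a side note, a quick base R check such as the one below would have flagged the duplicated dates in the original sample data (on the corrected frame above it returns zero rows):

# rows whose (partner_id, rev_month) combination already occurred earlier
x[duplicated(x[, c("partner_id", "rev_month")]), c("c_id", "partner_id", "rev_month")]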

Now we set up a data.frame to store the predictions - though this is not correct in theory ("never grow a vector"; the split/lapply sketch above avoids it) and there are better solutions, BUT those would make things more complicated and not help the understanding of the implementation:

# empty data.frame to fill in predictions
predictions_df <- data.frame(c_id=character(),
                             partner=character(),
                             rev_month = character(),
                             rev=double())

Now we build a vector of unique partners to loop over:

# unique partners
partners <- unique(x$partner_id)
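
The dynamic part the question worries about (how many rows to train on and how many to forecast per partner) is handled implicitly further down via nrow(); if you want to inspect those counts per partner first, a simple cross-table works:

# number of training ("none") vs. forecast rows for each partner
table(x$partner_id, ifelse(x$c_id == "none", "train", "forecast"))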

Let's load the libraries we need for this exercise:

library(xts)
library(dplyr)
library(forecast)

The main part is the loop itself:

# loop to build predictions and store them
for (i in 1:length(partners)){

  partner <- partners[i] # get specific partner
  x1 <- x[x$partner_id == partner, ] # get data for specific partner
  x1_t <- x1[x1$c_id == "none", c(3,4)] # training data
  x1_f <- x1[x1$c_id != "none", c(3,4)] # forecast data
  c_id <- x1[x1$c_id != "none", 1] # complementary data

  # convert training data to time-series object
  x1_t_ts <- xts(x1_t[,-1], order.by=as.Date(x1_t[,1], "%Y/%m/%d"))
  # run auto arima on the time series
  tm <- forecast::auto.arima(x1_t_ts)
  # forecast the number of future steps (one per row of the to-predict data)
  fc <- forecast::forecast(tm, nrow(x1_f))

  predictions_df <- rbind(predictions_df, data.frame(c_id, partner, rev_month = as.character(x1_f$rev_month), rev = as.double(fc$mean)))

}

Finally, let us have a look at the results:

predictions_df

    c_id partner  rev_month      rev
1  c-100     1A9 2016-12-01 106.5409
2  c-100     1A9 2017-01-01 106.9818
3  c-100     1A9 2017-02-01 107.4227
4  c-100     1A9 2017-03-01 107.8636
5  c-100     1A9 2017-04-01 108.3045
6  c-100     1A9 2017-05-01 108.7455
7  c-100     1A9 2017-06-01 109.1864
8  c-100     1A9 2017-07-01 109.6273
9  c-100     1A9 2017-08-01 110.0682
10 c-100     1A9 2017-09-01 110.5091
11 c-100     1A9 2017-10-01 110.9500
12 c-100     1A9 2017-11-01 111.3909
13 c-101     1A9 2017-12-01 111.8318
14 c-101     1A9 2018-01-01 112.2727
15 c-101     1A9 2018-02-01 112.7136
16 c-101     1A9 2018-03-01 113.1545
17 c-101     1A9 2018-04-01 113.5955
18 c-101     1A9 2018-05-01 114.0364
19 c-101     1A9 2018-06-01 114.4773
20 c-101     1A9 2018-07-01 114.9182
21 c-101     1A9 2018-08-01 115.3591
22 c-101     1A9 2018-09-01 115.8000
23 c-101     1A9 2018-10-01 116.2409
24 c-101     1A9 2018-11-01 116.6818
25 c-101     1A9 2018-12-01 117.1227
26 c-110     1B9 2018-01-01 106.9375
27 c-110     1B9 2018-02-01 107.3750
28 c-110     1B9 2018-03-01 107.8125
29 c-110     1B9 2018-04-01 108.2500
30 c-110     1B9 2018-05-01 108.6875
31 c-110     1B9 2018-06-01 109.1250
32 c-110     1B9 2018-07-01 109.5625
33 c-110     1B9 2018-08-01 110.0000
34 c-110     1B9 2018-09-01 110.4375
35 c-110     1B9 2018-10-01 110.8750
36 c-110     1B9 2018-11-01 111.3125
37 c-110     1B9 2018-12-01 111.7500
38 c-111     1B9 2019-01-01 112.1875
39 c-111     1B9 2019-02-01 112.6250
40 c-111     1B9 2019-03-01 113.0625
41 c-111     1B9 2019-04-01 113.5000
42 c-111     1B9 2019-05-01 113.9375
43 c-111     1B9 2019-06-01 114.3750
44 c-111     1B9 2019-07-01 114.8125
45 c-111     1B9 2019-08-01 115.2500
46 c-111     1B9 2019-09-01 115.6875
47 c-111     1B9 2019-10-01 116.1250
48 c-111     1B9 2019-11-01 116.5625
49 c-111     1B9 2019-12-01 117.0000
50 c-111     1B9 2020-01-01 117.4375
51 c-111     1B9 2020-02-01 117.8750
52 c-111     1B9 2020-03-01 118.3125
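
To fold these point forecasts back into the original data frame, as asked in the question, a join on partner and month is enough. A minimal sketch using the already loaded dplyr (the column name pred_rev is just illustrative):

# attach the point forecasts to the original rows; training rows get NA
x_with_pred <- x %>%
  dplyr::left_join(
    predictions_df %>%
      dplyr::transmute(partner_id = partner,
                       rev_month  = as.Date(rev_month),
                       pred_rev   = rev),
    by = c("partner_id", "rev_month")
  )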

If you would like to get the confidence intervals, etc., please deconstruct the loop (run just the inner part with "i <- 1") and understand what is going on and what the returned values are. Then it should be no issue to use the schema I have supplied to get what you need.
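
As a concrete sketch of that deconstruction (assuming the objects from the loop above; lower/upper are the forecast package's standard 80% and 95% bounds):

# run one iteration by hand and pull out the interval bounds with the point forecasts
i <- 1
partner <- partners[i]
x1   <- x[x$partner_id == partner, ]
x1_t <- x1[x1$c_id == "none", c(3, 4)]   # training data
x1_f <- x1[x1$c_id != "none", c(3, 4)]   # rows to forecast

x1_t_ts <- xts(x1_t[, -1], order.by = as.Date(x1_t[, 1]))
fc <- forecast::forecast(forecast::auto.arima(x1_t_ts), nrow(x1_f))

# fc$mean holds the point forecasts; fc$lower / fc$upper are matrices whose
# columns 1 and 2 are the 80% and 95% bounds respectively
ci_df <- data.frame(partner,
                    rev_month = as.character(x1_f$rev_month),
                    rev   = as.double(fc$mean),
                    lo_80 = as.double(fc$lower[, 1]),
                    hi_80 = as.double(fc$upper[, 1]),
                    lo_95 = as.double(fc$lower[, 2]),
                    hi_95 = as.double(fc$upper[, 2]))
head(ci_df)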
