
How to use statsmodels' ARMA to predict with exogenous variables?

I am trying to use statsmodels to fit an AR(MA) process with exogenous variables. For that, I generated a realization of an AR(0) process with a delayed exogenous variable, and I am trying to recover what I would expect from it. I am able to fit the process correctly, but I am unable to use the `predict` method.

The following code is an MCVE of what I want to achieve. It is heavily commented so that you can easily follow it. The last statement is an assertion that fails, and I would like to make it pass. I suspect that the culprit is how I am calling `.predict`.

import numpy as np
import statsmodels.tsa.api


def _transform_x(x, lag):
    """
    Converts a set of time series into a matrix of lagged signals.
    For x.shape[0] == 1, it is equivalent to calling `statsmodels.tsa.api.lagmat(x_i, lag)`.

    Each row corresponds to a time `t`; `column_i` is the signal at `t - (i + 1)`.
    It assumes that no past signal => no signal: rows are padded with zeros where no past value exists yet.

    For example, for lag=3, the matrix would be:
    ```
    [0   , 0   , 0] (-> y[0])
    [x[0], 0   , 0] (-> y[1])
    [x[1], x[0], 0] (-> y[2])
    ```

    I.e.
    the parameter fitted to column 1, a1, is the influence of `x[t - 1]` on `y[t]`;
    the parameter fitted to column 2, a2, is the influence of `x[t - 2]` on `y[t]`.
    `x[t]` itself does not appear as a column: it assumes that we only measure `x[t]` when we measure `y[t]`, so it cannot be used to predict `y[t]`.

    For x.shape[0] > 1, it returns the horizontal concatenation of the matrices for each signal.
    """
    for x_i in x:
        assert len(x_i) >= lag
        assert len(x_i.shape) == 1, 'Each of the elements must be a time-series (1D)'
    return np.concatenate(tuple(statsmodels.tsa.api.lagmat(x_i, lag) for x_i in x), axis=1)


# build the realization of the process y[t] = 1*x[t-2] + noise, where x[t] is iid from N(1,1)
t = np.arange(0, 1000, 1)
np.random.seed(1)

# the exogenous variable
x1 = 1 + np.random.normal(size=t.shape)

# this shifts x1 by 2: np.roll wraps the last two elements to the beginning, so we zero the first two entries below
y = np.roll(x1, 2) + np.random.normal(scale=0.01, size=t.shape)
y[0] = y[1] = 0

x = np.array([x1])  # x.shape[0] => each exogenous variable; x.shape[1] => each time point

# fit it with AR(2) + exogenous(2)
lag = 2

result = statsmodels.tsa.api.ARMA(y, (lag, 0), exog=_transform_x(x, lag)).fit(disp=False)

# this gives the expected. Specifically, `x2 = 0.9952` and all others are indistinguishable from 0.
# (x2 here means the highest delay, 2).
print(result.summary())

# predict 1 element out-of-sample. Because the process is y[t] = x[0, t - 2] + noise,
# the prediction should be equal to `x[0, -2]`
y_pred = result.predict(len(y), len(y), exog=_transform_x(x[:, -3:], lag))[0]

# this fails!
np.testing.assert_almost_equal(y_pred, x[0, -2], decimal=2)

There are two problems, as far as I can see.

First, `exog=_transform_x(x[:, -3:], lag)` in `predict` suffers from the initial-value problem: the first rows of the lag matrix contain zeros instead of the actual lagged values.

Second, indexing: the prediction for `y[-1]` should use `x[-3]`, i.e. two lags behind. If we want to forecast the next observation, then we need an extended exog `x` array corresponding to the forecast period.
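To make the first problem concrete, here is a pure-NumPy sketch of what `lagmat` does with its default `trim='forward'` (the helper name `lagmat_forward` is mine, not part of statsmodels): on a short slice, the first `lag` rows are zero-padded, so `predict` sees zeros rather than the true lagged values.

```python
import numpy as np


def lagmat_forward(x, maxlag):
    """Pure-NumPy sketch of statsmodels' lagmat(x, maxlag) with trim='forward':
    column j holds x[t - (j + 1)], left-padded with zeros where no lag exists."""
    n = len(x)
    out = np.zeros((n, maxlag))
    for j in range(1, maxlag + 1):
        out[j:, j - 1] = x[: n - j]
    return out


x_tail = np.array([10.0, 20.0, 30.0])  # pretend this is x[:, -3:]
print(lagmat_forward(x_tail, 2))
# rows: [0, 0], [10, 0], [20, 10] -- the leading zeros are the
# initial-value problem: predict sees zeros, not the true lags of x
```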

If I change this, then the assert passes for me for y[-1]:

>>> y_pred = result.predict(len(y)-1, len(y)-1, exog=_transform_x(x[:, -10:], lag)[-1])
>>> y_pred
array([ 0.9308579])
>>> result.fittedvalues[-1]
0.93085789893991366
>>> x[0, -3]
0.93037546054487086
>>> np.testing.assert_almost_equal(y_pred, x[0, -3], decimal=2)

The above is for predicting the last observation. To forecast the first out-of-sample observation, we need the last and second-to-last `x`, which cannot be obtained through the `_transform_x` function. For the example, I just provide them in a list.

>>> y_pred = result.predict(len(y), len(y), exog=[[x[0, -1], x[0, -2]]])
>>> y_pred
array([ 1.35420494])
>>> x[0, -2]
1.3538704268828403
>>> np.testing.assert_almost_equal(y_pred, x[0, -2], decimal=2)

More generally, to forecast over a longer horizon, we need an array of future explanatory variables:

>>> xx = np.concatenate((x, np.ones((x.shape[0], 10))), axis=1)
>>> result.predict(len(y), len(y)+9, exog=_transform_x(xx[:, -(10+lag):], lag)[-10:])
array([ 1.35420494,  0.81332158,  1.00030139,  1.00030334,  1.000303  ,
        1.00030299,  1.00030299,  1.00030299,  1.00030299,  1.00030299])

I have chosen the indexing so that the exog for predict contains the last two observations in the first row.

>>> _transform_x(xx[:, -(10+lag):], lag)[lag:]
array([[ 0.81304498,  1.35387043],
       [ 1.        ,  0.81304498],
       [ 1.        ,  1.        ],
       ..., 
       [ 1.        ,  1.        ],
       [ 1.        ,  1.        ],
       [ 1.        ,  1.        ]])
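The indexing above can be wrapped in a small helper. This is a hypothetical sketch (the name `forecast_exog` and its signature are mine, not from statsmodels): given the historical and assumed future values of one exogenous series, it builds the exog matrix for `predict` over the forecast horizon, with the newest lag in the first column as above.

```python
import numpy as np


def forecast_exog(x_hist, x_future, lag):
    """Build the exog matrix for predict() over the forecast horizon:
    row h holds [x[t-1], x[t-2], ...] as seen at forecast step h,
    newest lag first, matching lagmat's column order."""
    full = np.concatenate([x_hist, x_future])
    rows = []
    for step in range(len(x_future)):
        t = len(x_hist) + step  # index of the point being forecast
        rows.append([full[t - j] for j in range(1, lag + 1)])
    return np.array(rows)


x_hist = np.array([1.0, 2.0, 3.0])
x_future = np.array([4.0, 5.0])
print(forecast_exog(x_hist, x_future, 2))
# step 0 forecasts t=3 and needs [x[2], x[1]] = [3, 2]
# step 1 forecasts t=4 and needs [x[3], x[2]] = [4, 3]
```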
