
How to split the training data and test data for LSTM for time series prediction in Tensorflow

I recently learned about LSTMs for time series prediction from https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb

In his tutorial, he says: "Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data."

import numpy as np

def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """

    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        yield (x_batch, y_batch)
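
For context, this is roughly how the generator would be consumed (it assumes the tutorial's variables num_x_signals, num_y_signals, num_train, x_train_scaled and y_train_scaled are already defined; the batch size and sequence length below are just illustrative):

generator = batch_generator(batch_size=256, sequence_length=1344)
x_batch, y_batch = next(generator)
print(x_batch.shape)   # (256, 1344, num_x_signals)
print(y_batch.shape)   # (256, 1344, num_y_signals)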

He creates several batches of sub-sequences for training.

My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?

My motivation for this question is that, for time series prediction, we want to train on the past and predict the future. Therefore, is it legitimate to shuffle the training samples?

In the tutorial, the author picks a contiguous slice of samples, such as

x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]

Can we pick x_batch and y_batch that are not contiguous? For example, could x_batch[0] be picked at 10:00 am and x_batch[1] at 9:00 am on the same day?

In summary, the two questions are:

(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?

(2) When we train an LSTM, do we need to consider the influence of time order? What parameters does the LSTM learn?

Thanks

(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values from the last 10 hours, or random temperature values from the last 5 years?

Your dataset is a long sequence of values at a 1-hour interval. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 18:00:00 as input. If you shuffle the dataset before generating batches of such sequences, you will train your LSTM to predict based on a sequence of random samples from your whole dataset.
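
Here is a toy illustration of the difference (the numbers are made up; think of series as 24 consecutive hourly observations):

import numpy as np

series = np.arange(24)                 # 24 consecutive hourly observations

# Valid: a contiguous window starting at a random index keeps chronology.
idx = np.random.randint(len(series) - 10)
window = series[idx:idx + 10]          # e.g. [5 6 7 8 9 10 11 12 13 14]

# Invalid as LSTM input: shuffling destroys the order inside the window.
broken = np.random.permutation(series)[:10]   # e.g. [17 3 21 ...]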


(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in Python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

The train/test data must be split in a way that respects the temporal ordering: the model is never trained on data from the future, and it is only tested on data from the future.
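
A minimal sketch of such a split (x_data and y_data are placeholders for the full scaled dataset; the 80/20 ratio is arbitrary):

def chronological_split(x_data, y_data, train_fraction=0.8):
    # The test set lies strictly after the training set in time,
    # so the model is never trained on values from the future.
    num_train = int(len(x_data) * train_fraction)
    return (x_data[:num_train], y_data[:num_train],    # past   -> training
            x_data[num_train:], y_data[num_train:])    # future -> testing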

It depends a lot on the dataset. For example, the weather on a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that uses the previous records as input to the next one) and train in order, as in the sketch below.
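
A minimal Keras sketch of that idea (the layer size, signal counts and sequence length are placeholders, not values from the tutorial):

import tensorflow as tf

num_x_signals, num_y_signals, sequence_length = 20, 3, 100

# stateful=True carries the hidden state from one batch to the next,
# so batches must be fed in chronological order with a fixed batch size.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, stateful=True, return_sequences=True,
                         batch_input_shape=(1, sequence_length, num_x_signals)),
    tf.keras.layers.Dense(num_y_signals),
])
model.compile(optimizer='adam', loss='mse')

# After each full pass over the series, clear the carried state:
model.reset_states()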

However, if your records (or a transformation of them) are independent of each other, but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there can be noticeable differences when using shuffling. In some cases it will improve the robustness of the model; in other cases it will not generalize. Noticing these differences is part of evaluating the model.
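
One way to notice those differences is to train the same architecture twice, once with ordered batches and once with shuffled ones, and compare validation error. Everything below is a hypothetical setup with random data, just to show the mechanics:

import numpy as np
import tensorflow as tf

# 1000 windows of 10 time steps, one input and one output signal.
x = np.random.rand(1000, 10, 1).astype(np.float32)
y = np.random.rand(1000, 1).astype(np.float32)

def make_model():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(16, input_shape=(10, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Note: shuffle=True shuffles whole windows, never the steps inside one,
# and validation_split holds out the last 20% before any shuffling.
ordered = make_model().fit(x, y, epochs=5, shuffle=False, validation_split=0.2)
shuffled = make_model().fit(x, y, epochs=5, shuffle=True, validation_split=0.2)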

In the end, the question is: is the "time series", as given, really a time series (i.e., do the records really depend on their neighbors), or is there some transformation that can break this dependency while preserving the structure of the problem? For that question, there is only one way to get to the answer: explore the dataset.

As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field; however, according to him, he learned it through a lot of experimentation and failure. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget, improve on the best ones, and try again.
