Keras prediction after the test values

Question

I am currently trying to build a neural network to predict a time series, and my question is whether it is possible to predict further than just the test dataset. For example, I have a dataset of about 3000 values, of which I keep 90% for training and 10% for testing. When I compare the predictions with the actual test values, they match, but is it possible, for instance, to ask the program to predict the next 500 values (i.e. from 3001 to 3500)?

Here is a snippet of the code I use.

import csv
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from keras.layers import Dense, Activation, Dropout, LSTM, GRU
from keras.models import Sequential
from keras import optimizers
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.kernel_ridge import KernelRidge
import time
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (-1, 1))


def load_data(datasetname, column, seq_len, normalise_window):
    # A support function to help prepare datasets for an RNN/LSTM/GRU
    data = datasetname.loc[:,column]

    sequence_length = seq_len + 1
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])

    result = np.array(result)
    # result is already a 2-D array (one window per row), so no reshape is needed here
    training_set_scaled = sc.fit_transform(result)

    print (result)
    #Last 10% is used for validation test, first 90% for training
    row = round(0.9 * training_set_scaled.shape[0])
    train = training_set_scaled[:int(row), :]
    #np.random.shuffle(train)
    x_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = training_set_scaled[int(row):, :-1]
    y_test = training_set_scaled[int(row):, -1]
    print ("shape train", x_train.shape)
    print ("shape test", X_test.shape)
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))  

    return [x_train, X_test, y_train, y_test]


def build_model():
    model = Sequential()
    layers = {'input': 100, 'hidden1': 150, 'hidden2': 256, 'hidden3': 100, 'output': 10}

    model.add(LSTM(
            50, 
            return_sequences=True, 

            input_shape=(200,1)
            ))
    model.add(Dropout(0.2))

    model.add(LSTM(
            layers['hidden2'],
            return_sequences=True,
           ))
    model.add(Dropout(0.2))

    model.add(LSTM(
            layers['hidden3'],
            return_sequences=False,
            ))
    model.add(Dropout(0.2))

    model.add(Activation("linear"))

    model.add(Dense(
            units=layers['output']))


    start = time.time()
    model.compile(loss="mean_squared_error", optimizer="adam")
    print ("Compilation Time : ", time.time() - start)
    return model

dataset = pd.read_csv(
    'data.csv')
X_train, X_test, y_train, y_test = load_data(dataset, 'mean anomaly', 200, False)
model = build_model()
print ("train",X_train)
print ("test",X_test)

model.fit(X_train, y_train, batch_size=256, epochs=1,  validation_split=0.05)
predictions =  model.predict(X_test)
predictions = np.reshape(predictions, (predictions.size,))
plt.figure(1)
plt.subplot(311)
plt.title("Actual Test Signal w/Anomalies & noise")
plt.plot(y_test)
plt.subplot(312)
plt.title("predicted signal")
plt.plot(predictions, 'g')
plt.subplot(313)
plt.title("training signal")
plt.plot(y_train, 'b')
plt.plot(y_test, 'y')
plt.legend(['train', 'test'])
plt.show()

I have read that I should either increase the output dimension of the Dense layer to get more than one predicted value, or increase the size of my window in the load_data function. Is either of these the right approach?

Here is the result. The yellow plot is supposed to come after the blue one and represents my input test data; the first plot is a zoom on this data and the second one is the prediction.

Answer 1

If you want to predict the output value of your series at t+x based on data at time t, the data you feed to the network should already have this format.

Time series data formatting:

If you have 3000 data points and want to predict the output values for the next 500 "virtual" points, you should offset the output values by that amount. For example:

In your dataset, the 500th data point corresponds to the 500th output value. If you want to predict "future" values, the 500th data point should instead be paired with the 1000th output value. You can do this in pandas with the shift function. Be aware that you will lose the last 500 data points by doing so, as they will no longer have an output value.

Then, when you predict on data point x_i, you will get the output value y_(i+500). You should be able to find basic guides to time series forecasting on sites like machinelearningmastery.
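As a rough illustration of the offset described above, here is a minimal sketch using pandas' `shift`. The column names `value` and `target`, the toy 10-point series, and the horizon of 3 (standing in for 500) are all assumptions made for the example, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy series standing in for the 3000-point dataset.
df = pd.DataFrame({"value": range(10)})

horizon = 3  # stands in for the 500-step offset discussed above

# shift(-horizon) pulls future values up, so row i holds value[i + horizon].
df["target"] = df["value"].shift(-horizon)

# The last `horizon` rows have no future value (NaN) and are dropped.
supervised = df.dropna().astype({"target": int})
print(supervised)
```

Each remaining row pairs an input with the value observed `horizon` steps later, which is the pairing the model would be trained on.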

Good practice for model evaluation:

If you want to better evaluate the quality of your model, first find metrics that suit your problem and try increasing the test set percentage. While plots are a good way to visualise results, they can be deceiving, so try combining them with some metrics. (Be careful with Mean Squared Error: with values in the range [-1; 1] it can give you a biased score, because the square of an error whose magnitude is below 1 is always smaller than the absolute error. Try Mean Absolute Error instead.)
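To make the point about squared errors concrete, here is a small sketch that computes both metrics with NumPy. The `y_true` and `y_pred` arrays are invented values in the scaled range, used only for illustration.

```python
import numpy as np

# Hypothetical scaled targets and predictions, all within [-1, 1].
y_true = np.array([0.10, -0.40, 0.80, 0.00])
y_pred = np.array([0.30, -0.10, 0.30, 0.20])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # Mean Absolute Error, same units as the data
mse = np.mean(errors ** 2)      # Mean Squared Error, in squared units
rmse = np.sqrt(mse)             # square root brings MSE back to the data's units

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```

Because every error here has a magnitude below 1, the MSE comes out smaller than the MAE, so reading it on its own can make the fit look better than it is.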

Data leakage when scaling data:

While scaling data is usually a good thing, you need to be careful when doing so. You committed something called a data leak: you applied scaling to the whole dataset before splitting it into training and test sets. It is worth reading further about this kind of data leak.
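One way to avoid the leak is to split first and fit the scaler on the training portion only. The sketch below assumes the raw series is scaled before windowing (the question's code scales the window matrix instead) and uses a made-up 100-point series.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 1-D series; the last 10% is held out as the test set.
series = np.arange(100, dtype=float).reshape(-1, 1)
split = int(round(0.9 * len(series)))
train, test = series[:split], series[split:]

scaler = MinMaxScaler(feature_range=(-1, 1))
train_scaled = scaler.fit_transform(train)  # min/max are learned from training data only
test_scaled = scaler.transform(test)        # the same min/max are reused, never refitted

print(train_scaled.min(), train_scaled.max(), test_scaled.max())
```

Note that the scaled test values can fall outside [-1, 1] when the test data exceeds the training range. That is expected, since the scaler has genuinely not seen those values.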

Update

I think I misunderstood your problem.

If you want to "predict further than just the test dataset", you will need some unseen/new data to make more predictions. The test set is only meant for evaluating the performance of the learning phase.

Now, if you want to predict further than just the next step (this won't let you "predict further than just the test dataset", because of the way you have to change your dataset, see below): your model, as it is built, will only ever predict the next step.
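A commonly used workaround, which this answer does not cover, is recursive forecasting: each one-step prediction is appended to the input window and used to predict the following step. The sketch below assumes a hypothetical `predict_next` callable wrapping the trained model. For a Keras model with a single output unit this might look like `lambda w: model.predict(w.reshape(1, -1, 1))[0, 0]`, but that wrapper is an assumption, and a toy predictor is used here so the example runs on its own.

```python
import numpy as np

def recursive_forecast(predict_next, last_window, n_steps):
    """Roll a one-step predictor forward n_steps times.

    predict_next: hypothetical callable taking a 1-D window and returning a number.
    last_window:  the most recent seq_len observed (scaled) values.
    """
    window = np.asarray(last_window, dtype=float).copy()
    forecast = []
    for _ in range(n_steps):
        next_value = float(predict_next(window))
        forecast.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = np.append(window[1:], next_value)
    return np.array(forecast)

# Stand-in predictor for illustration only: it continues a linear trend.
def toy_predictor(window):
    return window[-1] + (window[-1] - window[-2])

future = recursive_forecast(toy_predictor, [0.0, 1.0, 2.0, 3.0, 4.0], n_steps=3)
print(future)
```

Keep in mind that prediction errors feed back into the inputs, so the quality of such a forecast usually degrades as the number of steps grows.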

In your example you feed the algorithm series of length 'seq_len' and give it as output the value right after the end of each series. If you want your algorithm to learn to predict more than one step into the future, your y_train must hold the value at the corresponding time in the future, for example:

x = [0,1,2,3,4,5,6,7,8,9,10,...]
seq_len = 5
step_to_predict = 5

So to predict not one step into the future but five, your series will have to look like this:

x_serie_1 = [0,1,2,3,4]
y_serie_1 = [9]
x_serie_2 = [1,2,3,4,5]
y_serie_2 = [10]

This is a way to get your model to learn to make predictions further into the future than just the next step.
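The windowing above can be sketched as a small helper. The function name `make_windows` is hypothetical; `seq_len` and `step_to_predict` follow the names used in the example.

```python
import numpy as np

def make_windows(series, seq_len, step_to_predict):
    """Build (input window, future target) pairs from a 1-D series."""
    series = np.asarray(series)
    X, y = [], []
    # Last start index for which the target still lies inside the series.
    last_start = len(series) - seq_len - step_to_predict
    for start in range(last_start + 1):
        end = start + seq_len  # exclusive end of the input window
        X.append(series[start:end])
        # Target lies step_to_predict steps after the last value of the window.
        y.append(series[end - 1 + step_to_predict])
    return np.array(X), np.array(y)

x = np.arange(11)  # [0, 1, ..., 10]
X, y = make_windows(x, seq_len=5, step_to_predict=5)
print(X)
print(y)
```

With an 11-point series this yields exactly the two pairs listed above. To feed the result to an LSTM, `X` would still need reshaping to `(samples, seq_len, 1)`, as in the question's load_data function.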

Answer 1
0 2019-03-27 16:00:26