LSTM model is giving me 99% R-squared even if my training data set is 5% of the overall set

Question

I'm using an LSTM model to perform time series forecasting. I have a weird issue where my R-squared is basically always 99%, even if my training data set is only 5% of my total data set. When I plot the predicted values against the test data, the two curves look almost identical. How is this even possible?

My data looks like this after normalization:

date    0   1   2   3   4   5   6   7   8   9
0   2019-01-01 00:00:01+00:00   0.000000    0.000000    0.000   1.000   0.000   0.500000    0.079178    0.076970    0.079109    0.077500
1   2019-01-01 00:00:02+00:00   0.000000    0.000000    0.000   1.000   0.000   0.500000    0.079178    0.076970    0.079109    0.077500
2   2019-01-01 00:00:07+00:00   0.000025    0.000103    0.000   0.492   0.508   0.738780    0.079178    0.076970    0.079109    0.077500
3   2019-01-01 00:00:07+00:00   0.000000    0.000002    0.000   1.000   0.000   0.500000    0.079178    0.076970    0.079109    0.077500
4   2019-01-01 00:00:08+00:00   0.000000    0.000000    0.000   0.000   1.000   0.711130    0.079178    0.076970    0.079109    0.077500
... ... ... ... ... ... ... ... ... ... ... ...
116022  2020-07-28 08:39:59+00:00   0.000000    0.000000    0.000   0.844   0.156   0.786466    0.781738    0.782749    0.781928    0.782748
116023  2020-07-28 08:44:57+00:00   0.000000    0.000000    0.000   1.000   0.000   0.500000    0.781738    0.782749    0.781928    0.782748
116024  2020-07-28 08:47:59+00:00   0.000000    0.000000    0.244   0.756   0.000   0.279403    0.781738    0.782749    0.781928    0.782748
116025  2020-07-28 09:15:26+00:00   0.000000    0.000000    0.000   0.735   0.265   0.965187    0.781738    0.782749    0.781928    0.782748
116026  2020-07-28 09:15:40+00:00   0.000000    0.000000    0.000   0.784   0.216   0.755760    0.781738    0.782749    0.781928    0.782748

from sklearn.metrics import r2_score, mean_absolute_error
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Single LSTM layer followed by one linear output unit
model = Sequential()
model.add(LSTM(64, input_shape=x_train.shape[1:3], return_sequences=False))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

# x_train, y_train, x_test and y_test are prepared elsewhere (not shown)
history = model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=1)

# Predict on both splits and score the test predictions
train_pred = model.predict(x_train)
y_pred = model.predict(x_test)
print('R2 Score: ', r2_score(y_test, y_pred))
print('MAE: ', mean_absolute_error(y_test, y_pred))

Results

R2 Score:  0.9959650143133337
MAE:  0.008859985819425287

Answer 1

Mathematically, the purpose of R-squared is to estimate the fraction of the variance in your dependent variable that is explained by your model's independent features.

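The formula is: R-squared = 1 - (SSres / SStot).

Where: SSres is the residual sum of squares (the squared differences between the actual and predicted values) and SStot is the total sum of squares (the squared differences between the actual values and their mean).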
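As a small illustration of the formula, here is a sketch that computes R-squared by hand in plain Python. The function name `r_squared` and the four sample values are made up for this example; they are not taken from the question's data.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SSres / SStot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: actual vs. predicted
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: actual vs. mean of actual
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

score = r_squared(y_true, y_pred)
print(score)  # SSres = 0.10, SStot = 5.0, so roughly 0.98
```

Predicting the mean of `y_true` for every record makes SSres equal to SStot, which gives an R-squared of 0. That is the reference point the metric is measured against.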

Because SSres and SStot are both sums over the same n records, the number of records in your dataset can shift R-squared slightly but shouldn't change it dramatically: n cancels out in the division of SSres by SStot. It is safe to say that R-squared, as a metric, does not directly reflect how much data you have.
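To see why the record count drops out, the following sketch scores the same made-up predictions once on 5 records and once on the same records repeated 20 times. The helper `r_squared` and the numbers are hypothetical, chosen only to illustrate the point.

```python
import math


def r_squared(y_true, y_pred):
    """Return 1 - SSres / SStot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, for illustration only
y_true = [0.1, 0.4, 0.35, 0.8, 0.65]
y_pred = [0.15, 0.35, 0.4, 0.7, 0.7]

small = r_squared(y_true, y_pred)            # 5 records
large = r_squared(y_true * 20, y_pred * 20)  # same records repeated: 100 records

# Both sums grow by the same factor, so the ratio is unchanged
print(small, large)
```

Repeating the records multiplies SSres and SStot by the same factor, so the two scores agree up to floating-point rounding.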

As for the 99% result you are getting from your model: it probably just means that your independent features have very high predictive value for your dependent variable. I would check whether any of the X variables has a direct connection to the y variable (for example, an arithmetic combination that contains y's value). I would also try to get a sense of the standard deviation of every feature included in the model. A small standard deviation may decrease SSres and therefore lead to a high R-squared.
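One way to check for such a direct connection is to compare against a persistence baseline, which predicts each value with the value just before it. The sketch below assumes the target is a slowly changing series and that the model's inputs include its recent past. Neither is confirmed in the question, and the series here is synthetic.

```python
import math


def r_squared(y_true, y_pred):
    """Return 1 - SSres / SStot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Synthetic, slowly drifting series standing in for a normalized target column
series = [0.001 * i + 0.0005 * math.sin(i) for i in range(1000)]

# Persistence baseline: no model, just "next value = current value"
y_true = series[1:]
y_pred = series[:-1]

baseline_r2 = r_squared(y_true, y_pred)
print("Persistence baseline R2:", baseline_r2)
```

If a baseline like this already scores close to 1 on your own test set, then a 99% R-squared from the LSTM says little about what the network learned, and the amount of training data would be expected to matter very little.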

Most importantly: R-squared is not accuracy. The two metrics have very little to do with each other mathematically.

solution1
0 2020-08-06 14:04:16
