
LSTM model is giving me 99% R-squared even if my training data set is 5% of the overall set

I'm using an LSTM model to perform time series forecasting. I have a weird issue where my R-squared is basically always 99%, even if my training data set is only 5% of my total data set. When I plot the predicted values against the test data, the two curves look almost identical. How is this even possible?

After normalization, my data looks like this:

                             date         0         1      2      3      4         5         6         7         8         9
0       2019-01-01 00:00:01+00:00  0.000000  0.000000  0.000  1.000  0.000  0.500000  0.079178  0.076970  0.079109  0.077500
1       2019-01-01 00:00:02+00:00  0.000000  0.000000  0.000  1.000  0.000  0.500000  0.079178  0.076970  0.079109  0.077500
2       2019-01-01 00:00:07+00:00  0.000025  0.000103  0.000  0.492  0.508  0.738780  0.079178  0.076970  0.079109  0.077500
3       2019-01-01 00:00:07+00:00  0.000000  0.000002  0.000  1.000  0.000  0.500000  0.079178  0.076970  0.079109  0.077500
4       2019-01-01 00:00:08+00:00  0.000000  0.000000  0.000  0.000  1.000  0.711130  0.079178  0.076970  0.079109  0.077500
...                           ...       ...       ...    ...    ...    ...       ...       ...       ...       ...       ...
116022  2020-07-28 08:39:59+00:00  0.000000  0.000000  0.000  0.844  0.156  0.786466  0.781738  0.782749  0.781928  0.782748
116023  2020-07-28 08:44:57+00:00  0.000000  0.000000  0.000  1.000  0.000  0.500000  0.781738  0.782749  0.781928  0.782748
116024  2020-07-28 08:47:59+00:00  0.000000  0.000000  0.244  0.756  0.000  0.279403  0.781738  0.782749  0.781928  0.782748
116025  2020-07-28 09:15:26+00:00  0.000000  0.000000  0.000  0.735  0.265  0.965187  0.781738  0.782749  0.781928  0.782748
116026  2020-07-28 09:15:40+00:00  0.000000  0.000000  0.000  0.784  0.216  0.755760  0.781738  0.782749  0.781928  0.782748

from sklearn.metrics import r2_score, mean_absolute_error
from keras.models import Sequential
from keras.layers import Dense, LSTM

# x_train, y_train, x_test and y_test are built earlier (not shown here);
# x_train has shape (samples, timesteps, features).

# Single LSTM layer followed by one linear output unit
model = Sequential()
model.add(LSTM(64, input_shape=x_train.shape[1:3], return_sequences=False))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

# One pass over the training data, updating the weights after every sample
history = model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=1)

# Predict on both sets and score the test-set predictions
train_pred = model.predict(x_train)
y_pred = model.predict(x_test)
print('R2 Score: ', r2_score(y_test, y_pred))
print('MAE: ', mean_absolute_error(y_test, y_pred))

Results

R2 Score:  0.9959650143133337
MAE:  0.008859985819425287

Mathematically, the purpose of R-squared is to estimate the fraction of the variance in your dependent variable that is explained by your model's independent features.

The formula is as follows: R-squared = 1 - (SSres / SStot).

Where: SStot is the total sum of squares (the squared differences between the observed values and their mean) and SSres is the residual sum of squares (the squared differences between the observed values and the predictions).
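
The formula above can be written out directly. This is only a minimal sketch in plain Python; the function name r_squared and the four sample values are made up for illustration. For a single output it follows the same definition that sklearn's r2_score uses.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SSres / SStot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # SSres: squared differences between observed values and predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # SStot: squared differences between observed values and their mean
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (not taken from the question's data)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

# SSres = 0.10 and SStot = 5.0, so the result is approximately 0.98
print(r_squared(y_true, y_pred))
```

Predicting the mean of y_true for every record gives SSres = SStot and therefore an R-squared of 0, which is the baseline the metric compares against.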

Because SSres and SStot are both sums over the same n records of your dataset, the number of records you have (the training dataset in your case) can change the R-squared value, but it shouldn't change it dramatically. It is safe to say that R-squared does not directly reflect how much data you have, because the dependence on n largely cancels out when SSres is divided by SStot.
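
A small experiment can make the cancellation concrete. In the sketch below (plain Python, with a hypothetical r_squared helper and made-up values), every record is repeated 50 times, so n grows from 4 to 200. Both sums grow by the same factor and the ratio stays the same. Note that n here is the number of records the metric is computed on.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SSres / SStot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (not taken from the question's data)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

small = r_squared(y_true, y_pred)            # n = 4
large = r_squared(y_true * 50, y_pred * 50)  # n = 200, same records repeated

# The two values agree up to floating-point rounding
print(small, large)
```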

As for the 99% result you are getting from your model: it probably just means that your independent features have high predictive value for your dependent variable. I would check whether any of my X variables has a direct connection to my y variable (for example, whether it is an arithmetic combination that contains y's value). I would also try to get a sense of the standard deviation of every feature I include in my model. A small standard deviation may decrease SSres and therefore lead to a high R-squared.
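
One way to run both checks is to print, for every feature, its correlation with y and its standard deviation. The sketch below is plain Python on synthetic data. The random-walk series, the feature names prev_value and noise, and the 0.99 cut-off are all assumptions made for illustration. The question does not show how x_train was built, so the lagged value here only stands in for the kind of feature that would "contain" y.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def std(xs):
    """Population standard deviation of a sequence."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


random.seed(0)

# Synthetic, slowly drifting series (an assumption standing in for real data)
series = [0.0]
for _ in range(5000):
    series.append(series[-1] + random.gauss(0, 1))

y = series[1:]  # target: the next value of the series
features = {
    "prev_value": series[:-1],              # lagged copy of the target itself
    "noise": [random.random() for _ in y],  # unrelated to the target
}

LEAK_THRESHOLD = 0.99  # arbitrary cut-off chosen for this illustration

for name, col in features.items():
    r = pearson(col, y)
    flag = "  <-- inspect: almost determines y" if abs(r) > LEAK_THRESHOLD else ""
    print(f"{name}: corr={r:+.4f} std={std(col):.4f}{flag}")
```

A feature whose correlation with y is close to 1 deserves a closer look, since a model can reach a very high R-squared simply by passing that feature through.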

Most importantly: R-squared is not accuracy! The two metrics have very little to do with each other mathematically.

