Can someone help explain why my MLP keeps getting a perfect classification report?

Question

I am using sklearn's train_test_split and MLPClassifier for human activity recognition. Below is my dataset in a pandas df:


a_x a_y a_z g_x g_y g_z activity
0   3.058150    5.524902    -7.415221   0.001280    -0.022299   -0.009420   sit
1   3.065333    5.524902    -7.422403   -0.003514   -0.023764   -0.007289   sit
2   3.065333    5.524902    -7.422403   -0.003514   -0.023764   -0.007289   sit
3   3.064734    5.534479    -7.406840   -0.016830   -0.025628   -0.003294   sit
4   3.074910    5.548246    -7.408038   -0.023488   -0.025495   -0.001963   sit
... ... ... ... ... ... ... ...
246886  8.102990    -1.226492   -4.559391   -0.511287   0.081455    0.109515    run
246887  8.120349    -1.218711   -4.595306   -0.516480   0.089179    0.110047    run
246888  8.126933    -1.209732   -4.619848   -0.521940   0.096636    0.109382    run
246889  8.140102    -1.199556   -4.622840   -0.526467   0.102761    0.108183    run
246890  8.142496    -1.199556   -4.648580   -0.530728   0.109818    0.108050    run
1469469 rows × 7 columns

I am using the 6 numerical columns (x, y, z from the accelerometer and the gyroscope) to predict activity (run, sit, walk). My code looks like this:

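from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# HAR is the pandas DataFrame shown above
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam', learning_rate='adaptive',
                    early_stopping=True, learning_rate_init=.001)

# Features: the six sensor columns; target: the activity label
X = HAR.drop(columns='activity').to_numpy()
y = HAR['activity'].to_numpy()

# Train on 10% of the rows and test on the remaining 90%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10)

mlp.fit(X_train, y_train)
predictions_train = mlp.predict(X_train)
predictions_test = mlp.predict(X_test)

print("Fitting of train data for size (10,): \n", classification_report(y_train, predictions_train))
print("Fitting of test data for size (10,): \n", classification_report(y_test, predictions_test))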

The output is:

Fitting of train data for size (10,): 
               precision    recall  f1-score   support

         run       1.00      1.00      1.00     49265
         sit       1.00      1.00      1.00     49120
        walk       1.00      1.00      1.00     48561

    accuracy                           1.00    146946
   macro avg       1.00      1.00      1.00    146946
weighted avg       1.00      1.00      1.00    146946

Fitting of test data for size (10,): 
               precision    recall  f1-score   support

         run       1.00      1.00      1.00    441437
         sit       1.00      1.00      1.00    442540
        walk       1.00      1.00      1.00    438546

    accuracy                           1.00   1322523
   macro avg       1.00      1.00      1.00   1322523
weighted avg       1.00      1.00      1.00   1322523

I am relatively new to ML, but I think I understand the concept of overfitting, so I imagine that is what is happening here. What I don't understand is how the model can be overfit when it is only trained on 10% of the dataset. Also, presumably the classification report should always be perfect for the X_train data, since that is what the model is trained on, correct?

No matter what I do, it always produces a perfect classification_report for the X_test data, however little data I train it on (in this case 0.10, but I've also tried 0.25, 0.5, 0.33, etc.). I even removed the gyroscope data and trained it only on the accelerometer data, and it still gave a perfect 1 for precision, recall, and F1 on every class.

When I arbitrarily slice the original dataset in half and use the resulting arrays as train and test data, the predictions for X_test are not perfect, but every time I use sklearn's train_test_split it returns a perfect classification report. So I assume I am doing something wrong in how I use train_test_split?

Answer 1

(This really should be a comment, but I don't have the reputation to comment yet.)

It's quite hard to say without having access to the data to try things out.

I wonder whether the classes in the data are so clearly separated that the classifier has no trouble distinguishing them. (It seems so just from the values you printed. If you plot them, the distributions are very different and well separated. So, to be fair, a NN is overkill if we can clearly distinguish the different activities even by visual plotting.)
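One quick way to check the separation without plotting is to compare per-class summary statistics. The snippet below is only a sketch on synthetic data: the `demo` frame and the `centers` values are made up for illustration and are not the asker's HAR data. On the real frame, the same `groupby("activity")` call would apply directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the HAR frame (hypothetical values, NOT the real data):
# each activity gets its own mean for two accelerometer channels.
centers = {"sit": (3.0, 5.5), "walk": (6.0, 1.0), "run": (8.0, -1.2)}
frames = []
for activity, (mean_x, mean_y) in centers.items():
    frames.append(pd.DataFrame({
        "a_x": rng.normal(mean_x, 0.3, 500),
        "a_y": rng.normal(mean_y, 0.3, 500),
        "activity": activity,
    }))
demo = pd.concat(frames, ignore_index=True)

# Per-class mean and standard deviation of every feature.
summary = demo.groupby("activity").agg(["mean", "std"])
print(summary)
```

If the class means differ by many standard deviations on even one feature, almost any classifier will score close to 1.00.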

Have you tried smaller hidden layer sizes, say only 1 or 2 nodes, or some other simpler classifier? E.g. a decision tree with max_depth set to, say, less than 4, or just a logistic regression model.
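A rough sketch of that comparison is below. It runs on synthetic, well-separated data, because the real dataset isn't available here. The `centers` array is an assumption: the "run" and "sit" rows are loosely based on the rows printed in the question, and the "walk" row is invented. With the real data you would reuse your own X and y.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical class centers for the six sensor channels (illustration only).
labels = np.array(["run", "sit", "walk"])
centers = np.array([
    [8.0, -1.2, -4.6, -0.5, 0.10, 0.10],   # run
    [3.0,  5.5, -7.4,  0.0, 0.00, 0.00],   # sit
    [5.5,  2.0, -6.0, -0.2, 0.05, 0.05],   # walk (invented)
])
class_index = rng.integers(0, 3, size=3000)
X = centers[class_index] + rng.normal(0.0, 0.3, size=(3000, 6))
y = labels[class_index]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.10, random_state=0
)

# Two deliberately simple models as baselines.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

If such simple baselines also score near 1.00 on your data, the perfect MLP report says more about the dataset than about the network.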

Also, did you try stratifying: train_test_split(X, y, train_size=0.10, stratify=y)?
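To show what `stratify=y` changes, here is a small sketch on made-up, deliberately imbalanced labels (the 60/30/10 class mix and the helper `class_shares` are assumptions for illustration). It compares the class shares in the training split with and without stratification. The classes in the question look nearly balanced already, so the effect there is probably small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up imbalanced labels: 60% run, 30% sit, 10% walk.
y = np.array(["run"] * 600 + ["sit"] * 300 + ["walk"] * 100)
X = rng.normal(size=(1000, 6))

def class_shares(label_array):
    """Return the fraction of samples per class as a plain dict."""
    values, counts = np.unique(label_array, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).round(3).tolist()))

# train_test_split returns X_train, X_test, y_train, y_test; only y_train is needed.
_, _, y_train_plain, _ = train_test_split(X, y, train_size=0.10, random_state=0)
_, _, y_train_strat, _ = train_test_split(
    X, y, train_size=0.10, random_state=0, stratify=y
)

print("full data:      ", class_shares(y))
print("plain split:    ", class_shares(y_train_plain))
print("stratified split:", class_shares(y_train_strat))
```

The stratified split keeps the training class proportions equal to those of the full dataset, while the plain split only matches them approximately.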

My guess is that it's just a very simple dataset, so the classifier does very well because the class separations are so clear. So it has nothing to do with overfitting.
