Can someone help explain why my MLP keeps on getting a perfect classification report?
I am using sklearn's train_test_split and MLPClassifier for human activity recognition. Below is my dataset in a pandas df:
              a_x        a_y        a_z        g_x        g_y        g_z  activity
0        3.058150   5.524902  -7.415221   0.001280  -0.022299  -0.009420       sit
1        3.065333   5.524902  -7.422403  -0.003514  -0.023764  -0.007289       sit
2        3.065333   5.524902  -7.422403  -0.003514  -0.023764  -0.007289       sit
3        3.064734   5.534479  -7.406840  -0.016830  -0.025628  -0.003294       sit
4        3.074910   5.548246  -7.408038  -0.023488  -0.025495  -0.001963       sit
...           ...        ...        ...        ...        ...        ...       ...
246886   8.102990  -1.226492  -4.559391  -0.511287   0.081455   0.109515       run
246887   8.120349  -1.218711  -4.595306  -0.516480   0.089179   0.110047       run
246888   8.126933  -1.209732  -4.619848  -0.521940   0.096636   0.109382       run
246889   8.140102  -1.199556  -4.622840  -0.526467   0.102761   0.108183       run
246890   8.142496  -1.199556  -4.648580  -0.530728   0.109818   0.108050       run
1469469 rows × 7 columns
I am using the 6 numerical columns (x, y, z from the accelerometer and the gyroscope) to predict activity (run, sit, walk). My code looks like:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# HAR is the pandas DataFrame shown above.
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam',
                    learning_rate='adaptive', early_stopping=True, learning_rate_init=.001)
X = HAR.drop(columns='activity').to_numpy()
y = HAR['activity'].to_numpy()
# Train on 10% of the rows; the remaining 90% become the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10)
mlp.fit(X_train, y_train)
predictions_train = mlp.predict(X_train)
predictions_test = mlp.predict(X_test)
print("Fitting of train data for size (10,): \n", classification_report(y_train, predictions_train))
print("Fitting of test data for size (10,): \n", classification_report(y_test, predictions_test))
Output is:
Fitting of train data for size (10,):
               precision    recall  f1-score   support

         run       1.00      1.00      1.00     49265
         sit       1.00      1.00      1.00     49120
        walk       1.00      1.00      1.00     48561

    accuracy                           1.00    146946
   macro avg       1.00      1.00      1.00    146946
weighted avg       1.00      1.00      1.00    146946

Fitting of test data for size (10,):
               precision    recall  f1-score   support

         run       1.00      1.00      1.00    441437
         sit       1.00      1.00      1.00    442540
        walk       1.00      1.00      1.00    438546

    accuracy                           1.00   1322523
   macro avg       1.00      1.00      1.00   1322523
weighted avg       1.00      1.00      1.00   1322523
I am relatively new to ML, but I think I understand the concept of overfitting, and I imagine that is what is happening here. What I don't understand is how the model can be overfit when it is trained on only 10% of the dataset. Also, presumably the classification report should always be perfect for the X_train data, since that is what the model is trained on, correct?
No matter what I do, it always produces a perfect classification_report for the X_test data, however little data I train it on (in this case 0.10, but I've also tried 0.25, 0.5, 0.33, etc.). I even removed the gyroscope data and trained only on the accelerometer data, and it still gave a perfect 1.00 for precision, recall, and F1.
When I arbitrarily slice the original dataset in half and use the resulting arrays as train and test data, the predictions for X_test are not perfect. But every time I use sklearn's train_test_split, I get a perfect classification report, so I assume I am doing something wrong with how I am using train_test_split?
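For readers comparing the two approaches, here is a minimal sketch of how plain slicing differs from train_test_split. It uses a tiny toy array ordered by class, which is an assumption for illustration and not the HAR data. The one certain difference is that train_test_split shuffles rows by default, while slicing keeps the original row order; passing shuffle=False makes train_test_split behave like the slice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data ordered by class (an assumption for illustration, NOT the HAR data).
X = np.arange(12).reshape(-1, 1)
y = np.array(["sit"] * 4 + ["walk"] * 4 + ["run"] * 4)

# Plain slicing: first half for training, second half for testing.
half = len(X) // 2
X_train_slice, X_test_slice = X[:half], X[half:]
y_train_slice, y_test_slice = y[:half], y[half:]

# shuffle=False keeps the row order, so this matches the slice above.
X_train_seq, X_test_seq, y_train_seq, y_test_seq = train_test_split(
    X, y, train_size=0.5, shuffle=False
)

# Default behaviour: rows are shuffled before splitting.
X_train_rand, X_test_rand, y_train_rand, y_test_rand = train_test_split(
    X, y, train_size=0.5, random_state=0
)

print("classes in sliced train set:  ", sorted(set(y_train_slice)))
print("classes in sliced test set:   ", sorted(set(y_test_slice)))
print("classes in shuffled train set:", sorted(set(y_train_rand)))
print("classes in shuffled test set: ", sorted(set(y_test_rand)))
```

In this toy case the sliced training half never sees the "run" class at all, so a model trained on it cannot predict that class. Whether something similar happens with the real data depends on how the rows are ordered in HAR.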
(This really should be a comment, but I don't have the reputation to comment yet.)
It's quite hard to say without access to the data to try things out.
I wonder whether the classes in the data are so clearly separated that the classifier has no trouble distinguishing them. (It seems so just from the values you printed. The distributions are very different and well separated if you plot them. So, to be fair, a NN is overkill if we can already tell the activities apart just by plotting.)
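A quick way to check the separation without plotting is to compare per-class means and standard deviations. The sketch below assumes a synthetic stand-in frame: HAR_demo and its class centres are made up for illustration (the "walk" centre in particular is invented), so swap in the real HAR frame to get meaningful numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up class centres for the accelerometer axes; NOT the real HAR data.
centres = {
    "sit": [3.0, 5.5, -7.4],
    "walk": [6.0, 2.0, -6.0],
    "run": [8.1, -1.2, -4.6],
}

frames = []
for activity, centre in centres.items():
    values = rng.normal(loc=centre, scale=0.3, size=(500, 3))
    frame = pd.DataFrame(values, columns=["a_x", "a_y", "a_z"])
    frame["activity"] = activity
    frames.append(frame)
HAR_demo = pd.concat(frames, ignore_index=True)

# Per-class mean and standard deviation of every feature.
summary = HAR_demo.groupby("activity").agg(["mean", "std"])
print(summary.round(2))

# The same comparison as overlaid histograms (needs matplotlib):
# HAR_demo.groupby("activity")["a_x"].plot(kind="hist", alpha=0.5, legend=True)
```

If the class means sit many standard deviations apart on at least one feature, as they do in this synthetic example, almost any classifier will separate the classes.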
Have you tried smaller hidden layer sizes, say only 1 or 2 nodes, or some other simpler classifier? E.g. a decision tree with max_depth set to less than 4, or just a logistic regression model.
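The simpler models suggested above could be tried along these lines. This is a sketch on synthetic, deliberately well-separated data generated with make_blobs; the class centres are assumptions for illustration, not the real HAR values, and the random_state and max_iter settings are just there to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up class centres for six features; a stand-in, NOT the real HAR data.
centres = np.array([
    [3.0, 5.5, -7.4, 0.0, 0.0, 0.0],
    [6.0, 2.0, -6.0, 0.0, 0.0, 0.0],
    [8.1, -1.2, -4.6, -0.5, 0.1, 0.1],
])
X, y = make_blobs(n_samples=3000, centers=centres, cluster_std=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.10, random_state=0
)

# Three deliberately small models.
models = {
    "MLP, 2 hidden nodes": MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000, random_state=0),
    "decision tree, max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the 10% training split and score it on the held-out 90%.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

If even models this small score close to 1.00 on held-out data, that points to an easy, well-separated problem rather than to a fault in the MLP.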
Also, did you try stratifying?
train_test_split(X, y, train_size=0.10, stratify=y)
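To see what stratify=y does, the sketch below uses toy, deliberately imbalanced labels (an assumption for illustration, not the HAR data) and compares class proportions in the full set and in the training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy imbalanced labels and random features; NOT the HAR data.
y = np.array(["sit"] * 600 + ["walk"] * 300 + ["run"] * 100)
X = rng.normal(size=(len(y), 6))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.10, stratify=y, random_state=0
)


def proportions(labels):
    """Return the share of each class in a label array."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): float(c) / len(labels) for v, c in zip(values, counts)}


full_props = proportions(y)
train_props = proportions(y_train)
print("full set: ", full_props)
print("train set:", train_props)
```

Stratifying keeps the class shares in the split close to those of the full set. The supports in the reports in the question are already nearly balanced across run, sit, and walk, so stratifying alone may not change those results much.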
My guess is that it's just a very simple dataset, so the classifier does very well because the class separations are so clear. So it has nothing to do with overfitting.