Fit clustering outputs into a machine learning model

Just a machine learning/data science problem.

a) Let's say I have a dataset of 20 features, and I decide to use 3 of them to perform unsupervised clustering. Ideally this produces 3 clusters (A, B and C).

b) Then I add that output (cluster A, B or C) back into my dataset as a new feature (i.e. there are now 21 features in total).

c) I run a regression model to predict a label value with the 21 features.
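For concreteness, steps a) to c) can be sketched as below. This is only an illustration under stated assumptions: the data is synthetic (`make_regression`), the three clustering columns are picked arbitrarily, and `RandomForestRegressor` stands in for whichever regression model is actually used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20-feature dataset (assumption: the real data is not shown)
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# a) cluster on 3 of the 20 features (columns 0-2, an arbitrary choice),
#    fitting KMeans on the training rows only
cluster_cols = [0, 1, 2]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train[:, cluster_cols])

# b) append the cluster id as a 21st column
X_train_21 = np.column_stack([X_train, kmeans.predict(X_train[:, cluster_cols])])
X_test_21 = np.column_stack([X_test, kmeans.predict(X_test[:, cluster_cols])])

# c) fit a regressor on the 21 features and score it (R^2) on the held-out rows
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train_21, y_train)
r2_21 = reg.score(X_test_21, y_test)
print(X_train_21.shape, r2_21)
```

Note that the 21st column is a deterministic function of columns 0 to 2, which is exactly why the redundancy question comes up.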

I wonder whether step b) is redundant (since the features already exist in the earlier dataset) if I use a more powerful model (Random Forest, XGBoost), and how to explain this mathematically.

Any opinions and suggestions would be great!

Aha, nice one! You might think you are using two models, but you are actually combining two models into one, with skip connections. Because it is a single model, there is no way to know for sure beforehand which architecture is best, per the No Free Lunch Theorem. So in practice you have to try it out, and mathematically there is no way to settle it in advance.
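One possible way to "try it out" is to express the combined model as a single scikit-learn pipeline and cross-validate it against a baseline. The sketch below is an assumption-laden illustration, not the setup from the question: it uses the Iris data, default `SVC` settings, and `KMeans.transform` (distances to the three centroids) rather than the hard cluster label. The `FeatureUnion` plays the role of the skip connection, since the raw features bypass the clustering step and are concatenated with its output.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Baseline: the classifier sees only the raw features
baseline = SVC(kernel="rbf")

# "Skip connection": raw features pass through unchanged (FunctionTransformer()
# with no function is the identity) and are concatenated with the
# KMeans.transform output, i.e. the distances to the 3 centroids
augment = FeatureUnion([
    ("raw", FunctionTransformer()),
    ("clusters", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
combined = make_pipeline(augment, SVC(kernel="rbf"))

# KMeans is refit inside every fold, so the held-out fold never influences the clusters
scores_baseline = cross_val_score(baseline, X, y, cv=5)
scores_combined = cross_val_score(combined, X, y, cv=5)
print(scores_baseline.mean(), scores_combined.mean())
```

Which of the two scores comes out higher depends on the dataset and the model, which is the point of the answer above.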

Great idea: just give it a try and see how it goes. As you guessed, this is highly dependent on your dataset and model choice. It is hard to predict how adding this type of feature will behave, as with any other feature engineering. But be careful: in some cases it does not improve performance at all. The test below, on the Iris dataset, shows a case where performance actually decreases:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn import metrics

# load data
iris = load_iris()
X = iris.data[:, :3]  # only keep three out of the four available features to make it more challenging
y = iris.target

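# split train / test (no random seed is set, so the split differs from run to run)
indices = np.random.permutation(len(X))
N_test = 30
X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
# hold out the last N_test shuffled rows; the original post sliced indices[N_test:],
# which yields the 120 rows (support = 120) seen in the reports below and overlaps the training set
X_test, y_test = X[indices[-N_test:]], y[indices[-N_test:]]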

# compute a clustering method (here KMeans) based on available features in X_train
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
new_clustering_feature_train = kmeans.predict(X_train)
new_clustering_feature_test = kmeans.predict(X_test)

# create a new input train/test X with this feature added
X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])

Now let's compare the two models, one trained only on X_train and the other on X_train_with_clustering_feature:

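model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
# classification_report expects (y_true, y_pred); the report below was printed with the
# arguments reversed, so its precision and recall columns are swapped
print(metrics.classification_report(y_test, model1.predict(X_test)))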

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.95      0.97      0.96        38
           2       0.97      0.95      0.96        37

    accuracy                           0.97       120
   macro avg       0.97      0.97      0.97       120
weighted avg       0.98      0.97      0.97       120

And the other model:

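model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
# classification_report expects (y_true, y_pred); the report below was printed with the
# arguments reversed, so its precision and recall columns are swapped
print(metrics.classification_report(y_test, model2.predict(X_test_with_clustering_feature)))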

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.87      0.97      0.92        35
           2       0.97      0.88      0.92        40

    accuracy                           0.95       120
   macro avg       0.95      0.95      0.95       120
weighted avg       0.95      0.95      0.95       120
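One detail worth keeping in mind when reading this result: the cluster id is passed to the SVC as a plain integer column, so the RBF kernel treats cluster 0 as "closer" to cluster 1 than to cluster 2, even though KMeans labels carry no order. A possible variant, which was not part of the test above and is not guaranteed to score better, is to one-hot encode the label. A minimal sketch (KMeans is fitted on all rows here only for brevity; in a real comparison it should be fitted on the training rows alone):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, :3]  # same three features as in the test above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.predict(X)

# One indicator column per cluster instead of a single integer column
one_hot = np.eye(kmeans.n_clusters)[labels]
X_with_one_hot = np.column_stack([X, one_hot])
print(X_with_one_hot.shape)  # (150, 6)
```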
