Classify using components from PCA

I used PCA analysis on my dataset like so:

import pandas as pd
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(scale_x)
principalDf = pd.DataFrame(data=principalComponents, columns = ['PC1', 'PC2', 'PC3'])

and then on visualizing the results with Matplotlib, I can see a division between my two classes like so:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(principalDf['PC1'].values,
           principalDf['PC2'].values,
           principalDf['PC3'].values,
           c=['red' if m == 0 else 'green' for m in y], marker='o')

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')

plt.show()

[PCA 3D plot]

but then when I use a classification model like SVM or Logistic Regression, it is unable to learn this relation:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
lg = LogisticRegression(solver = 'lbfgs')
lg.fit(principalDf.values, y)
lg_p = lg.predict(principalDf.values)
print(classification_report(y, lg_p, target_names=['Failure', 'Success']))
              precision    recall  f1-score   support

     Failure       1.00      0.03      0.06        67
     Success       0.77      1.00      0.87       219

    accuracy                           0.77       286
   macro avg       0.89      0.51      0.46       286
weighted avg       0.82      0.77      0.68       286

What could be the reason for this?

First, you use only three features: PC1, PC2, and PC3. Additional components (PC4 ~ PC6), which are not shown in the graph, may affect the classification result.
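One way to check this (a minimal sketch, assuming `scale_x` is the scaled feature array from the question; `pca_full` is an illustrative name) is to fit PCA without limiting the number of components and inspect the explained variance ratio:

from sklearn.decomposition import PCA

# Fit PCA with all components to see the full variance spectrum
pca_full = PCA()
pca_full.fit(scale_x)

# Per-component share of the total variance
print(pca_full.explained_variance_ratio_)

# Cumulative share captured by the first three components
print(pca_full.explained_variance_ratio_[:3].sum())

If the first three components explain only a small share of the variance, the discarded components may carry much of the class-relevant signal.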

Second, a classifier is sometimes not trained as well as you think. I recommend using a decision tree instead of the classifiers you use, because a tree is an (axis-aligned) piecewise-linear classifier and it may yield the result you expect.
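A minimal sketch of that suggestion, fitting scikit-learn's DecisionTreeClassifier on the same `principalDf` and `y` from the question (the `max_depth` value is an arbitrary choice, not something prescribed by the answer):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Axis-aligned splits on PC1/PC2/PC3; max_depth limits overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(principalDf.values, y)
tree_p = tree.predict(principalDf.values)
print(classification_report(y, tree_p, target_names=['Failure', 'Success']))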

Regardless of whether your results make sense or not, you're doing something fundamentally wrong here, which is training the classifier on the entire dataset and testing on data it has already seen. I've reproduced your problem using the iris dataset, and fitting a logistic regressor yielded good results for me:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_iris()
X = data.data
y = data.target

pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents, columns = ['PC1', 'PC2', 'PC3'])

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(principalDf['PC1'].values, 
           principalDf['PC2'].values, 
           principalDf['PC3'].values, 
           c=[['red', 'green', 'blue'][m] for m in y], marker='o')

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')

plt.show()

[3D scatter plot of the iris principal components]

Now if we try to predict on X_test, we see that the confusion matrix looks quite good in this case, meaning that the overall idea should work well:

X_train, X_test, y_train, y_test = train_test_split(principalDf, y)

lg = LogisticRegression(solver = 'lbfgs')
lg.fit(X_train, y_train)
y_pred = lg.predict(X_test)

confusion_matrix(y_true=y_test, y_pred=y_pred)

array([[ 9,  0,  0],
       [ 0, 12,  1],
       [ 0,  0, 16]], dtype=int64)
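To make this check less dependent on one particular random split, a possible follow-up (a sketch, not part of the original answer; the 5-fold choice is arbitrary) is to score the same model with cross-validation:

from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 folds, each scored on data unseen during fitting
scores = cross_val_score(LogisticRegression(solver='lbfgs'), principalDf, y, cv=5)
print(scores.mean(), scores.std())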
