
How to calculate the optimal number of PCA components (Python)?

I'm performing PCA preprocessing on a dataset of 78 variables. How would I calculate the optimal number of PCA components to keep?

  • My first thought was to start at, for example, 5 components and work my way up, calculating accuracy at each step. However, for obvious reasons this wasn't a time-effective way to do it.

Does anyone have any suggestions/experience? Or even a methodology for calculating the optimal value?

First look at the dataset distribution, and then use explained_variance_ to find the number of components.

    1. Start by projecting your samples onto a 2-D graph.
    • Assume we have a face dataset (Olivetti faces) with 40 people and 10 samples per person, 400 images in total. We will split it into 280 training and 120 test samples.

      from sklearn.datasets import fetch_olivetti_faces
      from sklearn.model_selection import train_test_split

      olivetti = fetch_olivetti_faces()
      x = olivetti.images   # Images (400 x 64 x 64)
      y = olivetti.target   # Labels

      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

      # Flatten each 64x64 image into a 4096-dimensional vector
      x_train = x_train.reshape((x_train.shape[0], x.shape[1] * x.shape[2]))
      x_test = x_test.reshape((x_test.shape[0], x.shape[1] * x.shape[2]))
      x = x.reshape((x.shape[0], x.shape[1] * x.shape[2]))
    • Now we want to see how the samples are distributed. To see this clearly, we will display them on a 2-D graph.

      from sklearn.decomposition import PCA
      from matplotlib.pyplot import figure, get_cmap, colorbar, show

      class_num = 40
      sample_num = 10

      # Project the flattened images onto the first two principal components
      pca = PCA(n_components=2).fit_transform(x)
      idx_range = class_num * sample_num

      fig = figure(figsize=(6, 3), dpi=300)
      ax = fig.add_subplot(1, 1, 1)
      c_map = get_cmap(name='jet', lut=class_num)
      scatter = ax.scatter(pca[:idx_range, 0], pca[:idx_range, 1],
                           c=y[:idx_range], s=10, cmap=c_map)
      ax.set_xlabel("First Principal Component")
      ax.set_ylabel("Second Principal Component")
      ax.set_title("PCA projection of {} people".format(class_num))
      colorbar(mappable=scatter)
      show()
    • [Figure: 2-D PCA projection of the 40 people, colored by identity]

    • We can say that the 40 people, each with 10 samples, are not distinguishable with only 2 principal components.

    • Please remember that we created this graph from the full dataset, not the train or test split.

  • How many principal components do we need to clearly distinguish the data?

    • To answer the above question we will use explained_variance_.

    • From the documentation:

      The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.

      from matplotlib.pyplot import plot, xlabel, ylabel, show

      # Fit PCA with all components and plot the variance explained by each one
      pca2 = PCA().fit(x)
      plot(pca2.explained_variance_, linewidth=2)
      xlabel('Components')
      ylabel('Explained Variances')
      show()
    • [Figure: explained variance of each principal component]

    • From the above graph, we can see that the explained variance flattens out after roughly 100 components, so about 100 components are enough to distinguish the people; the sketch below quantifies this with the cumulative explained-variance ratio.
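
As a complement to reading the curve by eye, here is a minimal, self-contained sketch of that quantification. The 95% target is an arbitrary illustrative threshold, not a rule:

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# Flattened Olivetti faces: 400 samples x 4096 pixels
x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)

# Cumulative fraction of the total variance explained by the first k components
cumulative = np.cumsum(pca2.explained_variance_ratio_)

# Smallest k that explains at least 95% of the variance (threshold chosen for illustration)
k = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% of the variance:", k)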

Simplified code:


from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
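
If you would rather not read the number off the curve at all, scikit-learn's PCA can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance (the 0.95 below is again an arbitrary target):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

x, _ = fetch_olivetti_faces(return_X_y=True)

# With 0 < n_components < 1, PCA keeps the smallest number of components
# whose cumulative explained-variance ratio reaches that fraction
pca95 = PCA(n_components=0.95).fit(x)
print("Components kept for 95% of the variance:", pca95.n_components_)

The reduced data from pca95.transform(x) can then be fed to the downstream classifier, instead of looping over candidate component counts and re-measuring accuracy each time.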
