I'm performing PCA preprocessing on a dataset of 78 variables. How would I calculate the optimal number of PCA components?
Does anyone have any suggestions or experience? Or even a methodology for determining the optimal number?
First look at the dataset distribution, and then use explained_variance_ to find the number of components.
Assume we have a face dataset (Olivetti faces): 40 people, each with 10 samples, for 400 images overall. We will split them into 280 training and 120 test samples.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split

olivetti = fetch_olivetti_faces()
x = olivetti.images  # Images (400 x 64 x 64)
y = olivetti.target  # Labels (person id, 0-39)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Flatten each 64x64 image into a 4096-dimensional vector
x_train = x_train.reshape((x_train.shape[0], x.shape[1] * x.shape[2]))
x_test = x_test.reshape((x_test.shape[0], x.shape[1] * x.shape[2]))
x = x.reshape((x.shape[0], x.shape[1] * x.shape[2]))
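As a quick sanity check, the shapes should come out as follows (assuming the standard 400-image, 64x64-pixel Olivetti data and the 70/30 split above):

print(x.shape)        # (400, 4096)
print(x_train.shape)  # (280, 4096)
print(x_test.shape)   # (120, 4096)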
Now we want to see how the samples are distributed. To see this clearly, we will project the images onto the first two principal components and display them in a 2-D graph.
from sklearn.decomposition import PCA
from matplotlib.pyplot import figure, get_cmap, colorbar, show

class_num = 40
sample_num = 10

pca = PCA(n_components=2).fit_transform(x)

idx_range = class_num * sample_num
fig = figure(figsize=(6, 3), dpi=300)
ax = fig.add_subplot(1, 1, 1)
c_map = get_cmap(name='jet', lut=class_num)
scatter = ax.scatter(pca[:idx_range, 0], pca[:idx_range, 1],
                     c=y[:idx_range], s=10, cmap=c_map)
ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
ax.set_title("PCA projection of {} people".format(class_num))
colorbar(mappable=scatter)
show()
We can see that the 40 people, each with 10 samples, are not distinguishable with only 2 principal components.
Please remember that we created this graph from the full dataset, not from the train or test split.
How many principal components do we need to clearly distinguish the data?
To answer the above question, we will use explained_variance_.
From the documentation:
The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.
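To see this equivalence concretely, here is a small self-contained check on random data (the array shape is arbitrary): the eigenvalues of the covariance matrix, sorted in descending order, match explained_variance_.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
a = rng.rand(100, 5)  # Hypothetical data: 100 samples, 5 features
pca_demo = PCA().fit(a)

# Eigenvalues of the covariance matrix, sorted in descending order
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(a.T)))[::-1]
print(np.allclose(pca_demo.explained_variance_, eigenvalues))  # True

Now, plotting explained_variance_ for the faces data: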
from matplotlib.pyplot import plot, xlabel, ylabel, show

pca2 = PCA().fit(x)
plot(pca2.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
From the above graph, we can see that the explained variance flattens out after roughly 100 components, so about 100 principal components are enough to distinguish the people.
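Rather than reading the elbow off the plot by eye, you can also pick the smallest number of components that preserves a chosen fraction of the total variance. A minimal sketch building on pca2 above (the 95% threshold is an assumption; adjust it to your needs):

import numpy as np

# Cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(pca2.explained_variance_ratio_)
# Smallest k reaching the (assumed) 95% threshold
n_opt = np.argmax(cumulative >= 0.95) + 1
print(n_opt)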
Simplified code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
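As a shortcut, scikit-learn can do this selection internally: if n_components is a float between 0 and 1, PCA keeps just enough components to explain that fraction of the variance. A small sketch, again with an assumed 95% target:

# Keep just enough components to explain 95% of the variance
pca95 = PCA(n_components=0.95, svd_solver='full').fit(x)
print(pca95.n_components_)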