
How to calculate the optimal number of PCA components (Python)?

I'm performing PCA preprocessing on a dataset of 78 variables. How would I calculate the optimal number of PCA components to keep?

  • My first thought was to start at, for example, 5 components and work my way up, calculating accuracy at each step. However, for obvious reasons this wasn't a time-effective way to do it.

Does anyone have any suggestions/experience? Or even a methodology for calculating the optimal value?

First look at the dataset distribution, and then use explained_variance_ to find the number of components.

    1. Start by projecting your samples onto a 2-D graph.
    • Assume we have a face dataset (Olivetti faces) with 40 people and 10 samples per person, 400 images in total. We will split it into 280 training and 120 test samples.

      from sklearn.datasets import fetch_olivetti_faces
      from sklearn.model_selection import train_test_split

      olivetti = fetch_olivetti_faces()
      x = olivetti.images   # Images (400 x 64 x 64)
      y = olivetti.target   # Labels

      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

      # Flatten each 64x64 image into a 4096-dimensional vector
      x_train = x_train.reshape((x_train.shape[0], x.shape[1] * x.shape[2]))
      x_test = x_test.reshape((x_test.shape[0], x.shape[1] * x.shape[2]))
      x = x.reshape((x.shape[0], x.shape[1] * x.shape[2]))
    • Now we want to see how the samples are distributed. To see this clearly, we will display them on a 2-D graph.

      from sklearn.decomposition import PCA
      from matplotlib.pyplot import figure, get_cmap, colorbar, show

      class_num = 40
      sample_num = 10

      # Project the flattened images onto the first two principal components
      pca = PCA(n_components=2).fit_transform(x)
      idx_range = class_num * sample_num

      fig = figure(figsize=(6, 3), dpi=300)
      ax = fig.add_subplot(1, 1, 1)
      c_map = get_cmap(name='jet', lut=class_num)
      scatter = ax.scatter(pca[:idx_range, 0], pca[:idx_range, 1],
                           c=y[:idx_range], s=10, cmap=c_map)
      ax.set_xlabel("First Principal Component")
      ax.set_ylabel("Second Principal Component")
      ax.set_title("PCA projection of {} people".format(class_num))
      colorbar(mappable=scatter)
      show()
    • [Figure: 2-D PCA projection of the 40 people, colored by identity]

    • We can say that the 40 people, each with 10 samples, are not distinguishable with only 2 principal components.

    • Please remember that we created this graph from the full dataset, not the train or test split.

  • How many principal components do we need to clearly distinguish the data?

    • To answer the above question we will use explained_variance_.

    • From the documentation:

      The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.

      from matplotlib.pyplot import plot, xlabel, ylabel, show

      # Fit PCA with all components and plot the variance explained by each one
      pca2 = PCA().fit(x)
      plot(pca2.explained_variance_, linewidth=2)
      xlabel('Components')
      ylabel('Explained Variances')
      show()
    • [Figure: explained variance of each principal component]

    • From the above graph, we can see that the explained variance flattens out after roughly 100 components, so about 100 components are enough to distinguish the people; the sketch below quantifies this with the cumulative explained-variance ratio.
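
As a complement to reading the curve by eye, here is a minimal, self-contained sketch of that quantification. The 95% target is an arbitrary illustrative threshold, not a rule:

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# Flattened Olivetti faces: 400 samples x 4096 pixels
x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)

# Cumulative fraction of the total variance explained by the first k components
cumulative = np.cumsum(pca2.explained_variance_ratio_)

# Smallest k that explains at least 95% of the variance (threshold chosen for illustration)
k = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% of the variance:", k)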

Simplified code:


from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

x, _ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
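
If you would rather not read the number off the curve at all, scikit-learn's PCA can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance (the 0.95 below is again an arbitrary target):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

x, _ = fetch_olivetti_faces(return_X_y=True)

# With 0 < n_components < 1, PCA keeps the smallest number of components
# whose cumulative explained-variance ratio reaches that fraction
pca95 = PCA(n_components=0.95).fit(x)
print("Components kept for 95% of the variance:", pca95.n_components_)

The reduced data from pca95.transform(x) can then be fed to the downstream classifier, instead of looping over candidate component counts and re-measuring accuracy each time.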
