I have two questions about PCA with scikit-learn.
Let's suppose I have the following data:
fullmatrix = [[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7],
              [2.0, 1.6],
              [1.0, 1.1],
              [1.5, 1.6],
              [1.1, 0.9]]
Now I do the PCA calculations:
from sklearn.decomposition import PCA

sklearn_pca = PCA()
Y_sklearn = sklearn_pca.fit_transform(fullmatrix)
print(Y_sklearn)  # the data transformed with all 2 eigenvectors
print(sklearn_pca.explained_variance_ratio_)  # variance explained by each eigenvector
print(sklearn_pca.components_)  # eigenvectors, ordered by decreasing eigenvalue
First question: How can I project Y_sklearn back onto the original scale? (I know we should get back the same data as fullmatrix, since I'm using all the eigenvectors; it's just to check that it was done right.)
Second question: How can I set a threshold for the minimum acceptable total variance based on sklearn_pca.explained_variance_ratio_? For example, say I want to keep adding eigenvectors until the cumulative explained_variance_ratio_ exceeds 95%. In this case it's easy: we just use the first eigenvector, since it alone explains about 0.963 (96.3%) of the variance. But how can we do this in a more automated way?
First: sklearn_pca.inverse_transform(Y_sklearn)
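As a sanity check, the round trip can be verified numerically. A minimal sketch using the data matrix from the question:

```python
import numpy as np
from sklearn.decomposition import PCA

fullmatrix = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

sklearn_pca = PCA()  # keep all components, so no information is lost
Y_sklearn = sklearn_pca.fit_transform(fullmatrix)

# Projecting back should reproduce the original data (up to float rounding)
reconstructed = sklearn_pca.inverse_transform(Y_sklearn)
print(np.allclose(reconstructed, fullmatrix))  # True
```

With fewer than all components, `inverse_transform` still maps back to the original scale, but the result is only an approximation of the input.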
Second:
import numpy as np

thr = 0.95
# Where does the cumulative sum of explained variance reach the threshold?
is_exceeds = np.cumsum(sklearn_pca.explained_variance_ratio_) >= thr
# The minimal index that reaches the threshold; add 1 to get the
# minimum number of eigenvectors needed to keep this much variance
k = np.min(np.where(is_exceeds)) + 1
# Or you can simply initialize the model with thr as the n_components parameter:
# a float in (0, 1) tells PCA to keep just enough components for that variance
sklearn_pca = PCA(n_components=thr)
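Putting both variants together on the data from the question (a quick sketch; variable names as above):

```python
import numpy as np
from sklearn.decomposition import PCA

fullmatrix = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
thr = 0.95

# Manual approach: count components until cumulative variance reaches thr
full = PCA().fit(fullmatrix)
k = np.min(np.where(np.cumsum(full.explained_variance_ratio_) >= thr)) + 1
print(k)  # 1, since the first eigenvector already explains ~96.3%

# Built-in approach: a float n_components keeps just enough components
reduced = PCA(n_components=thr).fit(fullmatrix)
print(reduced.n_components_)  # also 1
```

The built-in variant is preferable when you want the reduced transform directly; the manual variant is useful when you need `k` itself, e.g. to report it or reuse it elsewhere.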