
Differences between mlab PCA and sklearn PCA

I have a set of "2-dimensional" data that I have to study using a PCA decomposition. As a first step, I tried the matplotlib.mlab library:

import numpy as np
from matplotlib.mlab import PCA  # note: mlab.PCA has since been removed from matplotlib

data = np.loadtxt("Data.txt")    # one sample per row, one variable per column
result = PCA(data)               # standardizes (centers and scales) the data by default
#....

I then compared the scatter plot of "Data.txt" with the principal components found by mlab (stored in result.Wt). The result is the following: [figure: mlab attempt]
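
The plots themselves are not reproduced here. For context, a minimal sketch of how such an overlay might be produced (Wt and mu are mlab's actual attribute names; the scale factor 3 is an arbitrary choice for visibility):

import matplotlib.pyplot as plt

# scatter the raw data and draw the two principal axes stored in
# result.Wt (one component per row), anchored at the column means result.mu
plt.scatter(data[:, 0], data[:, 1], s=5)
for component in result.Wt:
    plt.plot([result.mu[0], result.mu[0] + 3 * component[0]],
             [result.mu[1], result.mu[1] + 3 * component[1]])
plt.show()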

As you can see, the result is not optimal. I therefore tried to do the same thing using the sklearn.decomposition library:

import numpy as np
from sklearn.decomposition import PCA

data = np.loadtxt("Data.txt")
pca = PCA(n_components=2, whiten=True)  # whiten rescales the projected components
pca.fit(data)

The results this time are much better: [figure: sklearn attempt]
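
For reference, what sklearn fitted can be inspected directly; components_, explained_variance_ratio_, and mean_ below are sklearn's actual attribute names:

print(pca.components_)                # one principal axis per row
print(pca.explained_variance_ratio_)  # fraction of total variance per component
print(pca.mean_)                      # per-feature mean subtracted before fitting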

I didn't expect such a big difference between these two libraries. My question, then, is: what are the possible reasons for such a large difference in my results?

As always with questions that are not reproducible (Data.txt is not provided): let's guess!

  • matplotlib's PCA standardizes the data by default
  • sklearn's PCA does not (and you also activated whitening; don't you want to compare the results with matching settings? the sketch below shows what each setting does to the data)
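
Roughly, this is the preprocessing each library applies before the actual decomposition (a sketch from memory of the old mlab source; the exact ddof used for the standard deviation may differ):

import numpy as np

X = np.loadtxt("Data.txt")

# what mlab.PCA(X) fits on by default (standardize=True):
# centered AND scaled to unit variance per column
X_mlab = (X - X.mean(axis=0)) / X.std(axis=0)

# what sklearn's PCA fits on: centered only, never scaled
X_sklearn = X - X.mean(axis=0)

# whiten=True in sklearn does not change the fitted axes; it only rescales
# the *projected* data so each component has unit variance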

My guess for the matplotlib case is that you plotted the PCA axes, which were fitted on the standardized data, on top of the original, unstandardized data (which is obviously not centered at the mean, since it takes only positive values on both axes).

So:

  • deactivate matplotlib's standardization
  • deactivate sklearn's whitening
  • and compare (a minimal sketch of that comparison follows below)...
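
A sketch of that like-for-like comparison, assuming a matplotlib version old enough to still ship mlab.PCA (the standardize keyword is mlab's own):

import numpy as np
from matplotlib.mlab import PCA as mlabPCA
from sklearn.decomposition import PCA as skPCA

data = np.loadtxt("Data.txt")

mlab_result = mlabPCA(data, standardize=False)          # no scaling by the std
sk_pca = skPCA(n_components=2, whiten=False).fit(data)  # no whitening

# both libraries still center the data, so the axes should now agree
# (possibly up to a sign flip per component)
print(mlab_result.Wt)
print(sk_pca.components_)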
