
numpy and sklearn PCA return different covariance matrices

I'm trying to learn PCA through and through, but interestingly enough, when I use numpy and sklearn I get different covariance matrix results.

The numpy results match this explanatory text here, but the sklearn results differ from both.

Is there any reason why this is so?

d = pd.read_csv("example.txt", header=None, sep = " ")
print(d)
      0     1
0  0.69  0.49
1 -1.31 -1.21
2  0.39  0.99
3  0.09  0.29
4  1.29  1.09
5  0.49  0.79
6  0.19 -0.31
7 -0.81 -0.81
8 -0.31 -0.31
9 -0.71 -1.01

Numpy Results

print(np.cov(d, rowvar = 0))
[[ 0.61655556  0.61544444]
 [ 0.61544444  0.71655556]]

sklearn Results

from sklearn.decomposition import PCA
clf = PCA()
clf.fit(d.values)
print(clf.get_covariance())

[[ 0.5549  0.5539]
 [ 0.5539  0.6449]]

This is because of the default normalization in np.cov:

Default normalization is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is 1, then normalization is by N.

Setting bias=1, the result is the same as PCA's:

In [9]: np.cov(d, rowvar=0, bias=1)
Out[9]:
array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])
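The two matrices differ only by the normalization constant, so one can be recovered from the other by rescaling with (N - 1) / N. A quick check with the question's data:

```python
import numpy as np

# The ten 2-D observations from the question
d = np.array([
    [0.69, 0.49], [-1.31, -1.21], [0.39, 0.99], [0.09, 0.29],
    [1.29, 1.09], [0.49, 0.79], [0.19, -0.31], [-0.81, -0.81],
    [-0.31, -0.31], [-0.71, -1.01],
])
n = len(d)

unbiased = np.cov(d, rowvar=False)           # divides by N - 1 (default)
biased = np.cov(d, rowvar=False, bias=True)  # divides by N

# The two estimates differ only by the constant factor (N - 1) / N
assert np.allclose(biased, unbiased * (n - 1) / n)
print(biased)
```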

I've encountered the same issue, and I think the values differ because the covariance is calculated in a different way. According to the sklearn documentation, the get_covariance() method uses the noise variances to obtain the covariance matrix.
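For reference, `get_covariance()` rebuilds the covariance from the fitted components, explained variances, and `noise_variance_` rather than computing it from the data directly; with all components kept, the noise term is zero. A sketch of that reconstruction (the formula follows the sklearn docs; note that the exact normalization of `explained_variance_` has varied across sklearn versions, which affects whether the result matches the divide-by-N or divide-by-(N-1) estimate):

```python
import numpy as np
from sklearn.decomposition import PCA

# The ten 2-D observations from the question
d = np.array([
    [0.69, 0.49], [-1.31, -1.21], [0.39, 0.99], [0.09, 0.29],
    [1.29, 1.09], [0.49, 0.79], [0.19, -0.31], [-0.81, -0.81],
    [-0.31, -0.31], [-0.71, -1.01],
])

clf = PCA()  # all components kept, so noise_variance_ is 0
clf.fit(d)

# get_covariance() reconstructs
#   components_.T @ diag(explained_variance_ - noise) @ components_ + noise * I
var = np.maximum(clf.explained_variance_ - clf.noise_variance_, 0.0)
recon = (clf.components_.T * var) @ clf.components_
recon += clf.noise_variance_ * np.eye(d.shape[1])

assert np.allclose(recon, clf.get_covariance())
```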
