
Different output while using fit_transform vs fit and transform from sklearn

The following code snippet illustrates the issue:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

(nrows, ncolumns) = (1912392, 131)

X = np.random.random((nrows, ncolumns))

pca = PCA(n_components=28, random_state=0)
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)

print((transformed_X1 != transformed_X2).sum()) # Gives output as 53546976


scalar = StandardScaler()
scaled_X1 = scalar.fit_transform(X)
scalar2 = scalar.fit(X)
scaled_X2 = scalar2.transform(X)

(scaled_X1 != scaled_X2).sum() # Gives output as 0

Can someone explain why the first output is not zero while the second one is?

Using this works:

pca = PCA(n_components=28, svd_solver='full')
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)

print(np.allclose(transformed_X1, transformed_X2))
True

Apparently svd_solver='randomized' (which is what 'auto' selects for an input this large) introduces enough difference in the computation between .fit(X).transform(X) and fit_transform(X) to give different results even with the same seed. Also remember that floating point errors make == and != unreliable judges of equality between different computations, so use np.allclose().
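To see why elementwise == is the wrong test here, a minimal sketch of the floating-point issue (unrelated to PCA itself):

```python
import numpy as np

# Two mathematically equal values that differ in the last binary digit:
print(0.1 + 0.2 == 0.3)             # False (0.30000000000000004 vs 0.3)
print(np.allclose(0.1 + 0.2, 0.3))  # True: equal within default rtol/atol

# For arrays, np.allclose checks |a - b| <= atol + rtol * |b| elementwise,
# which is the right test when two code paths compute the same quantity
# through different arithmetic.
a = np.array([0.1, 0.2, 0.3])
b = a.copy()
b[0] = (0.2 + 0.1) - 0.2  # same value, different arithmetic path
print(np.allclose(a, b))            # True
```

This is exactly the situation with the two PCA outputs: the same projection computed through two different sequences of operations.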

It seems like StandardScaler.fit_transform() just calls .fit(X).transform(X) under the hood, so the two code paths are identical and there are no floating point discrepancies to trip you up.
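For what it's worth, scikit-learn's TransformerMixin supplies a default fit_transform that simply chains fit and transform. A minimal sketch of that pattern (MiniScaler here is a simplified stand-in for illustration, not the real StandardScaler):

```python
import numpy as np

class MiniScaler:
    """Simplified stand-in for StandardScaler (illustration only)."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        # Same pattern as the TransformerMixin default: chain the two
        # calls, so fit_transform and fit+transform run identical
        # operations and produce bit-identical output.
        return self.fit(X).transform(X)

rng = np.random.default_rng(0)
X = rng.random((100, 5))

a = MiniScaler().fit_transform(X)
b = MiniScaler().fit(X).transform(X)
print((a != b).sum())  # 0: identical operations, identical bits
```

When fit_transform is instead a separately optimized code path (as in PCA, where it can reuse intermediate results of the SVD), bit-exact agreement with fit+transform is no longer guaranteed.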
