
sklearn: Do you need to create a new instance of a transformer for each set of data?

I'm new to data science and scikit-learn, so I apologize if this is a basic question. Do we need to make a new instance of a sklearn class when we want to train on a new dataset? For example, I am currently doing:

transformer = PowerTransformer()
transformed1 = transformer.fit_transform(data1.to_numpy())

transformer = PowerTransformer()
transformed2 = transformer.fit_transform(data2.to_numpy()) 
...

I have multiple sets of data that I want to transform so that I can run KNNImputer (again using this repeated re-declaration approach).

I read that the .fit method internally stores the lambdas that it used to fit the data passed in, but do the stored lambdas get overwritten with each call to .fit, or are they influenced by the fit on the previous data?

Would it be wrong to do:

transformer = PowerTransformer()
transformed1 = transformer.fit_transform(data1.to_numpy())
transformed2 = transformer.fit_transform(data2.to_numpy())
...

Thank you in advance!

No, that wouldn't be wrong. In both cases you are first fitting to the data and then transforming it, and every time you call .fit it overwrites the previously learned parameters. Here is an example:

import numpy as np
from sklearn.impute import SimpleImputer

a = np.array([[1, 3], 
              [np.nan, 2], 
              [5, 9]])

c = np.array([[3, 4], 
              [6, 12], 
              [8, np.nan]])

imp = SimpleImputer(strategy="mean")
a1 = imp.fit_transform(a)
c1 = imp.fit_transform(c)

Now let's look at the outputs:

a1: array([[1., 3.],
           [3., 2.],
           [5., 9.]])

c1: array([[ 3.,  4.],
           [ 6., 12.],
           [ 8.,  8.]])

It takes the mean of each column (as the sklearn docs say) and imputes it for the missing values, using only the data from the most recent fit. This should work the same way in KNNImputer too.
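To see that each call to .fit overwrites the learned parameters rather than blending them, you can inspect the fitted attributes directly. A minimal sketch (assuming scikit-learn is installed) that checks SimpleImputer's statistics_ and PowerTransformer's lambdas_ after refitting; the arrays b and d are made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer

a = np.array([[1, 3], [np.nan, 2], [5, 9]])
c = np.array([[3, 4], [6, 12], [8, np.nan]])

imp = SimpleImputer(strategy="mean")
imp.fit(a)
print(imp.statistics_)  # column means of a: [3. 4.66666667]
imp.fit(c)
print(imp.statistics_)  # replaced by column means of c: [5.66666667 8.]

# Same story for PowerTransformer: lambdas_ is re-estimated on each fit.
b = np.array([[1.0], [2.0], [5.0]])
d = np.array([[3.0], [60.0], [8.0]])

pt = PowerTransformer()
pt.fit(b)
lambdas_b = pt.lambdas_.copy()
pt.fit(d)
print(np.allclose(lambdas_b, pt.lambdas_))  # the old lambdas were discarded
```

So a single reused instance is fine as long as you only care about the transform of the data it was most recently fitted on; keep separate fitted instances only if you need to apply a particular fit (e.g. via .transform or .inverse_transform) later.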
