
What is the impact of fit_transform in machine learning?

We usually apply .fit_transform() on X_train and .transform() on X_test.

This is because they are from the same dataset. What if we apply fit_transform() to X_test as well? How will this affect our model?

For example, if you're applying a SimpleImputer to impute numeric missing values with the mean, each time you call the fit_transform method you are:

  • calculating the mean of each variable
  • substituting the missing values with the calculated mean

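Now, if you apply fit_transform to both train and test, you can get two different means for each variable, so the two sets go through two different preprocessing steps. The model was trained on data imputed with the training mean, and test data imputed with a different value no longer matches what it was trained on.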
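To make the difference concrete, here is a minimal sketch using scikit-learn's SimpleImputer. The toy arrays are made up for illustration, and the variable names (refit_imputer and so on) are just placeholders, not anything from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (invented for this example): one numeric column,
# with one missing value in each split.
X_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
X_test = np.array([[10.0], [20.0], [np.nan]])

# Usual approach: learn the mean from the training data only, then reuse it.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
train_mean = imputer.statistics_[0]  # mean of 1, 2, 3

# Calling fit_transform on the test set learns a new mean from the test data.
refit_imputer = SimpleImputer(strategy="mean")
X_test_refit = refit_imputer.fit_transform(X_test)
test_mean = refit_imputer.statistics_[0]  # mean of 10, 20

print("mean learned from train:", train_mean)
print("mean learned from test: ", test_mean)
print("test NaN filled via transform():    ", X_test_imputed[-1, 0])
print("test NaN filled via fit_transform():", X_test_refit[-1, 0])
```

With these numbers, the same missing test value is filled with 2.0 in one case and 15.0 in the other, so the model would see test inputs produced by a different process than the one it was trained with.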

There is also a less statistical, more practical issue. If you deploy the process in production and apply it to a single record, which mean will you use? The train one or the test one? Or would you also apply fit_transform to that record, calculating the mean of a single value?
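For the production case, a rough sketch of what usually works is to fit the imputer once on the training data and then only call transform on each incoming record. The data and the name new_record below are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with two numeric columns.
X_train = np.array([
    [1.0, 10.0],
    [3.0, np.nan],
    [np.nan, 30.0],
])

# Fit once, on the training data only. The learned means live in statistics_.
imputer = SimpleImputer(strategy="mean").fit(X_train)

# A single incoming record with a missing first value.
new_record = np.array([[np.nan, 25.0]])

# transform() reuses the training means; nothing is re-estimated here.
prepared = imputer.transform(new_record)
print(prepared)  # first value is filled with the training mean of column 0
```

Here the missing value is filled with 2.0, the mean of column 0 in the training data, regardless of how many records arrive at prediction time.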

