We usually apply .fit_transform() on X_train and .transform() on X_test. This is because they are from the same dataset. What if we apply fit_transform() to X_test again? How will this affect our model?
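For example, if you're applying a SimpleImputer to impute missing numeric values with the mean, each time you call the fit_transform method you are computing the mean of each column from the data you pass in (the fit step) and then replacing the missing values with that mean (the transform step).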
Now, if you apply fit_transform to both train and test, you could get two different means for each variable, so the train and test sets would go through two different preprocessing steps.
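To make the difference concrete, here is a minimal sketch using scikit-learn's SimpleImputer. The tiny arrays are made up purely for illustration; they are not from the original question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (assumed for illustration): one numeric column with a missing value.
X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[10.0], [np.nan], [30.0]])

# Usual approach: learn the mean on train, reuse it on test.
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
train_mean = imputer.statistics_[0]  # mean learned from X_train only

# Problematic approach: refit on test, which learns a different mean.
refit = SimpleImputer(strategy="mean")
X_test_refit = refit.fit_transform(X_test)
test_mean = refit.statistics_[0]  # mean learned from X_test only

print("mean learned from train:", train_mean)
print("mean learned from test: ", test_mean)
print("test NaN filled with train mean:", X_test_imp[1, 0])
print("test NaN filled with test mean: ", X_test_refit[1, 0])
```

The same missing value in X_test is filled with a different number depending on which data the imputer was fitted on, so a model trained on the first version would be scored on inputs prepared by a different rule.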
Moreover, here's another less statistical, more practical issue. If you deploy the process in production and apply it to a single record, which "mean" will you use? The train one or the test one? Or would you also apply fit_transform to that record, calculating the mean of a single value?
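A short sketch of the production case, assuming the imputer fitted on the training data is kept around and reused. The two-column toy data and the single incoming record are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data (assumed for illustration): two numeric columns.
X_train = np.array([
    [1.0, 10.0],
    [3.0, 30.0],
    [np.nan, 50.0],
])

# Fit once on the training data; this object is what you would keep for production.
imputer = SimpleImputer(strategy="mean").fit(X_train)

# A single incoming record with a missing first value (hypothetical).
record = np.array([[np.nan, 42.0]])

# transform() fills the gap with the mean learned from train, not from the record.
filled = imputer.transform(record)
print(filled)
```

Because the record is only transformed, never fitted, every incoming record is treated with the same training means, no matter how many records arrive at once.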