
What's the impact of fit_transform in machine learning?

Original post: 2020-09-03 14:23:54 | Tags: machine-learning, scikit-learn

We usually apply .fit_transform() on X_train and .transform() on X_test.

This is because they are from the same dataset. What if we apply fit_transform() to X_test again? How will this affect our model?

1 answer

For example, if you're applying a SimpleImputer to impute missing numeric values with the mean, each time you call the fit_transform method you are:

calculating the mean of each variable
substituting the missing values with the calculated mean
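
The two steps above can be seen in a small sketch. The toy array and the variable names (X_train, learned_mean) are made up for illustration; statistics_ is the attribute where SimpleImputer stores the values it learned during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-column training set with a single missing value.
X_train = np.array([[1.0], [np.nan], [3.0], [5.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform = fit (compute the column mean) + transform (fill the NaNs).
X_train_imputed = imputer.fit_transform(X_train)

# The mean learned from the observed values 1, 3 and 5.
learned_mean = imputer.statistics_[0]

print(learned_mean)
print(X_train_imputed.ravel())
```

The learned mean lives on the fitted imputer object, which is why the same object has to be reused later on.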

Now, if you apply fit_transform to both train and test, it could give two different means for each variable, so the train and test sets end up going through two different preprocessing steps.
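
To make the difference concrete, here is a hedged sketch with invented numbers. It compares reusing the imputer fitted on the training data against refitting a new one on the test data; the arrays and names are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: the test set happens to have much larger values.
X_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_test = np.array([[10.0], [np.nan], [20.0]])

# Usual approach: learn the mean on train, reuse it on test.
imputer = SimpleImputer(strategy="mean")
imputer.fit_transform(X_train)
test_consistent = imputer.transform(X_test)

# Refitting on test: the missing value is filled with the test mean instead.
refit_imputer = SimpleImputer(strategy="mean")
test_refit = refit_imputer.fit_transform(X_test)

print(test_consistent[1, 0])  # filled with the train mean
print(test_refit[1, 0])       # filled with the test mean
```

The same missing value gets a different replacement depending on which set the imputer was fitted on, so the model would be evaluated on data prepared differently from the data it was trained on.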

Moreover, here's another less statistical, more practical issue. If you deploy the process in production and apply it to a single record, which mean will you use? The train one or the test one? Or would you also apply fit_transform to that record, calculating the mean of a single observation?
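
One way to handle the single-record case, sketched under the assumption that the imputer fitted on the training data is kept around (in memory here; in practice it would be saved alongside the model). The two-column data and the name new_record are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical two-feature training set with missing values.
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# Fit once, on the training data only.
imputer = SimpleImputer(strategy="mean").fit(X_train)

# A single incoming record with a missing first feature.
new_record = np.array([[np.nan, 25.0]])

# Only transform: the record is filled with the mean learned on train.
scored_input = imputer.transform(new_record)

print(scored_input)
```

Fitting on the single record would not help here: the first feature is missing in that record, so there is nothing to average.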