
What's the impact of fit_transform in machine learning?

We usually apply .fit_transform() to X_train and .transform() to X_test.

This is because they are from the same dataset. What if we apply fit_transform() to X_test as well? How will this affect our model?

For example, if you're using a SimpleImputer to impute missing numeric values with the mean, each time you call the fit_transform method you are:

  • calculating the mean of each variable
  • substituting the missing values with the calculated mean

Now, if you apply fit_transform to both train and test, you may get two different means for each variable, and therefore two different preprocessing steps: the test data is no longer transformed the same way as the data the model was trained on.
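To make the difference concrete, here is a minimal sketch using scikit-learn's SimpleImputer. The arrays are made-up toy data, and the variable names (X_test_ok, X_test_refit) are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): one numeric feature with missing values.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[10.0], [np.nan], [20.0]])

# Usual approach: learn the mean on the training set, reuse it on the test set.
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learns the train mean (2.0)
X_test_ok = imputer.transform(X_test)         # fills the gap with the train mean

# Refitting on the test set learns a different mean (15.0) instead.
refit = SimpleImputer(strategy="mean")
X_test_refit = refit.fit_transform(X_test)

print(imputer.statistics_, refit.statistics_)
print(X_test_ok[1, 0], X_test_refit[1, 0])
```

The same missing test value is filled with 2.0 in one case and 15.0 in the other, so a model trained on data imputed with the training mean would see inputs prepared by a different rule.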

Moreover, there's another less statistical, more practical issue. If you deploy the process in production and apply it to a single record, which mean will you use? The train one or the test one? Or would you also apply fit_transform to that record, calculating the mean of a single value?
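One common way to handle the production case is to fit the imputer once on the training data, persist it, and call only transform on incoming records. The sketch below assumes toy data and uses pickle as a stand-in for whatever persistence mechanism you actually use; the names blob, served_imputer, and record are hypothetical.

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data (made up): two numeric features with missing values.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])

# Fit once, at training time. The learned means are 2.0 and 20.0.
imputer = SimpleImputer(strategy="mean").fit(X_train)

# Persist the fitted imputer and load it back, as a serving process might.
blob = pickle.dumps(imputer)
served_imputer = pickle.loads(blob)

# A single incoming record with a missing first feature.
record = np.array([[np.nan, 25.0]])

# Only transform is called: the gap is filled with the training mean.
filled = served_imputer.transform(record)
print(filled)
```

With this setup the question of which mean to use does not arise: every record, in testing or in production, is filled with the statistics learned from the training set.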
