
Machine Learning model generalisation

Original post: 2019-05-27 19:20:22 3 1 machine-learning/ model/ knime

I'm new to machine learning, and I'd like to ask a question about model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in controlling the input parameters to obtain certain properties in the final part.

More specifically, I'm interested in 8 parameters (say, P1, P2, ..., P8). To limit the number of pieces I need to produce while maximizing the combinations of parameters explored, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4), while the others will be held constant. For the second set, I'll do the opposite (P5 ... P8 varied, P1 ... P4 held constant).
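To make the two-set design concrete, here is a minimal sketch of how the runs could be laid out. The baseline value, the three levels per parameter, and the full-factorial layout are all assumptions made for illustration; the post does not say how the parameters are varied.

```python
from itertools import product

# Hypothetical baseline and levels; the post gives no actual values.
BASELINE = {f"P{i}": 1.0 for i in range(1, 9)}
LEVELS = (0.5, 1.0, 1.5)

FIRST = ["P1", "P2", "P3", "P4"]
SECOND = ["P5", "P6", "P7", "P8"]


def build_set(varied, baseline=BASELINE, levels=LEVELS):
    """Full-factorial runs over `varied`; all other parameters stay at baseline."""
    runs = []
    for combo in product(levels, repeat=len(varied)):
        run = dict(baseline)
        run.update(zip(varied, combo))
        runs.append(run)
    return runs


def off_baseline(run, names):
    """True if any parameter in `names` differs from its baseline value."""
    return any(run[name] != BASELINE[name] for name in names)


set_1 = build_set(FIRST)   # P5..P8 held constant
set_2 = build_set(SECOND)  # P1..P4 held constant
combined = set_1 + set_2

# Runs in which parameters from BOTH groups are moved away from baseline.
both_groups_varied = [
    run for run in combined
    if off_baseline(run, FIRST) and off_baseline(run, SECOND)
]

print(len(set_1), len(set_2), len(combined))  # 81 81 162
print(len(both_groups_varied))                # 0
```

The second printed value shows what the question is really about: with this layout, no run changes a parameter from each group at the same time, so the data carries no direct information about how P1 ... P4 interact with P5 ... P8.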

So I'd like to know whether it's possible to build a single model that takes all eight parameters as inputs to predict the properties of the final part. I ask because, since I'm not varying all 8 variables at once, I thought that maybe I would have to build one model for each set of parameters, and that the predictions of the 2 different models couldn't be related to each other.

Thanks in advance.

1 answer

In most cases, two separate models will be more accurate than one big model. The reason is that each local model looks at only 4 features and can identify patterns among them to make predictions.

But this particular approach will most certainly fail to scale. Right now you have only two sets of data, but what if that grows to 20 sets? You will not be able to create and maintain 20 ML models in production.

What works best for your case will need some experimentation. Take a random sample of the data and train ML models on it: one big model and two local models. Then evaluate their performance, looking not just at accuracy but also at F1 score, AUC-PR and the ROC curve, to find out what works best for you. If the big model does not show a major performance drop compared with the local models, then one big model for the entire dataset is the better option. If you know that your data will always be divided into these two sets and you don't care about scalability, then go with two local models.
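The comparison described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the data is synthetic, the "true" response is a made-up additive linear function, and the models are plain least-squares fits done with NumPy. Because the part properties in the question sound like continuous quantities, the sketch scores with RMSE on held-out runs; F1, AUC-PR and ROC would apply instead if the target were a class label.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PARAMS = 8
FIRST = [0, 1, 2, 3]    # columns for P1..P4
SECOND = [4, 5, 6, 7]   # columns for P5..P8
TRUE_W = np.arange(1.0, 9.0)  # hypothetical effect of each parameter


def make_set(varied_cols, n=200, baseline=1.0):
    """Synthetic runs: `varied_cols` are sampled, the rest stay at baseline."""
    X = np.full((n, N_PARAMS), baseline)
    X[:, varied_cols] = rng.uniform(0.5, 1.5, size=(n, len(varied_cols)))
    y = X @ TRUE_W + rng.normal(0.0, 0.1, size=n)  # made-up response + noise
    return X, y


def fit_linear(X, y):
    """Least-squares fit with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def split(X, y, n_train=150):
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


X1, y1 = make_set(FIRST)
X2, y2 = make_set(SECOND)
X1_tr, y1_tr, X1_te, y1_te = split(X1, y1)
X2_tr, y2_tr, X2_te, y2_te = split(X2, y2)

# One big model: all 8 inputs, trained on both sets together.
big = fit_linear(np.vstack([X1_tr, X2_tr]), np.concatenate([y1_tr, y2_tr]))
rmse_big = rmse(
    np.concatenate([y1_te, y2_te]),
    predict(big, np.vstack([X1_te, X2_te])),
)

# Two local models: each sees only the 4 inputs that vary in its own set.
local_1 = fit_linear(X1_tr[:, FIRST], y1_tr)
local_2 = fit_linear(X2_tr[:, SECOND], y2_tr)
local_pred = np.concatenate([
    predict(local_1, X1_te[:, FIRST]),
    predict(local_2, X2_te[:, SECOND]),
])
rmse_local = rmse(np.concatenate([y1_te, y2_te]), local_pred)

print(f"one big model:    RMSE = {rmse_big:.3f}")
print(f"two local models: RMSE = {rmse_local:.3f}")
```

With real measurements, `make_set` would be replaced by loading the two batches of parts, and the linear fit by whatever model family is being considered. Note that both scores are measured only on runs that look like the training data; neither says how well a model predicts a part where parameters from both groups are changed at once.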