
Questions on ensemble techniques in machine learning

I am studying ensemble methods in machine learning, and while reading some articles online I ran into two questions.

1.

In this article, it mentions:

Instead, model 2 may have a better overall performance on all the data points, but it has worse performance on the very set of points where model 1 is better. The idea is to combine these two models where they perform the best. This is why creating out-of-sample predictions have a higher chance of capturing distinct regions where each model performs the best.


But I still don't get the point: why wouldn't training on all of the training data avoid this problem?

2.

From this article, in the prediction section, it mentions:

Simply, for a given input data point, all we need to do is to pass it through the M base-learners and get M number of predictions, and send those M predictions through the meta-learner as inputs

But in the training process we use k-fold splits of the training data to train the M base-learners, so for prediction should I also train the M base-learners on all of the training data?

Assume red and blue were the best models you could find.

One works better in region 1, the other in region 2.

Now you would also train a classifier to predict which model to use, i.e., you would try to learn the two regions.

Do the validation on the outside: you can overfit if you give the two inner models access to data that the meta-model does not see.
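As a rough sketch of that idea (assuming scikit-learn; the two base models and the 5-fold split are arbitrary choices, not taken from the original posts), the meta-model can be trained on out-of-fold predictions, so it never sees predictions that a base model made on its own training data:

```python
# Minimal sketch of out-of-fold predictions for the meta-model (assumes scikit-learn).
# Each base model's prediction for a point comes from a fold that did not train on it,
# so the meta-model never sees "leaked" in-sample predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

base_models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)]

# Out-of-fold probability predictions, one column per base model.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model learns, from these out-of-fold predictions,
# where each base model tends to be reliable.
meta_model = LogisticRegression()
meta_model.fit(oof, y)
```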

The idea in ensembles is that a group of weak predictors can outperform a single strong predictor. So if we train different models that produce different predictions and use majority rule as the final output of our ensemble, this result is better than just trying to train one single model. Assume, for example, that the data consist of two distinct patterns, one linear and one quadratic. Then a single classifier can either overfit or produce inaccurate results. You can read this tutorial to learn more about ensembles, bagging, and boosting.
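To illustrate the majority-rule idea, here is a minimal sketch assuming scikit-learn; the three base models are arbitrary examples, not taken from the tutorial:

```python
# Rough illustration of majority-rule voting (assumes scikit-learn; model choices are arbitrary).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Three different models that make different kinds of mistakes.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
    ],
    voting="hard",  # majority rule over the three predicted labels
)

print(cross_val_score(ensemble, X, y, cv=5).mean())
```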

1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We will hold that data for validation purpose, just like the way we do in K-fold

2) "so should I also train M base-learner based on all train data for the input to predict?" - If you give same data to all the learners then the output of all of them would be same and there is no use in creating them. So we will give a subset of data to each learner.

For question 1, I will explain why we train two models by considering the opposite approach. Suppose you train one model on all the data points. During training, whenever the model sees a data point belonging to the red class, it tries to fit itself so that it classifies red points with minimal error. The same is true for data points belonging to the blue class. So during training the model keeps leaning towards one specific kind of data point (either red or blue), and in the end it tries to fit itself so that it does not make too many mistakes on either kind, leaving you with an averaged, compromise model. But if you instead train two models on the two different datasets, each model is trained on its own specific dataset and does not have to care about the data points belonging to the other class.

It becomes clearer with the following metaphor. Suppose there are two people who specialize in two completely different jobs. Now a job comes along, and you tell them that both of them have to do it, each doing 50% of the work. Think about what kind of result you will get in the end. Now also think about what the result would be if you instead told each person to work only on the job they are best at.

For question 2, you have to split the training dataset into M datasets, and during training give the M datasets to the M base-learners.
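A minimal sketch of this scheme, assuming scikit-learn (M = 3 and the particular base models are arbitrary choices): the training set is split into M parts, each base learner is trained on its own part, and, as in the article quoted in question 2, a new point is passed through all M base learners before their M predictions go to the meta-learner. For simplicity the meta-learner here is fit on the base learners' training-set predictions; the out-of-fold variant is sketched in an earlier answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)

M = 3
base_models = [LogisticRegression(max_iter=1000),
               DecisionTreeClassifier(max_depth=3),
               KNeighborsClassifier(n_neighbors=5)]

# Train each base learner on its own slice of the training data.
for model, Xi, yi in zip(base_models, np.array_split(X, M), np.array_split(y, M)):
    model.fit(Xi, yi)

# Meta-learner trained on the M base learners' predictions (simplified, in-sample version).
train_preds = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
meta_model = LogisticRegression().fit(train_preds, y)

# Prediction: pass a new point through the M base learners, then through the meta-learner.
x_new = X[:1]
new_preds = np.column_stack([m.predict_proba(x_new)[:, 1] for m in base_models])
print(meta_model.predict(new_preds))
```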
