Finding the best combination of algorithms in an sklearn machine learning tool chain

Question

In sklearn it is possible to create a pipeline to optimize the complete tool chain of a machine learning setup, as shown in the following sample:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)

Now, a pipeline by definition represents a serial process. But what if I want to compare different algorithms at the same level of a pipeline? Say I want to try another feature transformation algorithm in addition to PCA and another machine learning algorithm, such as trees, in addition to SVM, and get the best of the 4 possible combinations. Can this be represented by some kind of parallel pipe, or is there a meta-algorithm for this in sklearn?

Answer 1

A pipeline is not a parallel process. It is sequential (hence the name "pipeline"). See the documentation, which says:

Sequentially apply a list of transforms and a final estimator. [...] The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

Thus, you should create two pipelines that differ in just one step. You can then compare the results and keep the better one. If you want to compare more estimators, you can automate the process.

Here is a simple example:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf1 = SVC(kernel='rbf')
clf2 = RandomForestClassifier()

# f_classif scores features against a classification target,
# which matches the two classifiers used here
feat_selec1 = SelectKBest(f_classif)
feat_selec2 = PCA()

# Build one pipeline per (feature step, classifier) pair: 4 in total
for selec in [('SelectKBest', feat_selec1), ('PCA', feat_selec2)]:
    for clf in [('SVC', clf1), ('RandomForest', clf2)]:
        pipe = Pipeline([selec, clf])
        # Do your training / testing / cross-validation here
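
As an alternative to the manual loop, the same 4 combinations can probably be searched in one go, because GridSearchCV can treat a whole pipeline step as a parameter to replace. The following is only a sketch under assumptions that are not in the original post: the iris toy dataset, the step names "reduce_dim" and "clf", and the hyperparameter values (n_components=2, k=2, n_estimators=50) are all chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, assumed here only so the sketch runs end to end
X, y = load_iris(return_X_y=True)

# Placeholder steps; the grid below swaps in each candidate by step name
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Each key is a step name, each value a list of estimators to try in that slot
param_grid = {
    "reduce_dim": [PCA(n_components=2), SelectKBest(f_classif, k=2)],
    "clf": [SVC(kernel="rbf"),
            RandomForestClassifier(n_estimators=50, random_state=0)],
}

# 2 x 2 = 4 candidate pipelines, each scored with 5-fold cross-validation
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

# Report which estimator class won in each slot
best_steps = {name: type(step).__name__
              for name, step in search.best_estimator_.steps}
print(best_steps, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is the winning pipeline refit on all the data, and search.cv_results_ holds the scores of all 4 combinations for comparison.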

Answer 2

A pipeline is something sequential:

Data -> Process input with algorithm A -> Process input with algorithm B -> ...

Something parallel, which I think is what you're looking for, is called an "ensemble". For example, in a classification context you can train several SVMs, each on different features:

      |-SVM A gets features x_1, ... x_n       -> vote for class 1 -|
DATA -|-SVM B gets features x_{n+1}, ..., x_m  -> vote for class 1 -| -> Classify
      |-SVM C gets features x_{m+1}, ..., x_p  -> vote for class 0 -|

In this small example, 2 of the 3 classifiers voted for class 1 and the third voted for class 0. So by majority vote, the ensemble classifies the data as class 1. (Here, the classifiers are executed in parallel.)
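
The majority vote itself can be sketched in a few lines of plain Python. The list of votes simply mirrors the three classifiers in the diagram; the variable names are made up for illustration.

```python
from collections import Counter

# Votes from the three classifiers in the diagram: SVM A, SVM B, SVM C
votes = [1, 1, 0]

# most_common(1) returns the single (label, count) pair with the highest count
winner, count = Counter(votes).most_common(1)[0]
print(winner, count)
```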

Of course, you can have several pipelines in an ensemble.
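
One way this might look in sklearn is a VotingClassifier whose members are themselves pipelines. This is a hedged sketch, not something the answer spells out: the iris dataset, the three particular pipelines, their names, and the hyperparameters are assumptions picked for illustration. Unlike the diagram, every member here sees all input features and reduces them in its own way.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data, assumed here only so the sketch runs end to end
X, y = load_iris(return_X_y=True)

# Three independent pipelines, each with its own preprocessing step
pca_svm = Pipeline([("reduce_dim", PCA(n_components=2)), ("svm", SVC())])
kbest_rf = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
scaled_logreg = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# voting="hard" means each pipeline casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[("pca_svm", pca_svm),
                ("kbest_rf", kbest_rf),
                ("scaled_logreg", scaled_logreg)],
    voting="hard",
)

# Cross-validate the ensemble as a whole, like any other estimator
scores = cross_val_score(ensemble, X, y, cv=5)
print(scores.mean())
```

An odd number of members avoids ties in a hard vote for two-class problems; with more classes ties remain possible.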

See sklearn's Ensemble methods documentation for a pretty good summary.

A short image summary I made a while ago for different ensemble methods: [image not available]
