How do I transform items using an sklearn Pipeline?

Question

I have a simple scikit-learn Pipeline with two steps: a TfidfVectorizer followed by a LinearSVC.

I have fit the pipeline using my data. All good.

Now I want to transform (not predict!) an item, using my fitted pipeline.

I tried pipeline.transform([item]), but it gives a different result from pipeline.named_steps['tfidf'].transform([item]). Even the shape and type of the results differ: the first is a 1x3000 CSR matrix, the second a 1x15000 CSC matrix. Which one is correct? Why do they differ?

How do I transform items, i.e. get an item's vector representation before the final estimator, when using scikit-learn's Pipeline?

Answer 1

You can't call transform on a pipeline whose last step is not a transformer. If you want to call transform on such a pipeline, the last estimator must be a transformer.

Even the method's docstring says so:

Applies transforms to the data, and the transform method of the final estimator. Valid only if the final estimator implements transform.
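A quick way to see that rule in practice is to check whether a pipeline exposes transform at all. The sketch below assumes a recent scikit-learn release, where LinearSVC no longer has a transform method; the second pipeline, ending in TruncatedSVD, is only an illustrative stand-in for "a pipeline whose final step is a transformer" and does not come from the question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Final step is a classifier: in recent releases LinearSVC has no transform().
clf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC()),
])

# Final step is a transformer (illustrative choice), so the pipeline can transform.
svd_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2)),
])

# Pipeline only exposes transform when its final estimator implements it.
clf_has_transform = hasattr(clf_pipeline, "transform")
svd_has_transform = hasattr(svd_pipeline, "transform")
print(clf_has_transform, svd_has_transform)
```

On the older release used in the question, LinearSVC still carried a deprecated transform, which is why the call did not fail there.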

Also, Pipeline has no method that applies every estimator except the last one. You can, though, make your own pipeline class that inherits everything from scikit-learn's Pipeline and adds one method, something like:

from sklearn.pipeline import Pipeline


class TransformingPipeline(Pipeline):  # class name is illustrative
    def just_transforms(self, X):
        """Apply every step except the last one to the data.

        Parameters
        ----------
        X : iterable
            Data to transform. Must fulfill input requirements of the first
            step of the pipeline.
        """
        Xt = X
        # self.steps is a list of (name, estimator) pairs; skip the final one.
        for _, step in self.steps[:-1]:
            Xt = step.transform(Xt)
        return Xt

Answer 2

The reason the results differ (and the reason calling transform works at all) is that LinearSVC also has a transform method (now deprecated) that performs feature selection.

If you want to transform using just the first step, pipeline.named_steps['tfidf'].transform([item]) is the right thing to do. If you would like to transform using all but the last step, olologin's answer provides the code.
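As a minimal sketch of both options, the example below fits a two-step pipeline on a toy corpus (the documents, labels, and item are invented for illustration) and transforms one item with the first step only. It also uses pipeline slicing, which, as far as I know, is available in scikit-learn 0.21 and later and gives a sub-pipeline of every step but the last without subclassing.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices fell",
    "markets rallied today",
]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC()),
])
pipeline.fit(docs, labels)

item = "the cat chased the dog"

# Transform with the first step only.
vec_first = pipeline.named_steps["tfidf"].transform([item])

# Transform with every step except the last (slicing returns a sub-pipeline
# that shares the already fitted steps).
vec_all_but_last = pipeline[:-1].transform([item])

same = np.allclose(vec_first.toarray(), vec_all_but_last.toarray())
print(vec_first.shape, same)
```

With only one transformer before the classifier, the two results coincide; with several preprocessing steps, the sliced pipeline applies all of them in order.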

By default, all steps of the pipeline are executed, including the transform of the last step, which here is the feature selection performed by the LinearSVC.
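That also explains the shape mismatch in the question: the 1x15000 matrix is the raw tf-idf vector, and the 1x3000 matrix is what remains after feature selection. In current scikit-learn the selection has to be requested explicitly; the sketch below uses SelectFromModel as a stand-in and does not claim to reproduce the old LinearSVC.transform thresholds exactly. The corpus, labels, and step names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices fell",
    "markets rallied today",
]
labels = [0, 0, 1, 1]

# The final step is now a transformer that keeps the features the
# fitted LinearSVC weights most heavily.
selecting_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectFromModel(LinearSVC())),
])
selecting_pipeline.fit(docs, labels)

item = "the cat chased the dog"

# Full tf-idf vector versus the vector after feature selection.
full = selecting_pipeline.named_steps["tfidf"].transform([item])
reduced = selecting_pipeline.transform([item])
print(full.shape, reduced.shape)
```

The reduced matrix has the same single row but fewer (or at most as many) columns, mirroring the 15000 versus 3000 columns reported in the question.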


Answer 1: score 14, accepted, 2015-11-02 06:06:52

Answer 2: score 7, 2015-11-03 16:33:20
