Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required

Question

I'm finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). It's Python 3.4.2:

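import numpy
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# train holds (text, label) records
df = pd.DataFrame.from_records(train)

test = [blah1, blah2, blah3]

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])

pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
predicted = pipeline.predict(test)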

When I run it, I get:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

This is for the line pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1])).

I've experimented a lot with solutions through numpy, scipy, and so forth, but I still don't know how to fix it. And yes, similar questions have come up before, but not inside a pipeline. Where do I have to apply toarray or todense?

Answer 1

Unfortunately those two are incompatible. A CountVectorizer produces a sparse matrix and the RandomForestClassifier requires a dense matrix. It is possible to convert using X.todense(). Doing this will substantially increase your memory footprint.
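To make the memory point concrete, here is a minimal sketch that compares what the sparse matrix stores with what the dense array stores. The three-document corpus is made up for illustration, and toarray() is used as the ndarray-returning counterpart of todense().

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]

X = CountVectorizer().fit_transform(docs)

# CountVectorizer returns a SciPy sparse matrix.
print(sparse.issparse(X), X.shape)

# The sparse matrix keeps only non-zero entries; the dense array keeps every cell.
dense = X.toarray()
print("stored non-zero entries:", X.nnz)
print("cells in the dense array:", dense.size)
```

On a real corpus with tens of thousands of vocabulary terms the gap between the two numbers grows quickly, which is where the extra memory goes.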

Below is sample code to do this, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, which allows you to call .todense() in a pipeline stage.

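from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    """Pipeline step that converts a sparse matrix to a dense one."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless.
        return self

    def transform(self, X, y=None, **fit_params):
        # todense() returns a numpy.matrix; X.toarray() would return a plain ndarray.
        return X.todense()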

Once you have your DenseTransformer, you can add it as a pipeline step.

pipeline = Pipeline([
     ('vectorizer', CountVectorizer()), 
     ('to_dense', DenseTransformer()), 
     ('classifier', RandomForestClassifier())
])
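For reference, an end-to-end sketch of this approach on made-up toy data follows. Two things here are assumptions rather than part of the answer: the texts and labels are invented, and the transformer calls toarray() instead of todense(), because recent scikit-learn releases may reject the numpy.matrix that todense() returns.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a SciPy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X.toarray()


# Invented toy data, for illustration only.
train_texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to noon",
]
train_labels = ["spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("to_dense", DenseTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Pass raw strings; the vectorizer step handles tokenization.
pipeline.fit(train_texts, train_labels)
predicted = pipeline.predict(["buy pills now", "noon meeting"])
print(predicted)
```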

Another option would be to use a classifier meant for sparse data, like LinearSVC.

from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])

Answer 2

The most terse solution would be to use a FunctionTransformer to convert to dense: this will automatically implement the fit, transform and fit_transform methods as in David's answer. Additionally, if I don't need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function to enable a more minimalist language for describing the model:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
     CountVectorizer(), 
     FunctionTransformer(lambda x: x.todense(), accept_sparse=True), 
     RandomForestClassifier()
)
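A runnable variant of the same idea is sketched below. It swaps the lambda for a named function (lambdas cannot be pickled with the standard pickle module, which matters if you want to save the fitted pipeline) and uses toarray() so the step returns a plain ndarray. The to_dense helper and the toy data are invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Hypothetical helper: return a dense NumPy array from a sparse matrix."""
    return X.toarray()


# Invented toy data, for illustration only.
texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to noon",
]
labels = ["spam", "ham", "spam", "ham"]

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense, accept_sparse=True),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipeline.fit(texts, labels)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)
```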

Answer 3

Random forests in 0.16-dev now accept sparse data.
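Assuming a scikit-learn release from 0.16 onward, the original two-step pipeline should therefore work without any dense conversion. A minimal sketch on invented toy data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy data, for illustration only.
texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to noon",
]
labels = ["spam", "ham", "spam", "ham"]

# No dense conversion step: the forest is fitted on the sparse matrix directly.
pipeline = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipeline.fit(texts, labels)
prediction = pipeline.predict(["cheap watches"])
print(prediction)
```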

Answer 4

You can convert a pandas Series to an array using the .values attribute.

pipeline.fit(df[0].values, df[1].values)

However, I think the issue here arises because CountVectorizer() returns a sparse matrix by default, which cannot be piped to the RF classifier. CountVectorizer() does have a dtype parameter, but it sets the element type of the returned matrix, not whether that matrix is sparse or dense. That said, you usually need some form of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long.
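The answer does not name a reduction technique. As one possible sketch, TruncatedSVD accepts sparse input and produces a dense array with far fewer columns, so it can sit between the vectorizer and the forest. The choice of TruncatedSVD, n_components=2, and the toy data are all assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy data, for illustration only.
texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to noon",
]
labels = ["spam", "ham", "spam", "ham"]

pipeline = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # sparse in, dense out
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipeline.fit(texts, labels)

# Inspect what the classifier actually receives.
counts = pipeline.named_steps["countvectorizer"].transform(texts)
reduced = pipeline.named_steps["truncatedsvd"].transform(counts)
print(counts.shape, "->", reduced.shape)
```

On a real corpus n_components would be much larger (often in the low hundreds) and is worth tuning.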

Answer 5

With this pipeline, add a TfidfTransformer plus a dense conversion step:

pipelinEx = Pipeline([('bow', vectorizer),
                      ('tfidf', TfidfTransformer()),
                      ('to_dense', DenseTransformer()),
                      ('classifier', classifier)])

The first line above gets the word counts for the documents in sparse matrix form. In practice, however, you may be computing tf-idf scores with TfidfTransformer on a set of new, unseen documents. Then, by calling the fitted transformer's transform() on the vectorizer's output, you finally compute the tf-idf scores for your documents. Internally this computes the tf * idf product, where each term frequency is weighted by its idf value.
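The tf * idf product can be checked by hand. The sketch below assumes scikit-learn's default smoothed idf, ln((1 + n) / (1 + df)) + 1, and sets norm=None so that the output is the raw product rather than L2-normalized rows (normalization is the default). The corpus is invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented toy corpus, for illustration only.
docs = ["the cat sat", "the dog sat", "the bird flew"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse term counts (the tf part)

# norm=None keeps the raw tf * idf product instead of normalizing each row.
tfidf = TfidfTransformer(norm=None)
scores = tfidf.fit_transform(counts)

# Reproduce the weighting manually.
n_docs = counts.shape[0]
doc_freq = (counts.toarray() > 0).sum(axis=0)    # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf
manual = counts.toarray() * idf                  # tf * idf, column by column

print(np.allclose(scores.toarray(), manual))
```

A term that appears in every document, such as "the" here, gets the smallest idf, so it contributes the least to the score.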

5 answers

Answer 1: score 66, accepted, 2015-02-07 17:03:01
Answer 2: score 27, 2016-07-26 03:32:00
Answer 3: score 17, 2015-02-21 17:05:07
Answer 4: score 4, 2015-02-07 16:54:42
Answer 5: score -2, 2018-07-01 21:20:46