
Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required

I'm finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). This is Python 3.4.2:

df = pd.DataFrame.from_records(train)

test = [blah1, blah2, blah3]

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])

pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
predicted = pipeline.predict(test)

When I run it, I get:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

This is for the line `pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))`.

I've experimented a lot with solutions through numpy, scipy, and so forth, but I still don't know how to fix it. And yes, similar questions have come up before, but not inside a pipeline. Where is it that I have to apply `toarray` or `todense`?

Unfortunately those two are incompatible. A CountVectorizer produces a sparse matrix and the RandomForestClassifier requires a dense matrix. It is possible to convert using `X.todense()`. Doing this will substantially increase your memory footprint.

Below is sample code to do this, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, which allows you to call `.todense()` in a pipeline stage.

from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()

Once you have your DenseTransformer, you can add it as a pipeline step.

pipeline = Pipeline([
     ('vectorizer', CountVectorizer()), 
     ('to_dense', DenseTransformer()), 
     ('classifier', RandomForestClassifier())
])
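As a quick end-to-end check, here is a minimal sketch of the same pipeline on made-up toy documents and labels (the corpus, `n_estimators`, and `random_state` are all arbitrary). It uses `.toarray()` in the transformer, which returns a plain `ndarray` rather than the `np.matrix` that `.todense()` returns:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Converts a scipy sparse matrix to a dense ndarray inside a Pipeline."""
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # .toarray() yields an ndarray; .todense() would yield np.matrix
        return X.toarray()

# Toy corpus; any list of strings with matching labels works the same way.
docs = ["spam spam buy now", "hello how are you",
        "buy cheap spam now", "see you tomorrow"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(docs, labels)
pred = pipeline.predict(["buy spam now"])
```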

Another option would be to use a classifier meant for sparse data, like LinearSVC.

from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])
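LinearSVC consumes the sparse matrix from CountVectorizer directly, so no conversion step is needed. A minimal sketch with toy documents (data and labels made up for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["spam spam buy now", "hello how are you",
        "buy cheap spam now", "see you tomorrow"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([('vectorizer', CountVectorizer()),
                     ('classifier', LinearSVC())])
pipeline.fit(docs, labels)   # no "dense data is required" error here
pred = pipeline.predict(["buy spam now"])
```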

The most terse solution would be to use a FunctionTransformer to convert to dense: this will automatically implement the `fit`, `transform`, and `fit_transform` methods as in David's answer. Additionally, if I don't need special names for my pipeline steps, I like to use the `sklearn.pipeline.make_pipeline` convenience function to enable a more minimalist language for describing the model:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
     CountVectorizer(),
     FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
     RandomForestClassifier()
)
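For completeness, a runnable version of the same idea on toy data (corpus and hyperparameters are arbitrary; I use `.toarray()` in the lambda since downstream estimators expect an `ndarray` rather than a `matrix`):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

docs = ["spam spam buy now", "hello how are you",
        "buy cheap spam now", "see you tomorrow"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(
    CountVectorizer(),
    # convert the sparse count matrix to a dense ndarray mid-pipeline
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipeline.fit(docs, labels)
pred = pipeline.predict(["buy spam now"])
```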

Random forests in 0.16-dev now accept sparse data.
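So on scikit-learn 0.16 and later the conversion step can simply be dropped; the forest fits the sparse matrix directly. A sketch on toy data (documents and hyperparameters made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

docs = ["spam spam buy now", "hello how are you",
        "buy cheap spam now", "see you tomorrow"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)   # scipy.sparse CSR matrix
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels)                          # accepts the sparse matrix directly
pred = clf.predict(X)
```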

You can change a pandas Series to an array using the `.values` attribute.

pipeline.fit(df[0].values, df[1].values)

However, I think the issue here happens because CountVectorizer() returns a sparse matrix by default, which cannot be piped to the RF classifier. CountVectorizer() does have a `dtype` parameter to specify the type of array returned. That said, usually you need to do some sort of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long.
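One common way to do that reduction is TruncatedSVD (latent semantic analysis): it accepts sparse input and emits a dense low-dimensional array, so it doubles as the dense-conversion step. A sketch on toy data (the component count here is arbitrary and would normally be tuned):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

docs = ["spam spam buy now", "hello how are you",
        "buy cheap spam now", "see you tomorrow"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('svd', TruncatedSVD(n_components=2)),   # dense output, far fewer columns
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(docs, labels)
pred = pipeline.predict(["buy spam now"])
```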

With this pipeline, add a TfidfTransformer:

        pipelinEx = Pipeline([('bow', vectorizer),
                              ('tfidf', TfidfTransformer()),
                              ('to_dense', DenseTransformer()),
                              ('classifier', classifier)])

The first step above gets the word counts for the documents in sparse matrix form. In practice, you may be computing tf-idf scores with TfidfTransformer on a set of new, unseen documents; calling the fitted transformer's `transform` method on their count matrix computes the tf-idf scores for those docs. Internally this computes the tf * idf multiplication, where each term frequency is weighted by its idf value.
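A self-contained version of that pipeline on toy data (the undefined `vectorizer`, `classifier`, and `DenseTransformer` are stood in for here by a CountVectorizer, a random forest, and a `FunctionTransformer`; all choices are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

docs = ["spam spam buy now", "hello how are you",
        "buy cheap spam now", "see you tomorrow"]
labels = [1, 0, 1, 0]

pipelinEx = Pipeline([
    ('bow', CountVectorizer()),      # raw term counts, sparse
    ('tfidf', TfidfTransformer()),   # reweight counts by inverse document frequency
    ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipelinEx.fit(docs, labels)
pred = pipelinEx.predict(["buy spam now"])
```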
