Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required
I'm finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). It's Python 3.4.2:
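import numpy
from pandas import DataFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

df = DataFrame.from_records(train)
test = [blah1, blah2, blah3]
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])
pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
predicted = pipeline.predict(test)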
When I run it, I get:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
This is for the line pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1])).
I've experimented a lot with solutions through numpy, scipy, and so forth, but I still don't know how to fix it. And yes, similar questions have come up before, but not inside a pipeline. Where is it that I have to apply toarray or todense?
Unfortunately those two are incompatible. A CountVectorizer produces a sparse matrix and the RandomForestClassifier requires a dense matrix. It is possible to convert using X.todense(). Doing this will substantially increase your memory footprint, because every zero entry is then stored explicitly.
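To make the memory point concrete, here is a minimal sketch comparing how many values each representation stores. The three documents are made up for illustration, and it uses toarray(), the conversion the error message itself suggests.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example documents, purely for illustration.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

counts = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix
dense = counts.toarray()                        # plain numpy.ndarray

# The sparse matrix stores only the non-zero entries,
# while the dense array stores every cell, zeros included.
stored_sparse = counts.nnz
stored_dense = dense.size
print(counts.shape, stored_sparse, stored_dense)
```

On a real corpus the vocabulary usually runs to many thousands of columns while each document uses only a few of them, so the gap between the two numbers grows quickly.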
Below is sample code to do this, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, which allows you to call .todense() in a pipeline stage.
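from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    # Stateless step: nothing is learned during fit.
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # Convert the sparse matrix to a dense one.
        return X.todense()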
Once you have your DenseTransformer, you are able to add it as a pipeline step.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier())
])
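For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. It is a variant rather than the exact code above: the class name DenseArrayTransformer and the toy data are invented here, and it calls toarray() instead of todense(), on the assumption that newer scikit-learn releases may reject the numpy.matrix that todense() returns.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseArrayTransformer(BaseEstimator, TransformerMixin):
    """Convert a SciPy sparse matrix into a dense numpy.ndarray.

    Hypothetical name, chosen for this sketch only.
    """

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.toarray()


# Toy data, invented for this sketch.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("to_dense", DenseArrayTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(texts, labels)
predicted = pipeline.predict(["great movie", "awful movie"])
print(predicted)
```

Inheriting from BaseEstimator as well as TransformerMixin is optional for this to work, but it gives the step get_params and set_params, which tools such as grid search rely on.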
Another option would be to use a classifier meant for sparse data like LinearSVC.
from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])
The tersest solution would be to use a FunctionTransformer to convert to dense: this automatically implements the fit, transform and fit_transform methods, as in David's answer. Additionally, if I don't need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function to enable a more minimalist language for describing the model:
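from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
    RandomForestClassifier()
)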
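A runnable variant of this approach, offered as a sketch rather than the definitive form: it swaps the lambda for a named function (to_dense_array is a hypothetical helper name), which keeps the pipeline picklable, and it returns a numpy.ndarray via toarray(). The toy data is invented. It also shows the step names that make_pipeline generates from the class names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense_array(X):
    """Return a sparse matrix as a dense numpy.ndarray (hypothetical helper)."""
    return X.toarray()


# Toy data, invented for this sketch.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

model = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense_array, accept_sparse=True),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(texts, labels)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in model.steps]
predicted = model.predict(["great film"])
print(step_names, predicted)
```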
Random forests in 0.16-dev now accept sparse data.
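If you are on a version where that holds, the original two-step pipeline should work without any dense conversion. A minimal sketch, with toy data invented for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

# No dense-conversion step: the forest receives the sparse matrix directly.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(texts, labels)
predicted = pipeline.predict(["great movie"])
print(predicted)
```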
You can change a pandas Series to an array using the .values attribute.
pipeline.fit(df[0].values, df[1].values)
However, I think the issue here happens because CountVectorizer() returns a sparse matrix by default, which cannot be piped to the RF classifier. CountVectorizer() does have a dtype parameter, but it sets the element type of the returned matrix, not whether that matrix is sparse or dense. That said, you usually need to do some sort of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long.
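One way to do that reduction is sketched below. TruncatedSVD is my own choice here, not something named above: it accepts sparse input and emits a small dense array, so it doubles as the dense-conversion step. The toy data and n_components=2 are illustrative assumptions only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy data, invented for this sketch.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

model = make_pipeline(
    CountVectorizer(),                            # sparse bag-of-words counts
    TruncatedSVD(n_components=2, random_state=0), # sparse in, dense out
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(texts, labels)

# Inspect the reduced representation that reaches the classifier.
counts = model.named_steps["countvectorizer"].transform(texts)
reduced = model.named_steps["truncatedsvd"].transform(counts)
print(counts.shape, reduced.shape)
```

On a real corpus you would pick a much larger number of components, typically in the low hundreds, and tune it by validation.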
With this pipeline, add TfidfTransformer plus a step that converts to dense:
pipelinEx = Pipeline([('bow', vectorizer),
                      ('tfidf', TfidfTransformer()),
                      ('to_dense', DenseTransformer()),
                      ('classifier', classifier)])
The first line above gets the word counts for the documents in sparse matrix form. In practice, however, you may be computing tf-idf scores with TfidfTransformer on a set of new, unseen documents. Then, by calling transform on the TfidfTransformer with the vectorizer's output, you finally compute the tf-idf scores for your docs. Internally this computes the tf * idf product, where each term frequency is weighted by its idf value.
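Outside a pipeline, those steps can be sketched roughly as follows. The documents and variable names are invented for illustration. The idf weights are learned from the training counts and then reused on the unseen documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents, purely for illustration.
train_docs = ["good movie", "great film", "bad movie", "awful film"]
new_docs = ["great movie", "awful awful film"]

# Step 1: word counts in sparse matrix form.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)

# Step 2: learn the idf weights from the training counts.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(train_counts)

# Step 3: count words in the unseen documents, then weight tf by idf.
new_counts = vectorizer.transform(new_docs)
new_tfidf = tfidf_transformer.transform(new_counts)

# With the default norm='l2', each non-empty row has unit length.
row_norms = np.sqrt(np.asarray(new_tfidf.multiply(new_tfidf).sum(axis=1)).ravel())
print(new_tfidf.shape, row_norms)
```

Inside a Pipeline the same sequence happens automatically: fit learns the vocabulary and idf weights, and predict only calls transform on each step.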