How to convert Gensim corpus to numpy array (or scipy sparse matrix) efficiently?

Suppose I have a (possibly) large corpus, about 2.5M documents with 500 features each (after running LSI on the original data with gensim). I need the corpus to train my classifiers using scikit-learn, but first I need to convert it into a numpy array. Corpus creation and classifier training are done in two separate scripts.

The problem is that my collection is expected to grow, and even at this stage I don't have enough memory (32GB on the machine) to convert it all at once (with gensim.matutils.corpus2dense). To work around this I convert one vector at a time, but that is very slow.

I have considered dumping the corpus to svmlight format and having scikit-learn load it with sklearn.datasets.load_svmlight_file. But wouldn't that mean loading everything into memory at once anyway?

Is there any way I can efficiently convert a gensim corpus to a numpy array (or scipy sparse matrix)?
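(For the sparse-matrix half of the question: gensim ships gensim.matutils.corpus2csc, which streams a corpus into a scipy CSC matrix without ever building a dense array. A minimal, dependency-free sketch of the same idea, using a hypothetical toy corpus in gensim's (term_id, value) bag-of-words format, might look like this:)

```python
import numpy as np
from scipy.sparse import csc_matrix

def bow_corpus_to_csc(corpus, num_terms):
    """Build a scipy CSC matrix (terms x documents) from a streamed
    gensim-style corpus of (term_id, value) lists, one document at a
    time, without materializing a dense array."""
    data, indices, indptr = [], [], [0]
    for doc in corpus:
        for term_id, value in doc:
            indices.append(term_id)   # row index within this column
            data.append(value)
        indptr.append(len(indices))   # one column per document
    return csc_matrix((np.array(data), np.array(indices), np.array(indptr)),
                      shape=(num_terms, len(indptr) - 1))

# toy corpus: 2 documents over a 4-term vocabulary
corpus = [[(0, 1.0), (2, 0.5)], [(1, 2.0), (3, 1.5)]]
X = bow_corpus_to_csc(corpus, num_terms=4)
print(X.shape)  # (4, 2): terms x documents, gensim's convention
```

Note that gensim (like corpus2csc) puts documents in columns, while scikit-learn expects documents as rows, so you would feed the estimator X.T (or X.T.tocsr()).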

I'm not very knowledgeable about Gensim, so I hesitate to answer, but here goes:

Your data does not fit in memory, so you will have to either stream it (basically what you are doing now) or chunk it out. It looks to me like gensim.utils.chunkize chunks it out for you, and you should be able to get the dense numpy arrays you need with as_numpy=True. You will then have to use the sklearn models that support partial_fit. These are trained iteratively, one batch at a time. The good ones are the SGD classifier and the Passive-Aggressive classifier. Make sure to pass the classes argument at least the first time you call partial_fit. I recommend reading the docs on out-of-core scaling.
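The batched partial_fit pattern above can be sketched as follows. The chunking helper and the synthetic data are illustrative stand-ins; in practice gensim.utils.chunkize(corpus, chunksize, as_numpy=True) would supply the batches of dense LSI vectors:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_chunks(X, y, chunksize):
    """Yield (features, labels) batches; stands in for streaming dense
    chunks out of gensim.utils.chunkize(..., as_numpy=True)."""
    for start in range(0, len(y), chunksize):
        yield X[start:start + chunksize], y[start:start + chunksize]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))    # stand-in for 500-dim LSI vectors
y = (X[:, 0] > 0).astype(int)      # synthetic binary labels

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])         # must be passed on the first call
for batch_X, batch_y in iter_chunks(X, y, chunksize=100):
    clf.partial_fit(batch_X, batch_y, classes=classes)

print(clf.score(X, y))
```

Each call to partial_fit updates the model in place, so memory usage is bounded by the chunk size rather than the corpus size.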
