MemoryError in scikit-learn even though the matrix is sparse

Question

I am running (or trying to run) a script to classify documents. The code that throws the error is:

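import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# df and neg_preprocess are defined earlier in the script (not shown)
X = df['text'].values
Y = np.asarray(df['label'], dtype=np.dtype(int))

text_clf = Pipeline([('vect', HashingVectorizer(ngram_range=(1,3), preprocessor=neg_preprocess, n_features=10000000)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='log', n_jobs=-1, penalty='elasticnet'))])

text_clf.fit(X, Y)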

To get a sense of what the HashingVectorizer produces:

<375175x10000000 sparse matrix of type '<type 'numpy.float64'>'
    with 56324335 stored elements in Compressed Sparse Row format>

The full error and traceback are:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-15-09ad11dfb82b> in <module>()
  7                      ('clf', SGDClassifier(loss='log', n_jobs=-1, penalty='elasticnet'))])
  8 
----> 9 text_clf.fit(X, Y)
 10 
 11 print datetime.now()-startTime

D:\Users\DB\Anaconda\lib\site-packages\sklearn\pipeline.pyc in fit(self, X, y, **fit_params)
129         """
130         Xt, fit_params = self._pre_transform(X, y, **fit_params)
--> 131         self.steps[-1][-1].fit(Xt, y, **fit_params)
132         return self
133 

D:\Users\DB\Anaconda\lib\site-packages\sklearn\linear_model\stochastic_gradient.pyc in fit(self, X, y, coef_init, intercept_init, class_weight, sample_weight)
517                          coef_init=coef_init, intercept_init=intercept_init,
518                          class_weight=class_weight,
--> 519                          sample_weight=sample_weight)
520 
521 

D:\Users\DB\Anaconda\lib\site-packages\sklearn\linear_model\stochastic_gradient.pyc in _fit(self, X, y, alpha, C, loss, learning_rate, coef_init, intercept_init, class_weight, sample_weight)
416 
417         self._partial_fit(X, y, alpha, C, loss, learning_rate, self.n_iter,
--> 418                           classes, sample_weight, coef_init, intercept_init)
419 
420         # fitting is over, we can now transform coef_ to fortran order

D:\Users\DB\Anaconda\lib\site-packages\sklearn\linear_model\stochastic_gradient.pyc in _partial_fit(self, X, y, alpha, C, loss, learning_rate, n_iter, classes, sample_weight, coef_init, intercept_init)
359         if self.coef_ is None or coef_init is not None:
360             self._allocate_parameter_mem(n_classes, n_features,
--> 361                                          coef_init, intercept_init)
362 
363         self.loss_function = self._get_loss_function(loss)

D:\Users\DB\Anaconda\lib\site-packages\sklearn\linear_model\stochastic_gradient.pyc in _allocate_parameter_mem(self, n_classes, n_features, coef_init, intercept_init)
187             else:
188                 self.coef_ = np.zeros((n_classes, n_features),
--> 189                                       dtype=np.float64, order="C")
190 
191             # allocate intercept_ for multi-class

MemoryError:

The size of the feature vector for the whole training set is pretty significant, but each document is quite short (~200 words) and has a small set of features. I would imagine that a sparse matrix would not have trouble handling the data, but perhaps I am completely wrong? I monitored the resource consumption on my computer and it had plenty of RAM left when it failed.

Is there something in the code that is causing this error? I thought the TfidfTransformer() might be to blame because it is stateful, but I removed it from the pipeline and still got the same error. If it's a problem with the feature vector size, surely there's a way to deal with large amounts of data...

I am using IPython notebook and the Python 2.7.6 Anaconda distribution. If more information is needed, please let me know.

Thanks in advance.

Answer 1

I don't think it is the vectoriser, as the traceback shows it fails on the following line:

self.coef_ = np.zeros((n_classes, n_features), dtype=np.float64, order="C")

This allocates a dense numpy array, which uses a lot of memory. Its shape is (n_classes, n_features), and n_features is the same n_features that you passed to the vectoriser as a parameter: 10M! How many classes do you have in your dataset?
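
To put a number on that allocation, here is a minimal sketch of the arithmetic. The class counts in the loop are made up for illustration, since the question does not say how many classes there are; the only figures taken from the question are n_features=10000000 and the float64 dtype (8 bytes per value).

```python
def dense_coef_bytes(n_classes, n_features, itemsize=8):
    """Bytes needed for a dense (n_classes, n_features) float64 array."""
    return n_classes * n_features * itemsize

n_features = 10000000  # value passed to HashingVectorizer in the question

# Hypothetical class counts, only to show how the size scales.
for n_classes in (3, 10, 100):
    gib = dense_coef_bytes(n_classes, n_features) / 2.0 ** 30
    print("%d classes -> %.2f GiB" % (n_classes, gib))
```

Each class costs 80,000,000 bytes (about 76 MiB), and the allocation has to succeed as one contiguous block, so the total grows quickly with the number of classes no matter how sparse X is.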

A quick and easy solution is to reduce the value of n_features. Alternatively, you can try other classifiers that do not allocate a dense array of this size. I don't know off the top of my head which of sklearn's classifiers do that, though.
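
As a rough sketch of the first suggestion, the pipeline below is the one from the question with a smaller n_features. The value 2 ** 18 and the toy documents and labels are placeholders I made up, not a recommendation. The custom preprocessor is left out because its code is not shown, and the loss argument is left at its default because its accepted spelling differs between scikit-learn versions.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins for df['text'] and df['label'] (made up for illustration).
docs = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]

n_features = 2 ** 18  # 262,144 hash buckets instead of 10,000,000

text_clf = Pipeline([
    ('vect', HashingVectorizer(ngram_range=(1, 3), n_features=n_features)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(penalty='elasticnet', random_state=0)),
])
text_clf.fit(docs, labels)

# The dense coefficient array now has n_features columns per row.
coef = text_clf.named_steps['clf'].coef_
print(coef.shape, coef.nbytes)
```

Fewer hash buckets means more collisions between n-grams, so it is worth checking accuracy on held-out data as n_features is lowered.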

PS: This question shows how to determine the actual in-memory size of a matrix. You can use it to verify that it is not the vectoriser or the tfidf transformer that is failing.
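
In case it helps, here is one way to do that check for a CSR matrix like the one in the question. This is my own sketch, not necessarily the method from the linked question: it sums the three arrays that back a scipy CSR matrix. The helper name csr_nbytes and the small random matrix are made up for the example.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(X):
    """Bytes held by the three arrays backing a CSR matrix."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# Small random CSR matrix as a stand-in for the vectoriser output.
X = sparse.random(1000, 100000, density=1e-4, format='csr',
                  dtype=np.float64, random_state=0)

print("stored elements: %d" % X.nnz)
print("approx. size: %.3f MiB" % (csr_nbytes(X) / 2.0 ** 20))
```

Applied to the matrix in the question (56,324,335 stored float64 values, and assuming 32-bit indices), this comes to roughly 680 MB, which fits with the vectoriser step succeeding and the failure coming later.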
