MemoryError while converting sparse matrix to dense matrix? (numpy, scikit)

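# Imports are assumed from the names used below (older scikit-learn API)
import numpy as np
import sklearn.linear_model as lm
from sklearn import cross_validation
from sklearn.ensemble import AdaBoostClassifier

lr = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None)

# Boost the logistic regression with discrete AdaBoost (SAMME)
rd = AdaBoostClassifier(base_estimator=lr,
                        learning_rate=1,
                        n_estimators=20,
                        algorithm="SAMME")
# Here, I am deleting unnecessary objects
# X.shape is (7395, 412605)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))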

When I run this, I get this error:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Then I changed my code like this:

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X.toarray(), y, cv=20, scoring='roc_auc'))

Now I get the following exception:

  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 559, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 235, in toarray
    B = self._process_toarray_args(order, out)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 628, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Any suggestions to solve the issue?

MemoryError means that there isn't enough RAM available on your system to allocate the matrix. Why? Well, a 7395 x 412605 matrix has 3,051,213,975 elements. If they're in the default float64 (usually double in C) datatype, at 8 bytes each, that's 22.7GB. If you convert to lower-precision float32s (usually float in C), it'd be 11.4GB; maybe that's manageable on your machine. It'll still be really slow, though.
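To check the arithmetic above before calling X.toarray(), here is a minimal sketch that estimates the dense footprint from the shape and the dtype's item size. The helper name dense_nbytes is hypothetical (it is not part of numpy or scipy), and the sizes are reported in binary gigabytes (GiB, 2**30 bytes), which is the unit that reproduces the figures quoted above.

```python
import numpy as np

def dense_nbytes(shape, dtype=np.float64):
    """Bytes needed for a dense array of the given shape and dtype.

    Hypothetical helper for illustration; not a numpy/scipy function.
    """
    n_elements = 1
    for dim in shape:
        n_elements *= int(dim)  # plain Python ints, so no overflow
    return n_elements * np.dtype(dtype).itemsize

shape = (7395, 412605)  # X.shape from the question
for dtype in (np.float64, np.float32):
    nbytes = dense_nbytes(shape, dtype)
    print("%s: %.1f GiB" % (np.dtype(dtype).name, nbytes / 2.0**30))
# float64: 22.7 GiB
# float32: 11.4 GiB
```

Comparing that number with the RAM actually free on the machine tells you up front whether the conversion has any chance of succeeding.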

It seems that AdaBoostClassifier doesn't support sparse inputs (as you can see in the code here). I don't know offhand whether the algorithm actually needs a dense representation or whether the implementation just assumes one.
