numpy.ndarray sparse matrix to dense

Question

I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it doesn't.

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

I tried wrapping it with a SciPy csr_matrix, but that gives errors as well.

Is there any way to make random forest accept this data? (I'm not sure the dense version would actually fit in memory, but that's another matter...)

EDIT 1

The code generating the error is just this:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy') # this returns an ndarray
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)

As for the suggestion to use toarray(), ndarray has no such method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the dense array. Is there an option to run RandomForestClassifier with a sparse array?

EDIT 2

It seems that the data should have been saved using SciPy's sparse routines, as described in "Save / load scipy sparse csr_matrix in portable data format". When using NumPy's save/load, more data should have been saved (the matrix's component arrays, not just the wrapped object).

Answer 1

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)

means that your code, or something it calls, has done np.array(M), where M is a CSR sparse matrix. That just wraps the matrix in an object-dtype array.
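The wrapping can be reproduced on a small matrix. This is a minimal sketch, assuming SciPy's csr_matrix; the 2x3 matrix is a made-up stand-in for the real data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny stand-in for the real 1443899x1936774 matrix.
M = csr_matrix(np.array([[0.0, 1.0, 0.0],
                         [2.0, 0.0, 0.0]]))

# NumPy does not know how to unpack a sparse matrix, so it stores the
# matrix object itself as the single element of a 0-d object array.
wrapped = np.array(M)

print(wrapped.dtype)         # object
print(wrapped.shape)         # ()
print(wrapped.item() is M)   # the one element is the sparse matrix itself
```

Because the array holds one Python object rather than numbers, numeric code that receives it fails with errors like the one in the question.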

To use a sparse matrix in code that doesn't accept sparse matrices, you first have to convert it to dense:

 arr = M.toarray()    # dense np.ndarray (M.A is the older shorthand for the same thing)
 mat = M.todense()    # dense np.matrix

But given the dimensions and the number of nonzero elements, this conversion is likely to produce a memory error.

Answer 2

I believe you're looking for the toarray method, as shown in the documentation.

So you can do, e.g., X_dense = X_train.toarray().

Of course, your computer then crashes (unless you have the requisite 22 terabytes of RAM?).
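The 22-terabyte figure can be checked from the shape and element count shown in the question. The sketch below assumes 8-byte float64 values and, for the CSR estimate, 32-bit index arrays; the actual index width depends on how the matrix was built.

```python
rows, cols = 1_443_899, 1_936_774   # shape from the repr in the question
nnz = 141_256_894                    # stored elements from the repr

# Dense: one float64 (8 bytes) per cell.
dense_bytes = rows * cols * 8

# CSR: values (float64) + column indices + row pointers.
# Assumption: 32-bit (4-byte) indices.
csr_bytes = nnz * 8 + nnz * 4 + (rows + 1) * 4

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array needs roughly 22.4 TB, while the CSR form needs roughly 1.7 GB.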

Answer 3

It seems that the data should have been saved using SciPy's sparse routines, as described in "Save / load scipy sparse csr_matrix in portable data format". When using NumPy's save/load, more data should have been saved (the matrix's component arrays, not just the wrapped object).
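One way to save and reload a sparse matrix is scipy.sparse.save_npz and load_npz. The following is a rough sketch on random data, not the code used for the original file; the file name train.npz is hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small random sparse matrix standing in for the real training data.
rng = np.random.default_rng(0)
dense = rng.random((100, 50))
dense[dense < 0.95] = 0.0            # keep roughly 5% of the entries
X = sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")   # hypothetical file name
    sparse.save_npz(path, X)                 # writes the component arrays and the shape
    X_loaded = sparse.load_npz(path)         # comes back as a sparse matrix, not an ndarray

print(type(X_loaded).__name__, X_loaded.shape)
print((X != X_loaded).nnz == 0)              # True when the round trip is lossless
```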

RandomForestClassifier can run using data in this format. The code has been running for 1:30h now, so hopefully it will actually finish :-)
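As a small illustration that fit accepts a SciPy sparse matrix directly, here is a sketch on made-up data. The sizes, labels, and n_estimators value are arbitrary choices for the example and say nothing about how long the real data takes.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Made-up sparse features and binary labels.
rng = np.random.default_rng(0)
dense = rng.random((200, 30))
dense[dense < 0.9] = 0.0             # keep roughly 10% of the entries
X = sparse.csr_matrix(dense)
y = rng.integers(0, 2, size=200)

# The sparse matrix is passed as-is; no toarray() call is needed.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

pred = model.predict(X)
print(pred.shape)
```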

Answer 4

Since you've loaded a CSR matrix using np.load, you need to convert it from a NumPy array back to a CSR matrix. You said you tried wrapping it with csr_matrix, but csr_matrix needs the contents of the array rather than the array itself, so you need to call .all() first:

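from scipy.sparse import csr_matrix

temp = csr_matrix(X_train.all())  # .all() on the 0-d object array returns the wrapped matrix
X_train = temp.toarray()          # dense copy; only feasible if it fits in memory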
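For data as large as in the question, the dense step can be skipped and the sparse matrix used as recovered. A more direct way to unwrap the array is .item(), sketched below on a tiny matrix. This assumes the file was written with np.save on a sparse matrix; recent NumPy versions need allow_pickle=True to load such a file.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0.0, 1.0],
                                [2.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npy")
    np.save(path, M)                            # pickles M inside a 0-d object array
    loaded = np.load(path, allow_pickle=True)   # pickled objects must be allowed explicitly

print(loaded.shape, loaded.dtype)               # () object

recovered = loaded.item()                       # the sparse matrix itself, still sparse
print(sparse.issparse(recovered), recovered.shape)
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.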

4 answers

Answer 1: score 8, 2019-04-11 18:32:47

Answer 2: score 1, 2019-04-11 16:54:49

Answer 3: score 0, accepted, 2019-04-14 09:24:49

Answer 4: score 0, 2021-01-14 20:21:09
