
numpy.ndarray sparse matrix to dense

I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray and happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it doesn't.

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

I tried wrapping it with a SciPy csr_matrix, but that gives errors as well.

Is there any way to make random forest accept this data? (I'm not sure the dense version would actually fit in memory, but that's another matter...)

EDIT 1

The code generating the error is just this:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy') # this returns an ndarray
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)

As for the suggestion to use toarray(), ndarray does not have such a method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the dense array. Is there an option to run RandomForestClassifier with a sparse array?

EDIT 2

It seems that the data should have been saved using SciPy's sparse routines, as described in "Save / load scipy sparse csr_matrix in portable data format". When using NumPy's save/load, more data should have been saved.

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)

means that your code, or something it calls, has done np.array(M) where M is a CSR sparse matrix. That just wraps the matrix in an object dtype array.

To use a sparse matrix in code that doesn't take sparse matrices, you have to first convert it to dense:
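The wrapping described above can be reproduced on a small matrix. This is a minimal sketch, assuming a 3x3 identity matrix as a stand-in for the real data; the names `M` and `wrapped` are only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small stand-in for the real 1443899 x 1936774 matrix.
M = csr_matrix(np.eye(3))

# NumPy does not know how to expand a sparse matrix, so it stores
# the whole object as the single element of a zero-dimensional array.
wrapped = np.array(M)

print(wrapped.shape)                # the shape of the wrapper, not of M
print(wrapped.dtype)                # object dtype
print(hasattr(wrapped, "toarray"))  # sparse methods are not available on the wrapper
```

The wrapper is an ordinary ndarray, which is why neither todense nor toarray can be called on it directly.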

 arr = M.toarray()    # or M.A same thing
 mat = M.todense()    # to make a np.matrix

But given the dimensions and the number of nonzero elements, this conversion will most likely produce a memory error.

I believe you're looking for the toarray method, as shown in the documentation.

So you can do, e.g., X_dense = X_train.toarray().

Of course, your computer will then crash (unless you have the requisite 22 terabytes of RAM?).
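A rough back-of-the-envelope estimate shows why, using the shape and stored-element count from the repr above. The CSR figure assumes 4-byte (int32) index arrays, which is an assumption about how the matrix is stored rather than something stated in the question.

```python
# Figures taken from the repr of X_train shown in the question.
n_rows, n_cols = 1_443_899, 1_936_774
nnz = 141_256_894

# A dense float64 array needs 8 bytes for every cell, zero or not.
dense_bytes = n_rows * n_cols * 8

# CSR keeps one float64 value and one column index per stored element,
# plus a row-pointer array of length n_rows + 1 (int32 indices assumed).
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array comes to roughly 22 TB, while the CSR representation stays under 2 GB.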

It seems that the data should have been saved using SciPy's sparse routines, as described in "Save / load scipy sparse csr_matrix in portable data format". When using NumPy's save/load, more data should have been saved.

RandomForestClassifier can run using data in this format. The code has been running for an hour and a half now, so hopefully it will actually finish :-)

Since you've loaded a CSR matrix using np.load, you need to convert it from a NumPy array back to a CSR matrix. You said you tried wrapping it with csr_matrix, but the array itself is not the matrix; you need to call .all() to get at its contents:
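A minimal sketch of that workflow, assuming a small random matrix and random labels in place of the real training set. The file name, sizes, and `n_estimators` value are placeholders chosen for illustration, not values from the original post.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Small random CSR matrix and labels standing in for the real data.
X = sparse.random(200, 50, density=0.05, format="csr",
                  random_state=0, dtype=np.float64)
y = np.random.default_rng(0).integers(0, 2, size=X.shape[0])

# Save and reload with SciPy's sparse format instead of np.save / np.load.
path = os.path.join(tempfile.mkdtemp(), "train.npz")
sparse.save_npz(path, X)
X_loaded = sparse.load_npz(path)

# The sparse matrix is passed to fit as-is, without calling toarray().
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_loaded, y)
predictions = model.predict(X_loaded)
```

Loading through load_npz returns a real sparse matrix rather than an object array, so it can be handed to fit without a dense conversion.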

temp = csr_matrix(X_train.all())
X_train = temp.toarray()
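If the dense copy does not fit in memory, the unwrapped matrix can be kept sparse instead. The sketch below uses `.item()` to pull the single element out of the zero-dimensional object array; this is a substitute for the `.all()` call above, not the answerer's exact method, and the 3x3 matrix is only a stand-in for the real data.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Recreate the situation: a CSR matrix wrapped in a 0-d object array.
M = csr_matrix(np.eye(3))
wrapped = np.array(M)

# .item() returns the single Python object stored in the 0-d array.
recovered = wrapped.item()

print(issparse(recovered))  # still a sparse matrix, no dense copy was made
print(recovered.shape)      # the shape of the matrix itself
```

The recovered object can then be passed to code that accepts sparse input, skipping the toarray() step entirely.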


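Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you reproduce them, please cite this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.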
