numpy.ndarray sparse matrix to dense
I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.
I expected the object to have a todense method, but it doesn't.
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
with 141256894 stored elements in Compressed Sparse Row format>,
dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>
I tried wrapping it with a SciPy csr_matrix, but that gives errors as well.
Is there any way to make random forest accept this data? (I'm not sure the dense version would actually fit in memory, but that's another matter.)
EDIT 1
The code generating the error is just this:
X_train = np.load('train.npy') # this returns a ndarray
train_gt = pd.read_csv('train_gt.csv')
model = RandomForestClassifier()
model.fit(X_train, train_gt.target)
As for the suggestion to use toarray(), ndarray does not have such a method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the array. Is there an option to run RandomForestClassifier with a sparse array?
EDIT 2
It seems that the data should have been saved using SciPy's sparse format, as described in "Save / load scipy sparse csr_matrix in portable data format". When using NumPy's save/load, more data should have been saved.
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
with 141256894 stored elements in Compressed Sparse Row format>,
dtype=object)
means that your code, or something it calls, has done np.array(M) where M is a csr sparse matrix. That call does not convert the matrix; it just wraps it in an object dtype array.
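The wrapping described above can be reproduced on a tiny matrix. This is a minimal sketch, assuming SciPy's csr_matrix and a made-up 2x3 example; the name `wrapped` is hypothetical and only there for illustration.

```python
import numpy as np
from scipy import sparse

# A small CSR matrix standing in for the real training data (illustrative values).
M = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0],
                                [2.0, 0.0, 0.0]]))

# np.array() does not densify a sparse matrix; it stores the matrix
# object itself as the single element of a 0-dimensional object array.
wrapped = np.array(M)

print(type(wrapped))                 # a plain numpy.ndarray
print(wrapped.dtype, wrapped.shape)  # object dtype, shape ()
print(hasattr(wrapped, "toarray"))   # the ndarray wrapper has no toarray method
```

This is why calling toarray() or todense() directly on the loaded X_train fails: those methods live on the sparse matrix inside the array, not on the array.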
To use a sparse matrix in code that doesn't take sparse matrices, you have to first convert it to dense:
arr = M.toarray() # or M.A same thing
mat = M.todense() # to make a np.matrix
But given the dimensions and number of nonzero elements, it is likely that this conversion will produce a memory error.
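A rough back-of-the-envelope estimate shows why. The sketch below uses the shape, dtype, and stored-element count printed in the question; the CSR figure assumes 32-bit index arrays, which is an assumption about how the matrix is stored rather than something the question states.

```python
# Figures taken from the repr of X_train in the question.
n_rows, n_cols = 1443899, 1936774
nnz = 141256894
float64_bytes = 8

# Dense storage: one float64 per cell, zeros included.
dense_bytes = n_rows * n_cols * float64_bytes

# CSR storage: values + column indices + row pointers.
# Assumption: indices and indptr are int32 (4 bytes each).
int32_bytes = 4
csr_bytes = nnz * float64_bytes + nnz * int32_bytes + (n_rows + 1) * int32_bytes

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"csr:   {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array needs on the order of tens of terabytes, while the sparse representation fits in a couple of gigabytes.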
I believe you're looking for the toarray method, as shown in the documentation.
So you can do, e.g., X_dense = X_train.toarray().
Of course, your computer then crashes (unless you have the requisite 22 terabytes of RAM?).
It seems that the data should have been saved using SciPy's sparse format, as described in "Save / load scipy sparse csr_matrix in portable data format". When using NumPy's save/load, more data should have been saved.
RandomForestClassifier can run using data in this format. The code has been running for 1:30h now, so hopefully it will actually finish :-)
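One way to do this, sketched here as an assumption rather than what the poster actually ran, is scipy.sparse.save_npz and load_npz, followed by fitting directly on the sparse matrix. The data, file name, and forest settings below are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Small random stand-in for the real training data (illustrative only).
rng = np.random.default_rng(0)
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)
y = rng.integers(0, 2, size=200)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")  # hypothetical file name
    sparse.save_npz(path, X)               # stores the sparse structure itself
    X_loaded = sparse.load_npz(path)       # comes back as a sparse matrix

# fit accepts the sparse matrix directly; no dense conversion is needed.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_loaded, y)
predictions = model.predict(X_loaded)
```

Unlike the np.load result in the question, X_loaded is a real sparse matrix, so no unwrapping step is needed before calling fit.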
Since you've loaded a csr matrix using np.load, you need to convert it from a NumPy array back to a csr matrix. You said you tried wrapping it with csr_matrix, but the array itself is not the matrix; the matrix is the array's single element, so you need to call .all() to get it out:
temp = csr_matrix(X_train.all())
X_train = temp.toarray()
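If the file has already been written with np.save, the matrix can also be pulled out of the loaded array with ndarray.item(), which returns the single element of a 0-dimensional array. The following is a hedged sketch on a tiny made-up matrix; the file name is hypothetical, and allow_pickle=True is needed on recent NumPy versions because the array holds a pickled Python object.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny stand-in for the real matrix (illustrative values).
M = sparse.csr_matrix(np.array([[0.0, 1.0],
                                [2.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npy")  # hypothetical file name
    np.save(path, M)                       # pickles M inside a 0-d object array
    X_train = np.load(path, allow_pickle=True)

# item() returns the one object stored in the 0-d array: the sparse matrix.
X_sparse = X_train.item()
print(type(X_sparse), X_sparse.shape, X_sparse.nnz)
```

X_sparse can then be kept sparse; calling toarray() on it is only sensible when the dense result fits in memory.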