Train `sklearn` ML model with scipy sparse matrix and numpy array
To explain my use case a bit more: `A` is a sparse matrix with tf-idf values and `B` is an array with some additional features of my data.
I have already split the data into training and test sets, so `A` and `B` in my example cover only the training set. I (want to) do the same for the test set after this code.
I want to concatenate these matrices/arrays because I then want to pass them to a `sklearn` ML model to train it, and I do not think that I can pass them separately.
So I tried to do this:
C = np.concatenate((A, B.T), axis=1)
where `A` is a `<class 'scipy.sparse.csr.csr_matrix'>` and `B` is a `<class 'numpy.ndarray'>`.
However, when I try to do this I get the following error:
ValueError: zero-dimensional arrays cannot be concatenated
Also, I do not think that using `np.concatenate` on a numpy array and a sparse matrix is a very good idea in my case, because I cannot convert `A` to a dense array (it is too big), and if I convert `B` to a sparse array I will lose (or actually not?) information.
What is the best way to pass to a `sklearn` ML model a sparse and a fully dense array concatenated by rows?
You can use `hstack` from scipy. `hstack` will convert both matrices to scipy `coo_matrix`, merge them and return a `coo_matrix` by default.
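As a minimal sketch of the behaviour described above, the snippet below stacks a small sparse matrix and a dense array side by side. The toy values and shapes are assumptions made up for illustration; the dense array is assumed to already have one row per sample, matching the sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins (assumptions): 2 samples, 3 tf-idf columns, 2 extra feature columns.
A = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.3, 0.0, 0.0]]))
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# No format argument is passed, so the result uses hstack's default format.
C = hstack((A, B))

print(type(C), C.format)
print(C.shape)      # rows are unchanged, columns are added together
print(C.toarray())  # densify only because the toy data is tiny
```

If `B` is stored with features along the rows, as the `B.T` in the question suggests, pass `B.T` instead so that the row counts match.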
No information is lost when converting a dense array to sparse. Sparse matrices are just a compact data storage format. Also, unless you specify a value for the `dtype` argument of `hstack`, everything is upcast. So there is no possibility of data loss there either.
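A small check of both claims, with made-up toy values: a dense array survives a round trip through a sparse format unchanged, and stacking an integer array with a float matrix yields a float result when no `dtype` is given.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Dense integer features (toy values, assumed for illustration).
B = np.array([[1, 0],
              [0, 7]], dtype=np.int32)

# Dense -> sparse -> dense gives back exactly the same values.
B_sparse = csr_matrix(B)
round_trip_ok = np.array_equal(B_sparse.toarray(), B)
print(round_trip_ok)

# Float sparse matrix standing in for the tf-idf part.
A = csr_matrix(np.array([[0.25, 0.0],
                         [0.0, 0.5]], dtype=np.float64))

# With no dtype argument, int32 and float64 are combined into a common dtype.
X = hstack((A, B))
print(X.dtype)
```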
Further, if you plan to use Logistic Regression from sklearn, sparse matrices should be in CSR format for the `fit` method to work.
The following code should work for your use case:
from scipy.sparse import hstack
X = hstack((A, B), format='csr')
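For a fuller picture, here is a hedged end-to-end sketch that applies the same stacking to both the training and the test set and then fits a model. All data, the names `A_train`, `B_train`, `A_test`, `B_test` and `y_train`, and the default `LogisticRegression()` settings are assumptions for illustration, not part of the original question.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression

# Toy training data (assumed): 4 samples, 3 tf-idf columns, 2 extra features.
A_train = csr_matrix(np.array([
    [0.0, 0.7, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 0.9],
    [0.4, 0.0, 0.3],
]))
B_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y_train = np.array([0, 1, 0, 1])

# Toy test data with the same column layout.
A_test = csr_matrix(np.array([[0.0, 0.6, 0.0], [0.2, 0.0, 0.1]]))
B_test = np.array([[1.0, 0.0], [0.0, 1.0]])

# Stack sparse and dense parts column-wise, asking for CSR output directly.
X_train = hstack((A_train, B_train), format='csr')
X_test = hstack((A_test, B_test), format='csr')

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(X_train.shape, X_test.shape, predictions.shape)
```

The column order must be the same for training and test data, so the sparse and dense parts should be stacked in the same order in both calls.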