
Train `sklearn` ML model with scipy sparse matrix and numpy array

To explain my use case a bit more: A is a sparse matrix of tf-idf values, and B is an array with some additional features of my data.

I have already split the data into training and test sets, so A and B in my example cover only the training set. I want to do the same for the test set after this code.

I want to concatenate these matrices/arrays so that I can pass the result to a sklearn ML model for training, since I do not think I can pass them separately.

So I tried to do this:

C = np.concatenate((A, B.T), axis=1)

where A is a <class 'scipy.sparse.csr.csr_matrix'> and B is a <class 'numpy.ndarray'>.

However, when I try this, I get the following error:

ValueError: zero-dimensional arrays cannot be concatenated
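
A likely explanation for this error, sketched on toy data (the small A and B below are made-up stand-ins, not the real matrices): NumPy does not know how to unpack a scipy sparse matrix, so it wraps the whole object in a zero-dimensional object array before trying to concatenate.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for the real data (hypothetical values).
A = csr_matrix(np.array([[0.0, 0.5], [0.3, 0.0], [0.0, 0.0]]))
B = np.array([[1.0], [2.0], [3.0]])

# NumPy cannot see inside the sparse matrix, so it wraps the whole object.
wrapped = np.asarray(A)
wrapped_ndim = wrapped.ndim      # number of dimensions NumPy sees for A
wrapped_dtype = wrapped.dtype    # generic Python-object dtype

# Reproduce the failing call and record whether it raised.
try:
    np.concatenate((A, B), axis=1)
    concat_failed = False
except ValueError:
    concat_failed = True
```

Here `wrapped_ndim` comes out as 0, which matches the "zero-dimensional arrays" wording in the error message.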

Also, I do not think that using `np.concatenate` on a numpy array and a sparse matrix is a good idea in my case, because

  1. it is basically impossible to convert my sparse array A to a dense array because it is too big
  2. I will lose information (or maybe not?) if I convert my fully dense array B to a sparse array

What is the best way to pass a sparse array and a fully dense array, concatenated by rows, to an sklearn ML model?

  1. You can use hstack from scipy. hstack converts both matrices to scipy coo_matrix, merges them, and returns a coo_matrix by default.

  2. No information is lost when converting a dense array to sparse. Sparse matrices are just a compact data storage format. Also, unless you specify a value for the dtype argument of hstack, everything is upcast. So there is no possibility of data loss there either.
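
The two points above can be checked on a small example. This is only a sketch with made-up toy values: an integer sparse block standing in for A and a float dense block standing in for B.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy data: an integer sparse block and a float dense block.
A = csr_matrix(np.array([[1, 0], [0, 2], [0, 0]], dtype=np.int32))
B = np.array([[1.5, 0.0], [0.0, 2.25], [3.0, 4.0]])

# Dense -> sparse -> dense gives back exactly the same values.
B_round_trip = csr_matrix(B).toarray()
lossless = np.array_equal(B_round_trip, B)

# With no dtype argument, hstack picks a dtype wide enough for both blocks.
X = hstack((A, B), format='csr')
combined_dtype = X.dtype
```

In this sketch the int32 block is upcast to the float dtype of B, so the fractional values in B survive the merge.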

Further, if you plan to use Logistic Regression from sklearn, sparse matrices must be in csr format for the fit method to work.

The following code should work for your use case:

from scipy.sparse import hstack

X = hstack((A, B), format='csr')
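
For a fuller picture, here is a minimal end-to-end sketch that applies the same hstack call to both the training and the test set and then fits a model. Everything about the data is an assumption for illustration: the shapes, the random values, the labels, and the names A_train, A_test, B_train, B_test and y_train are all made up.

```python
import numpy as np
from scipy.sparse import hstack, random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up stand-ins: A_* play the role of tf-idf matrices, B_* of extra features.
A_train = sparse_random(20, 50, density=0.1, format='csr', random_state=0)
A_test = sparse_random(5, 50, density=0.1, format='csr', random_state=1)
B_train = rng.normal(size=(20, 3))
B_test = rng.normal(size=(5, 3))
y_train = np.array([0, 1] * 10)  # hypothetical binary labels

# B must have one row per sample, i.e. shape (n_samples, n_extra_features).
X_train = hstack((A_train, B_train), format='csr')
X_test = hstack((A_test, B_test), format='csr')

# Fit on the combined training features, then predict on the combined test features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Note that B is stacked without transposing here, because it already has one row per sample. If B were a 1-D array holding a single extra feature, reshaping it with `B.reshape(-1, 1)` before stacking would give it the required column shape.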
