Training an sklearn ML model with a scipy sparse matrix and a numpy array

Question

To explain my use case a bit more: A is a sparse matrix with tf-idf values and B is an array with some additional features of my data.

I have already split the data into training and test sets, so A and B in my example only cover the training set. I want to do the same for the test set after this code.

I want to concatenate these matrices/arrays because I then want to pass them to an sklearn ML model to train it, and I do not think that I can pass them separately.

So I tried to do this:

C = np.concatenate((A, B.T), axis=1)

where A is a <class 'scipy.sparse.csr.csr_matrix'> and B is a <class 'numpy.ndarray'>.

However, when I try this I get the following error:

ValueError: zero-dimensional arrays cannot be concatenated

Also, I do not think that using `np.concatenate` on a numpy array and a sparse matrix is a very good idea in my case because

it is basically impossible to convert my sparse matrix A to a dense array because it is too big
I will lose (or maybe not?) information if I convert my fully dense array B to a sparse matrix

What is the best way to pass a sparse matrix and a fully dense array, concatenated row by row, to an sklearn ML model?

Answer 1

You can use hstack from scipy. hstack will convert both matrices to scipy coo_matrix, merge them and return a coo_matrix by default.
No information is lost when converting a dense array to sparse. Sparse matrices are just a compact data storage format. Also, unless you specify a value for the dtype argument of hstack, everything is upcast to a common dtype. So there is no possibility of data loss there either.
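To make the two points above concrete, here is a minimal sketch. The toy A and B below are hypothetical stand-ins for the matrices in the question, not the asker's actual data. It checks that a dense array survives a round trip through a sparse format unchanged, and that hstack upcasts mixed dtypes when no dtype is given.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy data standing in for the question's A (sparse) and B (dense).
A = csr_matrix(np.array([[0, 1, 0], [2, 0, 0]], dtype=np.int32))
B = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float64)

# Dense -> sparse -> dense: the values come back exactly as they went in.
B_sparse = csr_matrix(B)
round_trip = B_sparse.toarray()

# No dtype argument: the int32 and float64 blocks are upcast to a common dtype.
X = hstack((A, B))
dense_reference = np.hstack([A.toarray(), B])
```

Here round_trip equals B element for element, and X holds the same values as dense_reference while staying sparse.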

Further, if you plan to use Logistic Regression from sklearn, the sparse matrix should be in csr format for the fit method to work.

The following code should work for your use case:

from scipy.sparse import hstack

X = hstack((A, B), format='csr')
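For a fuller picture, the sketch below runs the same idea end to end on made-up data. The names A_train, B_train, y_train, A_test and B_test are hypothetical, and the example assumes B already has one row per sample (a 1-D B would need B.reshape(-1, 1) first, and a B stored as features-by-samples would need B.T). The test set is stacked the same way as the training set so the columns line up.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: A_* mimic tf-idf matrices, B_* mimic extra dense features.
A_train = csr_matrix(np.array([[0.0, 0.9, 0.0],
                               [0.8, 0.0, 0.0],
                               [0.0, 0.7, 0.1],
                               [0.9, 0.0, 0.2]]))
B_train = np.array([[1.0], [0.0], [1.0], [0.0]])
y_train = np.array([1, 0, 1, 0])

A_test = csr_matrix(np.array([[0.0, 0.8, 0.0],
                              [0.7, 0.0, 0.1]]))
B_test = np.array([[1.0], [0.0]])

# Both blocks must have the same number of rows (one row per sample).
X_train = hstack((A_train, B_train), format='csr')
X_test = hstack((A_test, B_test), format='csr')

# Fit on the stacked training matrix, then predict on the stacked test matrix.
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

The stacked matrices stay sparse throughout, so A never has to be converted to a dense array.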
