
How to split sparse matrix into train and test sets?

I want to understand how to work with sparse matrices. I have this code to generate a multi-label classification data set as a sparse matrix.

from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse = True, n_labels = 20, return_indicator = 'sparse', allow_unlabeled = False)

This code gives me X in the following format:

<100x20 sparse matrix of type '<class 'numpy.float64'>' 
with 1797 stored elements in Compressed Sparse Row format>

and y in the following format:

<100x5 sparse matrix of type '<class 'numpy.int64'>'
with 471 stored elements in Compressed Sparse Row format>

Now I need to split X and y into X_train, X_test, y_train and y_test, so that the training set constitutes 70% of the data. How can I do it?

This is what I tried:

X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, stratify=y, test_size=0.3)

and got the error message:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

The error message itself seems to suggest the solution. You need to convert both X and y to dense matrices.

Do the following:

from sklearn.model_selection import train_test_split

X = X.toarray()
y = y.toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

The problem is due to stratify=y. If you look at the documentation for train_test_split, you can see the following:

*arrays :

  • Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

stratify :

  • array-like (does not mention sparse matrices)
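Going by the documented inputs above, the sparse matrices themselves should be fine as long as stratify is left out. Here is a minimal sketch of that, not a definitive recipe: the random_state values are my own addition for reproducibility, and the other arguments are copied from the question.

```python
from scipy import sparse
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

# Same data set as in the question; random_state is an assumption added
# here only so the example is reproducible.
X, y = make_multilabel_classification(
    sparse=True,
    n_labels=20,
    return_indicator="sparse",
    allow_unlabeled=False,
    random_state=0,
)

# No stratify argument: both X and y are passed as sparse matrices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
print(sparse.issparse(X_train), sparse.issparse(y_train))
```

The split is purely random here, so the label proportions in the two parts are not guaranteed to match.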

Unfortunately, this dataset doesn't work well with stratify even if y is cast to a dense array:

>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y.toarray(), test_size=0.3)
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
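As far as I can tell, the stratified splitter treats every distinct row of a 2-D y as its own class, so any label combination that occurs only once triggers this error. A rough way to check that on your own data is to count the distinct rows. The random_state below is an assumption for reproducibility, so the exact counts may differ from the data in the question.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# Same call as in the question, plus a fixed random_state (my assumption).
X, y = make_multilabel_classification(
    sparse=True,
    n_labels=20,
    return_indicator="sparse",
    allow_unlabeled=False,
    random_state=0,
)

# Each distinct row of the label indicator matrix is one label combination.
y_dense = y.toarray()
combos, counts = np.unique(y_dense, axis=0, return_counts=True)

print("distinct label combinations:", len(combos))
print("size of the rarest combination:", counts.min())
```

If the rarest combination has a single member, stratifying on the full label matrix cannot work for that data, whether it is sparse or dense.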
