How to manually create a sparse matrix in Python

Question

I have a text file containing data that represents a sparse matrix, in the following format:

0 234 345
0 236 
0 345 365 465
0 12 35 379

The data is used for a classification task, and each row can be considered a feature vector. The first value in each row is a label; the values that follow it indicate the presence of individual features.

I'm trying to create a sparse matrix from these values (to use in a machine learning task with scikit-learn). I've found and read the scipy.sparse documentation, but I'm failing to understand how to incrementally build up a sparse matrix from source data like this.

The examples I've found so far show how to take a dense matrix and convert it, or how to create a native sparse matrix from contrived data, but none of them have helped me here. I did find this related SO question (Building and updating a sparse matrix in python using scipy), but its example assumes you know the maximum COL and ROW sizes, which I don't, so that data type doesn't seem appropriate.

So far I have the following code to read the document and parse the values into something that seems reasonable:

def get_sparse_matrix():
    matrix = []
    with open("data.dat", 'r') as f:
        for i, line in enumerate(f):
            row = line.strip().split()
            label = row[0]
            features = row[1:]
            matrix.append([(i, col) for col in features])

    sparse_matrix = None  # magic happens here

    return sparse_matrix

So my questions are:

What is the appropriate sparse matrix type to use here?
Am I heading in the right direction with the code I have?

Any help is greatly appreciated.

Answer 1

You can use coo_matrix():

import numpy as np
from scipy import sparse
data = """0 234 345
0 236 
0 345 365 465
0 12 35 379"""

column_list = []
for line in data.split("\n"):
    values = [int(x) for x in line.strip().split()[1:]]
    column_list.append(values)
lengths = [len(row) for row in column_list]
cols = np.concatenate(column_list)
rows = np.repeat(np.arange(len(column_list)), lengths)
m = sparse.coo_matrix((np.ones_like(rows), (rows, cols)))
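A side note on the question's concern about not knowing the maximum ROW and COL sizes: when no shape is passed, coo_matrix() derives it from the largest row and column index it is given. The sketch below uses toy indices invented for illustration (not the question's data) and shows both the inferred shape and an explicit one; the variable names are assumptions.

```python
import numpy as np
from scipy import sparse

# Toy coordinates, chosen only for illustration.
rows = np.array([0, 0, 1])
cols = np.array([2, 5, 1])
vals = np.ones_like(rows)

# No shape given: inferred as (max row index + 1, max column index + 1).
inferred = sparse.coo_matrix((vals, (rows, cols)))
print(inferred.shape)  # (2, 6)

# Explicit shape: useful when another matrix (for example a test set)
# must have the same number of columns as this one.
fixed = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 10))
print(fixed.shape)  # (2, 10)
```

Passing an explicit shape also matters when trailing rows have no features at all, because inference would otherwise drop them.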

To check the result, list the row and column indices of the nonzero entries of m:

np.where(m.toarray())

The output:

(array([0, 0, 1, 2, 2, 2, 3, 3, 3]),
 array([234, 345, 236, 345, 365, 465,  12,  35, 379]))
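For the file-based case in the question, the same idea could be wrapped in a function that also keeps the labels. This is a minimal sketch, not the answerer's code: the name load_sparse_dataset, the choice to skip blank lines, and the conversion to CSR (a format scikit-learn estimators commonly accept) are all assumptions.

```python
import numpy as np
from scipy import sparse


def load_sparse_dataset(path):
    """Parse 'label feat feat ...' lines into (X, y).

    X is a CSR matrix with a 1 wherever a feature is present;
    y is an integer array holding one label per row.
    """
    labels, rows, cols = [], [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            row_index = len(labels)  # index of the row being added
            labels.append(int(fields[0]))
            for col in fields[1:]:
                rows.append(row_index)
                cols.append(int(col))
    data = np.ones(len(rows))
    # Set the row count explicitly so rows without features are kept.
    shape = (len(labels), max(cols) + 1 if cols else 0)
    X = sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()
    return X, np.array(labels)
```

Calling X, y = load_sparse_dataset("data.dat") on a file with the four sample lines from the question should give a matrix with 4 rows and 466 columns. Building plain Python lists first and converting once at the end avoids repeatedly modifying a sparse matrix in place.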

Answer 1: 4 (accepted), 2014-11-15 03:09:44
