Save / load a scipy sparse csr_matrix in a portable data format

Question

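How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit), and I got the error: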

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), but none of these methods worked either.

Answer 1

Edit: scipy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
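As a minimal sketch of the file-like variant, the round trip below goes through an in-memory io.BytesIO buffer rather than a file on disk. The matrix contents and the variable name buffer are only illustrative.

```python
import io

import numpy as np
from scipy import sparse

# Small illustrative matrix; any CSR matrix would do.
your_matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

# Write the .npz payload into memory instead of onto disk.
buffer = io.BytesIO()
sparse.save_npz(buffer, your_matrix)

# Rewind before reading, as with any file opened for both writing and reading.
buffer.seek(0)
your_matrix_back = sparse.load_npz(buffer)
```

The same pattern should apply to a file object returned by open(..., 'wb') and open(..., 'rb').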

I got an answer from the Scipy user group:

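A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with: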

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

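import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends ".npz" to filename if it is missing
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the full name here, including the ".npz" extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])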
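To make the three attributes concrete, here is a small sketch that inspects them for a 3x3 matrix and rebuilds the matrix from them without touching the disk. The example matrix is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]))

data, indices, indptr = m.data, m.indices, m.indptr
# data    holds the stored values, row by row:            [1 1]
# indices holds the column index of each stored value:    [0 1]
# indptr  says row i owns data[indptr[i]:indptr[i + 1]]:  [0 0 1 2]

# The three arrays plus the shape are enough to rebuild the matrix.
rebuilt = csr_matrix((data, indices, indptr), shape=m.shape)
```

Row 0 is empty, which is why indptr starts with two zeros.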

Answer 2

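Though you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.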

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
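Part of the appeal of this route is that the Matrix Market file is plain text with a self-describing header. The sketch below writes the same matrix to a temporary directory, reads the header with scipy.io.mminfo, and converts the loaded matrix back to CSR. The temporary path is only an illustration.

```python
import os
import tempfile

from scipy import io, sparse

m = sparse.csr_matrix([[0, 0, 0], [1, 0, 0], [0, 1, 0]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.mtx")
    io.mmwrite(path, m)

    # mminfo reads only the header: (rows, cols, entries, format, field, symmetry)
    info = io.mminfo(path)

    # The first line of the file is the human-readable Matrix Market banner.
    with open(path) as fh:
        banner = fh.readline().strip()

    # mmread returns COO, so convert back explicitly if CSR is needed.
    restored = io.mmread(path).tocsr()
```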

Answer 3

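Here is a performance comparison of the three most upvoted answers using Jupyter Notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values: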

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

`io.mmwrite` / `io.mmread`

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format has been changed from csr to coo.)

`np.savez` / `np.load`

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

`cPickle`

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

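Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.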

Conclusion

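(Based on this simple test for CSR matrices.) cPickle is the fastest method, but it doesn't work with very large matrices. np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores the matrix in the wrong format. So np.savez is the winner here.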
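For anyone who wants to repeat the comparison on their own data, here is a rough Python 3 sketch of the same experiment. It is not the original benchmark: the matrix is far smaller, sparse.save_npz with compressed=False stands in for the hand-written np.savez helper, the standard pickle module stands in for cPickle, and the helper timed and the file names are made up for illustration. Timings will differ by machine, so none are quoted.

```python
import os
import pickle
import tempfile
import time

from scipy import io, sparse

# Deliberately small so the sketch finishes quickly; the original used 1M x 100K.
matrix = sparse.random(2000, 500, density=0.01, format="csr")

def timed(label, func):
    # Run func once, print the elapsed wall time, and pass its result through.
    start = time.perf_counter()
    result = func()
    print("%-16s %.4f s" % (label, time.perf_counter() - start))
    return result

def save_pickle(m, path):
    with open(path, "wb") as fh:
        pickle.dump(m, fh, pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)

with tempfile.TemporaryDirectory() as tmpdir:
    mtx = os.path.join(tmpdir, "m.mtx")
    npz = os.path.join(tmpdir, "m.npz")
    pkl = os.path.join(tmpdir, "m.pkl")

    timed("mmwrite", lambda: io.mmwrite(mtx, matrix))
    from_mtx = timed("mmread", lambda: io.mmread(mtx)).tocsr()

    timed("save_npz", lambda: sparse.save_npz(npz, matrix, compressed=False))
    from_npz = timed("load_npz", lambda: sparse.load_npz(npz))

    timed("pickle.dump", lambda: save_pickle(matrix, pkl))
    from_pkl = timed("pickle.load", lambda: load_pickle(pkl))

    # File sizes in bytes, collected while the temporary files still exist.
    sizes = {"mtx": os.path.getsize(mtx),
             "npz": os.path.getsize(npz),
             "pkl": os.path.getsize(pkl)}
    print(sizes)
```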

Answer 4

You can now use scipy.sparse.save_npz.

Answer 5

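Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this: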

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
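The snippets above are written for Python 2, where cPickle exists. A rough Python 3 equivalent, as a sketch only, uses the standard pickle module and pins protocol 2, which is binary and is the newest protocol Python 2 can read. Whether the result actually loads under another Python, NumPy, or SciPy combination is not guaranteed, as the question itself shows.

```python
import pickle

import numpy as np
import scipy.sparse

# Just for testing: a dense random array converted to a csr_matrix.
x = scipy.sparse.csr_matrix(np.random.random((10, 10)))

# Protocol 2 is binary and is the highest protocol that Python 2 understands.
payload = pickle.dumps(x, protocol=2)

# Loading in the same interpreter only demonstrates the round trip here;
# it says nothing about compatibility with a different environment.
x_back = pickle.loads(payload)
```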

Answer 6

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

#Save
sparse.save_npz('data_sparse.npz', data)

#Load
data = sparse.load_npz("data_sparse.npz")
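Two details may be worth seeing in action: save_npz compresses by default (the compressed parameter), and load_npz returns the matrix in the sparse format it was saved in. The sketch below uses a CSC matrix with an artificial regular pattern, and the file names are placeholders.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

dense = np.zeros((200, 300))
dense[::7, ::11] = 1.5          # a regular pattern that compresses well
data = sparse.csc_matrix(dense)

with tempfile.TemporaryDirectory() as tmpdir:
    packed = os.path.join(tmpdir, "packed.npz")
    raw = os.path.join(tmpdir, "raw.npz")

    sparse.save_npz(packed, data)                   # compressed=True is the default
    sparse.save_npz(raw, data, compressed=False)    # skip compression

    packed_size = os.path.getsize(packed)
    raw_size = os.path.getsize(raw)

    # The loaded matrix comes back as CSC, the format it was saved in.
    loaded = sparse.load_npz(packed)
```

How much compression helps depends heavily on the data; random values compress far less than this constant pattern.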

Answer 7

Edit: apparently it is simple enough to do this:

def sparse_matrix_tuples(m):
    yield from m.todok().items()

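This yields ((i, j), value) tuples, which are easy to serialize and deserialize. I'm not sure how it compares performance-wise with the csr_matrix code below, but it's definitely simpler. I'm leaving the original answer below because I hope it's informative.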
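As one possible way to use those tuples, the sketch below dumps them to JSON together with the shape and rebuilds a CSR matrix from the result. The JSON layout (the keys shape and triples) is an assumption made for this example, and the rebuild step assumes the matrix has at least one stored value.

```python
import json

from scipy.sparse import coo_matrix, csr_matrix

def sparse_matrix_tuples(m):
    yield from m.todok().items()

m = csr_matrix([[0.0, 2.5, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 4.0]])

# Cast NumPy scalars to plain Python types so that json can encode them.
triples = [[int(i), int(j), float(v)] for (i, j), v in sparse_matrix_tuples(m)]
text = json.dumps({"shape": list(m.shape), "triples": triples})

# Rebuild: COO takes (values, (rows, cols)) directly, then convert to CSR.
payload = json.loads(text)
rows, cols, vals = zip(*payload["triples"])
m_back = coo_matrix((vals, (rows, cols)), shape=tuple(payload["shape"])).tocsr()
```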

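Adding my two cents: for me, npz is not portable, because I can't use it to export my matrix easily to non-Python clients (e.g. PostgreSQL; glad to be corrected). So I wanted CSV output for the sparse matrix (much like what you get when you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code produces CSV output. You can adapt it for other representations.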

import numpy as np

def csr_matrix_tuples(m):
    # visit only the non-empty rows; np.diff(m.indptr) is the number of stored values per row
    for i in np.flatnonzero(np.diff(m.indptr)):
        start_index, end_index = m.indptr[i], m.indptr[i + 1]
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')

According to my tests, it is about 2 times slower than save_npz in the current implementation.
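Reading such CSV output back is not covered above. A minimal sketch, assuming one row,column,value triple per line, real-valued data, and a shape supplied by the caller (the CSV itself does not record it), could look like this. The function name load_csr_from_triples is hypothetical.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

def load_csr_from_triples(fh, shape):
    # ndmin=2 keeps a single-line file two-dimensional.
    table = np.loadtxt(fh, delimiter=",", ndmin=2)
    # Indices pass through float64 here, which is exact for any realistic size.
    rows = table[:, 0].astype(np.intp)
    cols = table[:, 1].astype(np.intp)
    return coo_matrix((table[:, 2], (rows, cols)), shape=shape).tocsr()

csv_text = "0,1,2.5\n2,0,1.0\n2,2,4.0\n"
m = load_csr_from_triples(io.StringIO(csv_text), shape=(3, 3))
```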

Answer 8

Here is what I use to save a lil_matrix.

import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
        rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
    # data and rows are object arrays, which np.load only unpickles on request
    loader = np.load(filename, allow_pickle=True)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

from scipy.sparse import lil_matrix
import numpy as np
import json

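def lil_matrix_to_dict(myarray):
    # json cannot encode NumPy arrays or NumPy integer scalars,
    # so convert every row to a plain list of Python scalars
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data":  [np.asarray(row, dtype=myarray.dtype).tolist() for row in myarray.data],
        "rows":  [[int(j) for j in row] for row in myarray.rows]
    }
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    # assign row by row so data and rows stay 1-D object arrays of lists
    for i, (row, data) in enumerate(zip(mydict["rows"], mydict["data"])):
        result.rows[i] = row
        result.data[i] = data
    return result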

def load_lil_matrix(filename):
    result = None
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
        result = lil_matrix_from_dict(mydict)
    return result

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)
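If the LIL layout itself does not need to survive on disk, another option is to convert to CSR before saving and back to LIL after loading, which lets scipy.sparse.save_npz and load_npz do the work with plain numeric arrays. This is a sketch of that alternative, not the approach used above, and it uses an in-memory buffer only to stay self-contained.

```python
import io

from scipy import sparse

lil = sparse.lil_matrix((4, 5))
lil[0, 1] = 1.0
lil[2, 3] = 2.0
lil[3, 0] = 3.0

buffer = io.BytesIO()
sparse.save_npz(buffer, lil.tocsr())    # store the CSR form of the same values

buffer.seek(0)
lil_back = sparse.load_npz(buffer).tolil()
```

The two conversions cost time on large matrices, so it is worth timing against the helpers above before switching.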

Answer 9

This works for me:

import numpy as np
import scipy.sparse as sp
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
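file = 'sparse_pair.npz'  # placeholder name; the original left `file` undefined
np.savez(file, x=x, y=y)
npz = np.load(file, allow_pickle=True)  # the matrices are stored as pickled objects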

>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)

The trick was to call .tolist() to convert the shape 0 object array back to the original object.
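To show what that shape 0 object array is, the sketch below builds one explicitly, saves it into an in-memory buffer, and unwraps it again. Building the wrapper by hand is a choice made for this illustration, and .item() is used as an equivalent of .tolist() for a zero-dimensional array. Because the matrix is stored as a pickle, this inherits pickle's portability and security caveats.

```python
import io

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix([1, 2, 3])

# Build the zero-dimensional object array explicitly and put the matrix in it.
wrapped = np.empty((), dtype=object)
wrapped[()] = x

buffer = io.BytesIO()
np.savez(buffer, x=wrapped)
buffer.seek(0)

# Object arrays are stored as pickles, so loading must opt in explicitly.
npz = np.load(buffer, allow_pickle=True)
boxed = npz["x"]

# Unwrap the single element to get the sparse matrix back.
x_back = boxed.item()
```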

Answer 10

I was asked to send the matrix in a simple and generic format:

<x,y,value>

I ended up with this:

import numpy as np

def save_sparse_matrix(m, filename):
    nonZeros = np.array(m.nonzero())
    # the with block closes the file, which the original never did
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            row, col = nonZeros[0, entry], nonZeros[1, entry]
            thefile.write("%s,%s,%s\n" % (row, col, m[row, col]))
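Indexing the matrix once per entry can be slow for large inputs. A possible vectorized variant, sketched under the assumption of real-valued data, converts to COO and lets np.savetxt write all triples at once. The name save_sparse_matrix_fast is hypothetical, and unlike m.nonzero() the COO arrays also include any explicitly stored zeros.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_matrix_fast(m, target):
    coo = m.tocoo()
    table = np.column_stack((coo.row, coo.col, coo.data))
    # One format per column: integer row, integer column, full-precision value.
    np.savetxt(target, table, fmt=["%d", "%d", "%.17g"], delimiter=",")

m = csr_matrix([[0.0, 2.5], [1.0, 0.0]])
out = io.StringIO()
save_sparse_matrix_fast(m, out)
lines = out.getvalue().splitlines()
# lines is ["0,1,2.5", "1,0,1"]
```

The target argument can be a filename as well as an open text file.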
