Save / load scipy sparse csr_matrix in portable data format

How do you save and load a scipy sparse csr_matrix in a portable format? The matrix is created on Python 3 (Windows 64-bit) and needs to run on Python 2 (Linux 64-bit). Initially I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit); it raised the error:

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

Edit: scipy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
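As a minimal sketch of the file-like case, the round trip below goes through an in-memory io.BytesIO buffer rather than a file on disk. The variable names and the small test matrix are illustrative assumptions, not part of the original answer.

```python
import io

import numpy as np
from scipy import sparse

# Small illustrative matrix (hypothetical data).
m = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

# Write to an in-memory binary buffer instead of a filename.
buf = io.BytesIO()
sparse.save_npz(buf, m)

# Rewind the buffer before reading it back.
buf.seek(0)
m_back = sparse.load_npz(buf)
```

The same pattern should apply to a file opened with open(path, 'wb') for saving and open(path, 'rb') for loading.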


Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

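import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends '.npz' to the filename if it is missing
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the full filename here, including the '.npz' extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])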
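To make the three arrays concrete, here is a small sketch that prints nothing but exposes them for a 3x3 matrix and rebuilds the matrix from them. The example matrix and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]))

data = m.data.tolist()        # the stored non-zero values
indices = m.indices.tolist()  # the column index of each stored value
indptr = m.indptr.tolist()    # row i owns entries indptr[i]:indptr[i+1]

# The three arrays plus the shape are enough to rebuild the matrix.
rebuilt = csr_matrix((m.data, m.indices, m.indptr), shape=m.shape)
```

Here data is [1, 1], indices is [0, 1] and indptr is [0, 0, 1, 2]: row 0 is empty, row 1 holds the first stored value, and row 2 holds the second.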

Though you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)

Here is a performance comparison of the three most upvoted answers using Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

io.mmwrite / io.mmread

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format has been changed from csr to coo.)
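If the CSR format matters after reading a Matrix Market file, one option is to convert explicitly after loading. The sketch below assumes a small test matrix and a temporary directory; the names path and restored are hypothetical.

```python
import os
import tempfile

from scipy import io, sparse

m = sparse.csr_matrix([[0, 0, 0], [1, 0, 0], [0, 1, 0]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.mtx")
    io.mmwrite(path, m)
    # mmread returns COO for sparse input, so convert back to CSR explicitly.
    restored = io.mmread(path).tocsr()
```

The conversion adds some time on top of the already slow read, so it does not change the overall ranking in this comparison.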

np.savez / np.load

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

cPickle

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.

Conclusion

(Based on this simple test for CSR matrices.) cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower; io.mmwrite is much slower, produces a bigger file, and restores to the wrong format. So np.savez is the winner here.
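For anyone who wants to repeat the comparison outside Jupyter, here is a rough sketch of a timing harness using time.perf_counter. It assumes Python 3 (so pickle rather than cPickle) and a much smaller matrix than the one above; the helper names timed, save_savez, load_savez, save_pickle and load_pickle are hypothetical, and the numbers will differ from machine to machine.

```python
import os
import pickle
import tempfile
import time

import numpy as np
from scipy import sparse


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def save_savez(path, m):
    np.savez(path, data=m.data, indices=m.indices,
             indptr=m.indptr, shape=m.shape)


def load_savez(path):
    # The context manager closes the .npz file once the arrays are read.
    with np.load(path) as loader:
        return sparse.csr_matrix(
            (loader["data"], loader["indices"], loader["indptr"]),
            shape=tuple(loader["shape"]))


def save_pickle(path, m):
    with open(path, "wb") as outfile:
        pickle.dump(m, outfile, pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    with open(path, "rb") as infile:
        return pickle.load(infile)


# Small random CSR matrix; scale the shape up to approach the test above.
matrix = sparse.random(2000, 1000, density=0.01, format="csr", random_state=0)

timings = {}
with tempfile.TemporaryDirectory() as tmpdir:
    methods = [("np.savez", save_savez, load_savez, "m.npz"),
               ("pickle", save_pickle, load_pickle, "m.pkl")]
    for name, save, load, fname in methods:
        path = os.path.join(tmpdir, fname)
        _, t_save = timed(save, path, matrix)
        loaded, t_load = timed(load, path)
        timings[name] = (t_save, t_load, os.path.getsize(path))

for name, (t_save, t_load, size) in timings.items():
    print("%s: save %.4fs, load %.4fs, %d bytes" % (name, t_save, t_load, size))
```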

You can now use scipy.sparse.save_npz.

Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this:

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
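To illustrate the point about binary protocols, the following sketch compares the pickled size of the same matrix under the text protocol 0 and the highest binary protocol. It assumes Python 3, where cPickle has been folded into pickle; the matrix size and variable names are arbitrary choices for the example.

```python
import pickle

import numpy as np
import scipy.sparse

# Dense random data converted to CSR, as in the answer above.
rng = np.random.default_rng(0)
x = scipy.sparse.csr_matrix(rng.random((200, 200)))

# Protocol 0 is the text protocol; HIGHEST_PROTOCOL is binary.
text_size = len(pickle.dumps(x, protocol=0))
binary_size = len(pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL))

print("protocol 0:", text_size, "bytes")
print("binary:    ", binary_size, "bytes")
```

The binary pickle comes out smaller because the text protocol has to escape many of the raw bytes in the underlying arrays.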

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

#Save
sparse.save_npz('data_sparse.npz', data)

#Load
data = sparse.load_npz("data_sparse.npz")

EDIT: Apparently it is simple enough to do this:

def sparse_matrix_tuples(m):
    yield from m.todok().items()

This yields ((i, j), value) tuples, which are easy to serialize and deserialize. I'm not sure how it compares performance-wise with the code below for csr_matrix, but it's definitely simpler. I'm leaving the original answer below, as I hope it's informative.
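As a rough sketch of what serializing and deserializing those tuples could look like, the example below writes them as CSV rows into an in-memory buffer and rebuilds the matrix with coo_matrix. The test matrix, the use of io.StringIO, and the float value type are assumptions; note that the shape has to be carried separately, since the tuples alone do not record it.

```python
import csv
import io

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

m = csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0], [0.0, 0.0]]))

# Serialize: one "row,column,value" line per stored entry.
buf = io.StringIO()
writer = csv.writer(buf)
for (i, j), value in m.todok().items():
    writer.writerow([i, j, value])

# Deserialize: collect the triples and rebuild the matrix.
buf.seek(0)
rows, cols, vals = [], [], []
for i, j, value in csv.reader(buf):
    rows.append(int(i))
    cols.append(int(j))
    vals.append(float(value))

rebuilt = coo_matrix((vals, (rows, cols)), shape=m.shape).tocsr()
```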


Adding my two cents: for me, npz is not portable, as I can't use it to export my matrix easily to non-Python clients (e.g. PostgreSQL -- glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you get when you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code produces CSV output. You can adapt it for other representations.

def csr_matrix_tuples(m):
    # row i owns the slice indptr[i]:indptr[i+1]; an empty row gives an
    # empty slice, so the inner loop is skipped and i stays correct
    for i in range(m.shape[0]):
        start_index, end_index = m.indptr[i], m.indptr[i + 1]
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')

It's about 2 times slower than save_npz in the current implementation, from what I've tested.

This is what I used to save a lil_matrix.

import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
        rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
    # data and rows are object arrays, so NumPy >= 1.16.3 needs allow_pickle=True
    loader = np.load(filename, allow_pickle=True)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result
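An alternative sketch, assuming SciPy 0.19 or later: since save_npz does not accept LIL directly, convert to CSR for storage and back to LIL after loading. This avoids pickled object arrays altogether. The buffer and variable names below are illustrative.

```python
import io

from scipy import sparse

# Small illustrative LIL matrix.
lil = sparse.lil_matrix((3, 4))
lil[0, 1] = 2.0
lil[2, 3] = 5.0

# save_npz does not support LIL, so store the CSR form instead.
buf = io.BytesIO()  # a filename works here as well
sparse.save_npz(buf, lil.tocsr())

# Load the CSR matrix and convert it back to LIL.
buf.seek(0)
lil_back = sparse.load_npz(buf).tolil()
```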

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

from scipy.sparse import lil_matrix
import numpy as np
import json

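def lil_matrix_to_dict(myarray):
    # json cannot serialize ndarrays or NumPy scalars, so convert every row
    # to a plain list of Python numbers first
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data":  [np.asarray(row, dtype=myarray.dtype).tolist() for row in myarray.data],
        "rows":  [[int(j) for j in row] for row in myarray.rows]
    }
    return result

def _object_array(lists):
    # np.array() would build a 2-D array (or fail on ragged rows); a LIL
    # matrix needs a 1-D object array holding one list per row
    result = np.empty(len(lists), dtype=object)
    for i, item in enumerate(lists):
        result[i] = item
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    result.data = _object_array(mydict["data"])
    result.rows = _object_array(mydict["rows"])
    return result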

def load_lil_matrix(filename):
    result = None
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
        result = lil_matrix_from_dict(mydict)
    return result

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)

This works for me:

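import numpy as np
import scipy.sparse as sp
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
file = 'matrices.npz'  # any path; this name is only an example
np.savez(file, x=x, y=y)
# the matrices are stored as pickled object arrays, so NumPy >= 1.16.3
# needs allow_pickle=True to read them back
npz = np.load(file, allow_pickle=True)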

>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)

The trick was to call .tolist() to convert the 0-dimensional object array back to the original object.
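The sketch below shows why this works, using a plain dict in place of a sparse matrix: as far as I can tell, np.savez converts each value with np.asanyarray, and a non-array object ends up wrapped in a 0-dimensional object array. The names original and wrapped are illustrative.

```python
import numpy as np

# Any Python object that NumPy cannot interpret as numeric data.
original = {"name": "not an array"}

# Wrapping it gives a 0-dimensional array of dtype object.
wrapped = np.asanyarray(original)

# .tolist() (or .item()) on a 0-d array returns the wrapped object itself.
unwrapped = wrapped.tolist()
```

Here wrapped.shape is the empty tuple (), and unwrapped is the very same dict object as original.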

I was asked to send the matrix in a simple and generic format:

<x,y,value>

I ended up with this:

import numpy as np

def save_sparse_matrix(m, filename):
    nonZeros = np.array(m.nonzero())
    # the with block closes the file even if writing fails
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            row, col = nonZeros[0, entry], nonZeros[1, entry]
            thefile.write("%s,%s,%s\n" % (row, col, m[row, col]))
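For completeness, here is a sketch of reading that x,y,value format back into a sparse matrix. The function name load_sparse_triples is hypothetical, values are assumed to be floats, and the shape must be supplied by the caller because the file does not record it.

```python
import io

from scipy.sparse import coo_matrix


def load_sparse_triples(fileobj, shape):
    """Rebuild a CSR matrix from lines of the form 'x,y,value'."""
    rows, cols, vals = [], [], []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, value = line.split(",")
        rows.append(int(x))
        cols.append(int(y))
        vals.append(float(value))
    return coo_matrix((vals, (rows, cols)), shape=shape).tocsr()


# Usage with an in-memory file; open(filename) works the same way.
text = "0,1,2.5\n2,0,4.0\n"
m = load_sparse_triples(io.StringIO(text), shape=(3, 2))
```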
