
Python: how do you store a sparse matrix using Python?

I have produced a sparse matrix in Python and need to store it on my hard disk. How can I do that? If I should create a database for it, how should I go about that? This is my code:

import nltk
import cPickle
import numpy
from scipy.sparse import lil_matrix
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
f = open('spmatrix.pkl','wb')
def markov(L):
    count=0
    c=len(text1)
    for i in range(0,c-2):
        h=L.index(text1[i])
        k=L.index(text1[i+1])
        mat[h,k]=mat[h,k]+1  # update the transition count in the matrix
    cPickle.dump(mat,f,-1)



text = [w for g in brown.categories() for w in brown.words(categories=g)]
text1=text[1:500]
arr=set(text1)
arr=list(arr)
mat=lil_matrix((len(arr),len(arr)))
markov(arr)
f.close()

I need to store this "mat" in a file and be able to access values of the matrix by their coordinates.

The result of the sparse matrix looks like this:

  (173, 168)    2.0
  (173, 169)    1.0
  (173, 172)    1.0
  (173, 237)    4.0
  (174, 231)    1.0
  (175, 141)    1.0
  (176, 195)    1.0

But when I store it in a file and read it back, I get this:

  (0, 68)   1.0
  (0, 77)   1.0
  (0, 95)   1.0
  (0, 100)  1.0
  (0, 103)  1.0
  (0, 110)  1.0
  (0, 112)  2.0
  (0, 132)  1.0
  (0, 133)  2.0
  (0, 139)  1.0
  (0, 146)  2.0
  (0, 156)  1.0
  (0, 157)  1.0
  (0, 185)  1.0

Assuming you have a numpy matrix or ndarray, which your question and tags imply, there is a dump method and a load function you can use:

your_matrix.dump('output.mat')
another_matrix = numpy.load('output.mat')
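
One caveat: dump writes the array out as a pickle, so newer NumPy versions refuse to load it unless you explicitly allow it. A minimal sketch of the round trip, assuming a recent NumPy where allow_pickle defaults to False:

import numpy

your_matrix.dump('output.mat')   # serialises the array as a pickle
another_matrix = numpy.load('output.mat', allow_pickle=True)   # required on newer NumPy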

Note: This answer is in response to the revised question that now provides code.

You should not call cPickle.dump() in your function. Create the sparse matrix and then dump its contents to the file.

Try:

def markov(L):
    count=0
    c=len(text1)
    for i in range(0,c-2):
        h=L.index(text1[i])
        k=L.index(text1[i+1])
        mat[h,k]=mat[h,k]+1  # update the transition count in the matrix


text = [w for g in brown.categories() for w in brown.words(categories=g)]
text1=text[1:500]
arr=set(text1)
arr=list(arr)
mat=lil_matrix((len(arr),len(arr)))
markov(arr)

# dump the finished matrix once, after it has been fully built
f = open('spmatrix.pkl','wb')
cPickle.dump(mat,f,-1)
f.close()
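
To read the matrix back and look up individual counts by their coordinates (which is what the question ultimately needs), a minimal sketch reusing the spmatrix.pkl file written above; row and col stand for whatever coordinates you want to query:

f = open('spmatrix.pkl','rb')
mat = cPickle.load(f)   # restores the lil_matrix exactly as it was dumped
f.close()

row, col = 0, 1         # any coordinates within the matrix shape
print mat[row, col]     # value at a single (row, col) coordinate
print mat[row, :]       # a whole row, still sparse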

For me, using the -1 protocol option in the cPickle.dump function caused the pickled file not to be loadable afterwards.

The object I dumped through cPickle was an instance of scipy.sparse.dok_matrix.

Using only two arguments did the trick for me; the documentation for pickle.dump() states that the default value of the protocol parameter is 0.

Working on Windows 7, Python 2.7.2 (64-bit), and cPickle v 1.71.

Example:

>>> import cPickle
>>> print cPickle.__version__
1.71
>>> from scipy import sparse
>>> H = sparse.dok_matrix((135, 654), dtype='int32')
>>> H[33, 44] = 8
>>> H[123, 321] = -99
>>> print str(H)
  (123, 321)    -99
  (33, 44)  8
>>> fname = 'dok_matrix.pkl'
>>> f = open(fname, mode="wb")
>>> cPickle.dump(H, f)
>>> f.close()
>>> f = open(fname, mode="rb")
>>> M = cPickle.load(f)
>>> f.close()
>>> print str(M)
  (123, 321)    -99
  (33, 44)  8
>>> M == H
True
>>> 

pyTables is the Python interface to the HDF5 data model and is a popular choice that is well integrated with NumPy and SciPy. pyTables lets you access slices of arrays stored on disk without needing to load the entire array back into memory.

I don't have any specific experience with sparse matrices per se and a quick Google search neither confirmed nor denied that sparse matrices are supported.
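
A common workaround is to store the matrix's underlying arrays as plain HDF5 arrays and rebuild the sparse matrix on load. Below is a minimal sketch of that idea for the CSR layout, using the PyTables 3.x call names (open_file/create_array); the file and node names are made up for illustration:

import numpy
import tables
from scipy.sparse import csr_matrix, lil_matrix

mat = lil_matrix((5, 5))
mat[1, 2] = 3.0
csr = mat.tocsr()

# write the three CSR component arrays plus the shape to an HDF5 file
h5 = tables.open_file('spmatrix.h5', mode='w')
h5.create_array(h5.root, 'data', csr.data)
h5.create_array(h5.root, 'indices', csr.indices)
h5.create_array(h5.root, 'indptr', csr.indptr)
h5.create_array(h5.root, 'shape', numpy.array(csr.shape))
h5.close()

# read the components back and rebuild the sparse matrix
h5 = tables.open_file('spmatrix.h5', mode='r')
mat2 = csr_matrix((h5.root.data[:], h5.root.indices[:], h5.root.indptr[:]),
                  shape=tuple(h5.root.shape[:]))
h5.close()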

In addition to the HDF5 support, Python also has NetCDF support, which is well suited to matrix-style data storage and quick access, both sparse and dense. It is included in Python(x,y) for Windows, which a lot of scientific users of Python end up with.

More numpy-based examples can be found in this cookbook.
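
A minimal netcdf4-python sketch along the same lines, storing the matrix in COO (triplet) form; the file, dimension, and variable names here are made up for illustration:

from netCDF4 import Dataset
from scipy.sparse import coo_matrix, lil_matrix

mat = lil_matrix((5, 5))
mat[1, 2] = 3.0
coo = mat.tocoo()

# write the (row, col, value) triplets plus the shape
nc = Dataset('spmatrix.nc', 'w')
nc.createDimension('nnz', coo.nnz)
rows = nc.createVariable('row', 'i4', ('nnz',))
cols = nc.createVariable('col', 'i4', ('nnz',))
vals = nc.createVariable('val', 'f8', ('nnz',))
rows[:] = coo.row
cols[:] = coo.col
vals[:] = coo.data
nc.setncattr('nrows', coo.shape[0])
nc.setncattr('ncols', coo.shape[1])
nc.close()

# read the triplets back and rebuild the sparse matrix
nc = Dataset('spmatrix.nc', 'r')
mat2 = coo_matrix((nc.variables['val'][:],
                   (nc.variables['row'][:], nc.variables['col'][:])),
                  shape=(int(nc.getncattr('nrows')), int(nc.getncattr('ncols'))))
nc.close()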

For very big sparse matrices on clusters, you might use pytrilinos; it has an HDF5 interface that can dump a sparse matrix to disk, and it also works if the matrix is distributed across different nodes.

http://trilinos.sandia.gov/packages/pytrilinos/development/EpetraExt.html#input-output-classes

Depending on the size of the sparse matrix, I tend to just use cPickle to pickle the array:

import cPickle
f = open('spmatrix.pkl','wb')   # open the file in write binary mode
cPickle.dump(your_matrix,f,-1)  # -1 selects the highest available pickle protocol
f.close()

If I'm dealing with really large datasets, then I tend to use netcdf4-python.

Edit:

To then access the file again you would:

f = open('spmatrix.pkl','rb') # open the file in read binary mode
# load the data in the .pkl file into a new variable spmat
spmat = cPickle.load(f) 
f.close()
