Save Large Scipy Sparse Matrix

I am trying to cPickle a large scipy sparse matrix for later use. I am getting this error:

  File "tfidf_scikit.py", line 44, in <module>
    pickle.dump([trainID, trainX, trainY], fout, protocol=-1)
SystemError: error return without exception set

trainX is the large sparse matrix; the other two are lists 6 million elements long.

In [1]: trainX
Out[1]:
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
    with 286674296 stored elements in Compressed Sparse Row format>

At this point, Python's RAM usage is 4.6GB, and I have 16GB of RAM on my laptop.

I think I'm running into a known memory bug in cPickle where it doesn't work with objects that are too big. I tried marshal as well, but I don't think it works for scipy matrices. Can someone offer a solution, preferably with an example of how to save and load this?

Python 2.7.5

Mac OS 10.9

Thank you.

I had this problem with a multi-gigabyte Numpy matrix (Ubuntu 12.04 with Python 2.7.3). It seems to be this issue: https://github.com/numpy/numpy/issues/2396

I've solved it using numpy.savetxt() / numpy.loadtxt(). The matrix is compressed if you add a .gz extension to the file name when saving.
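A minimal sketch of that approach, assuming a small dense array as a stand-in for the real matrix (numpy.savetxt handles 1-D and 2-D arrays, not scipy sparse matrices). The file name and the temporary directory are placeholders chosen for illustration.

```python
import os
import tempfile

import numpy as np

# Small dense stand-in for the real multi-gigabyte matrix.
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)

# Hypothetical output path; the ".gz" suffix makes NumPy gzip the text file.
path = os.path.join(tempfile.mkdtemp(), "matrix.txt.gz")

np.savetxt(path, matrix)      # written as gzip-compressed text
restored = np.loadtxt(path)   # loadtxt decompresses ".gz" files transparently

print(np.allclose(matrix, restored))
```

Text output is considerably larger and slower than a binary format, so this is probably only worth it when pickling fails outright.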

Since I too had just a single matrix I did not investigate the use of HDF5.

Neither numpy.savetxt (which only works for arrays, not sparse matrices) nor sklearn.externals.joblib.dump (pickling, slow as hell and blew up memory usage) worked for me on Python 2.7.

Instead, I used scipy.sparse.save_npz and it worked just fine. Keep in mind that it only works for csc, csr, bsr, dia or coo matrices.
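For reference, a short sketch of the save_npz / load_npz round trip. The tiny matrix and the temporary path are stand-ins for illustration; trainX in the question would be passed the same way, assuming a SciPy version that ships these functions (0.19 or later).

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Small CSR stand-in for the large trainX matrix from the question.
dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 0.0, 3.25]])
trainX = sp.csr_matrix(dense)

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "trainX.npz")

sp.save_npz(path, trainX, compressed=True)  # stores the matrix's arrays, shape and format
loaded = sp.load_npz(path)                  # comes back in the same sparse format

print(loaded.format, loaded.shape, loaded.nnz)
```

save_npz stores only the sparse matrix, so the two lists from the question (trainID and trainY) would have to be saved separately, for example with numpy.save.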
