

Python: How to save 2d numpy array efficiently to disk?

I have a huge 2d numpy array that's supposed to work as a co-occurrence matrix. I've tried to use scipy.sparse as my data structure, but dok_matrix indexing is incredibly slow (about 4x slower).

# Impossible: this would need roughly N*N*4 bytes (~4 TB) of RAM
import numpy as np
N = 1000000  # 1 million
coo = np.zeros((N, N), dtype=np.uint32)

I want to persist this array.

After searching for ways to save it I tried to use PyTables or h5py, but I couldn't find a way to save it without running out of memory.

with open(name, 'wb') as _file:
    np.save(_file, coo)

For instance, using PyTables:

    import tables
    _file = tables.open_file(
                name,
                mode='w',
                title='Co-occurrence matrix')
    atom = tables.Atom.from_dtype(coo.dtype)
    _filters = tables.Filters(complib='blosc', complevel=5)
    ds = _file.create_earray(
            _file.root,
            'coo_matrix',
            atom,
            shape=(0, coo.shape[-1]),
            expectedrows=coo.shape[-1],
            filters=_filters)
    # ds[:] = coo => not an option
    for _index, _data in enumerate(coo):
        ds.append(_data[np.newaxis, :])
    _file.close()

And using h5py:

import h5py
h5f = h5py.File(name, 'w')
h5f.create_dataset('dataset_1', data=coo)

Both methods keep increasing memory usage until I have to kill the process. So, is there any way to do it incrementally? If that's not possible, can you recommend another way to persist this matrix?

EDIT

I'm creating this co-occurrence matrix like this:

    from itertools import combinations

    coo = np.zeros((N, N), dtype=np.uint32)
    for doc_id, doc in enumerate(self.w.get_docs()):
        for w1, w2 in combinations(doc, 2):
            if w1 != w2:
                coo[w1, w2] += 1

I want to save coo (a 2d numpy array) so I can retrieve it from disk later and look up co-occurrence values, like coo[w1, w2].

np.save is a fast, efficient way of saving a dense array. All it does is write a small header and then the data buffer of the array.
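For a quick sense of what that looks like, here is a minimal sketch with a small toy array (the real matrix would not fit in memory; the file name is illustrative). np.load's mmap_mode even lets you read single elements back without loading the whole file:

    import numpy as np

    small = np.zeros((1000, 1000), dtype=np.uint32)
    small[3, 7] = 42
    np.save('coo_small.npy', small)      # writes a small header, then the raw buffer

    # memory-map the file: elements are read from disk on demand
    loaded = np.load('coo_small.npy', mmap_mode='r')
    print(loaded[3, 7])                  # 42, without reading the full array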

But for a large array, that data buffer will be N*N*4 bytes (for your dtype) in one contiguous memory block; with N = 1,000,000 that is about 4 * 10^12 bytes, roughly 4 TB. That design is also good for element access - the code knows exactly where the i,j element is located.

Beware that np.zeros((N, N)) does not allocate all the necessary memory at once. Memory use may grow during use (including while saving).

np.savez does not help with data storage. It does a save for each variable, and collects the resulting files in a zip archive (which may also be compressed).

Tables and h5py can save and load chunks, but that doesn't help if you have to have the whole array in memory at some point, whether for creation or for use.
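That said, if you can produce the matrix one block of rows at a time, a chunked, compressed HDF5 dataset lets you write and read incrementally without ever holding the full array. A rough sketch with h5py, where count_block() is a hypothetical function returning one slab of rows, and the block size and file name are only illustrative:

    import numpy as np
    import h5py

    N = 1000000
    BLOCK = 256                          # rows written per step; illustrative

    with h5py.File('coo.h5', 'w') as h5f:
        dset = h5f.create_dataset(
            'coo_matrix', shape=(N, N), dtype=np.uint32,
            chunks=(BLOCK, BLOCK), compression='gzip')
        for start in range(0, N, BLOCK):
            # count_block(): hypothetical, returns the (BLOCK, N) slab of
            # counts for rows start .. start+BLOCK
            dset[start:start + BLOCK, :] = count_block(start, start + BLOCK)

    # later: look up single values straight from disk
    with h5py.File('coo.h5', 'r') as h5f:
        print(h5f['coo_matrix'][123, 456])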

Since your array will be very sparse, a scipy sparse matrix could save on memory, since it only stores the nonzero elements. But it also has to store each element's coordinates, so the storage per nonzero element isn't as compact. There are a number of formats, each with its pros and cons.

dok uses a Python dictionary to store data, with keys of the form (i, j). It is one of the better formats for incrementally building a sparse matrix. I found in other SO questions that element access with a dok is slower than with a plain dictionary. It is faster to build a regular dictionary, and then update the dok.
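A rough sketch of that idea: count into a plain dict first, and only then move the counts into a dok_matrix by item assignment (docs stands for your iterable of token-id sequences; the names are illustrative):

    from collections import defaultdict
    from itertools import combinations

    import numpy as np
    from scipy import sparse

    N = 1000000
    counts = defaultdict(int)            # plain dict: fast incremental updates
    for doc in docs:                     # docs: your iterable of token-id lists
        for w1, w2 in combinations(doc, 2):
            if w1 != w2:
                counts[(w1, w2)] += 1

    # transfer the finished counts into a sparse matrix
    coo_dok = sparse.dok_matrix((N, N), dtype=np.uint32)
    for (w1, w2), c in counts.items():
        coo_dok[w1, w2] = c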

lil is another good format for incremental builds. It stores the data in 2 lists of lists.

coo is convenient for building a matrix, once you have a full set of i,j,data arrays.
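For example, the same counts dict can be turned into three flat arrays and handed to coo_matrix in one go (duplicate (i, j) pairs, if any, are summed when converting to csr/csc). A sketch reusing counts and N from the previous snippet:

    import numpy as np
    from scipy import sparse

    rows = [w1 for (w1, w2) in counts]
    cols = [w2 for (w1, w2) in counts]
    data = list(counts.values())

    coo_sp = sparse.coo_matrix((data, (rows, cols)), shape=(N, N), dtype=np.uint32)
    csr = coo_sp.tocsr()                 # convert once counting is done
    print(csr[10, 20])                   # look up a co-occurrence count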

csr and csc are good for computation (esp. linear algebra kinds) and for element access, but they are not good for changing sparsity (adding nonzero elements).

But you can build a matrix in one format and readily convert it to another for use or storage.

There have been SO questions about storing sparse matrices. The easiest is with the MATLAB compatible .mat format (csc for sparse). To use np.save, you need to save the underlying arrays (for coo, csc, csr formats). Python pickle has to be used to save dok or lil.
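A sketch of both routes, reusing the csr matrix from the previous snippet (file names are illustrative; newer scipy versions also provide scipy.sparse.save_npz / load_npz, which do essentially the same as option 2):

    import numpy as np
    from scipy import io, sparse

    # option 1: MATLAB-compatible .mat file (sparse matrices are stored as csc)
    io.savemat('coo.mat', {'coo': csr.tocsc()})
    loaded = io.loadmat('coo.mat')['coo']

    # option 2: save the underlying arrays of the coo format with np.savez
    m = csr.tocoo()
    np.savez('coo_arrays.npz', row=m.row, col=m.col, data=m.data, shape=m.shape)

    with np.load('coo_arrays.npz') as f:
        rebuilt = sparse.coo_matrix(
            (f['data'], (f['row'], f['col'])),
            shape=tuple(f['shape'])).tocsr()
    print(rebuilt[10, 20])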

Do a search on [scipy] large sparse to see other SO questions about this kind of matrix. You aren't the first to use numpy/scipy for co-occurrence calculations on documents (it's one of the 3 main uses of scipy sparse, the others being linear algebra and machine learning).
