
Construct sparse matrix on disk on the fly in Python

I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of roughly (2M, 5M). I'm constructing this matrix column by column while reading a corpus of 5M documents. For this purpose I use a sparse dok_matrix data structure from SciPy. However, when arriving at the 500,000th document, my memory is full (approx. 30GB is used) and the program crashes. What I eventually want to do is perform a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, as sklearn supports this, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not succeed in doing this.

It is impossible for me to save the entire matrix in a dense format, since this would require 40TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).

My question is now: how can I construct a sparse matrix on the fly, writing directly to disk instead of memory, such that I can use it afterwards in sklearn?

Thanks!

We've come across similar problems in the field of single-cell genomics, dealing with large sparse datasets on disk. I'll show you a small, simple example of how I would deal with this. My assumptions are that you're very memory constrained, and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.

I would construct an on-disk sparse CSC matrix column by column. A sparse CSC matrix uses three underlying arrays:

  • data: the values stored in the matrix
  • indices: the row index for each value in the matrix
  • indptr: an array of length n_cols + 1, which partitions indices and data by the column they belong to

As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found in indices[indptr[i]:indptr[i+1]].
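
A tiny concrete example of this layout, using a hypothetical 3x2 matrix just to inspect the three arrays:

import numpy as np
from scipy import sparse

m = sparse.csc_matrix(np.array([[1., 0.],
                                [0., 2.],
                                [3., 0.]], dtype=np.float32))
print(m.data)     # [1. 3. 2.]
print(m.indices)  # [0 2 1] -> row index of each stored value
print(m.indptr)   # [0 2 3] -> column i occupies data[indptr[i]:indptr[i+1]]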

To simulate your data generating process (parsing a document, I assume) I'll define a function process_document which returns the values for indices and data for the relevant document.

import numpy as np
import h5py
from scipy import sparse

from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation

def process_document():
    """
    Simulate processing a document. Results in a sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()

Now I'll create a group in an HDF5 file which will store the constituent arrays of the sparse matrix.

def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int) # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g

And finally, a function for reading this group back as a sparse matrix (this one is pretty simple).

def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])

Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).

N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)

    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num+1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(N_COLS), (2_000_000, N_COLS))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])

Again, this is written for a very memory-constrained situation, where you might not be able to fit the entire sparse matrix in memory while creating it. If you can hold the entire sparse matrix plus at least one copy, a much faster way would be to skip the on-disk storage (similar to other suggestions). However, a slight modification of this code should give you better performance:

def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1]+1, dtype=int)
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num+1] = n_total

    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)

    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))

This should be fairly fast, since it only makes a copy of the data once you concatenate the arrays. Other currently posted solutions reallocate the arrays as you process, making many copies of large arrays.

It would be great if you could provide a minimal working example. I can't tell whether your matrix gets too big by construction (1) or just because you have too much data (2). If you don't really care about building this matrix yourself, you can skip straight to my remark 2.

For problem (1), in the example code below, I made a wrapper class to build a csr_matrix chunk by chunk. The idea is to keep appending (row, column, data) lists until a buffer limit (see remark 1) is reached, and only then actually update the matrix. At that point the data in memory shrinks, since the csr_matrix constructor sums entries that share the same (row, column) coordinates. This lets you construct the sparse matrix quickly (much faster than creating a sparse matrix per row) and avoids memory errors caused by redundant (row, column) pairs when a word appears several times in a document.

import numpy as np
import scipy.sparse

class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def _flush(self):
        """Fold the buffered triplets into the matrix and clear the buffers."""
        if not self.data_temp:
            return
        # csr_matrix((data, (row_ind, col_ind))) sums duplicate (row, col) entries.
        self.sparse_matrix += scipy.sparse.csr_matrix(
            (np.concatenate(self.data_temp),
             (np.concatenate(self.row_indices_temp),
              np.concatenate(self.col_indices_temp))),
            shape=self.shape
        )
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            self._flush()

    def get_matrix(self):
        self._flush()
        return self.sparse_matrix
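
For illustration, a hypothetical usage sketch (the toy corpus and the counts-per-word encoding are made up; note that duplicate (row, column) pairs are summed when the buffer is folded into the matrix):

builder = SparseMatrixBuilder(shape=(2_000_000, 5_000_000), build_size_limit=1000)

for doc_id, word_ids in enumerate([[3, 17, 3], [42, 7]]):  # stand-in for parsed documents
    rows = np.full(len(word_ids), doc_id)
    cols = np.array(word_ids)
    counts = np.ones(len(word_ids), dtype=np.float32)
    builder.add(counts, cols, rows)

mtx = builder.get_matrix()  # word 3 in document 0 ends up with a count of 2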

For problem (2), you can easily extend this class by adding a save method that stores the matrix on disk once the limit (or a second limit) is reached. That way you'll end up with multiple chunks of sparse matrices on disk. Then you'll need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).
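
A rough sketch of what such a save method could look like, assuming scipy.sparse.save_npz for the on-disk format and a hypothetical chunk_dir naming scheme:

import os

# Method to add to SparseMatrixBuilder.
def save_chunk(self, chunk_dir, chunk_id):
    """Flush the buffered data and write the current chunk to disk as a .npz file."""
    matrix = self.get_matrix()
    scipy.sparse.save_npz(os.path.join(chunk_dir, f"chunk_{chunk_id}.npz"), matrix)
    # Start a fresh, empty matrix for the next chunk.
    self.sparse_matrix = scipy.sparse.csr_matrix(self.shape)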

remark 1: the buffer limit here is not really well defined. It would be better to check the actual size of the numpy arrays in data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (which is quite easy to automate in Python).
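
One way to make that check concrete, assuming the psutil package is available for querying free RAM:

import psutil  # assumption: psutil is installed

def buffer_too_large(arrays, fraction=0.5):
    """Return True once the buffered arrays use more than `fraction` of the free RAM."""
    buffered_bytes = sum(a.nbytes for a in arrays)
    return buffered_bytes > fraction * psutil.virtual_memory().available

# Inside add(), this could replace the fixed build_size_limit check:
# if buffer_too_large(self.data_temp + self.col_indices_temp + self.row_indices_temp): ...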

remark 2: gensim is a Python library that has the great advantage of using chunked files to build NLP models. So you could build a dictionary, construct a sparse matrix and reduce its dimensionality with that library, without needing much RAM.
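
As a rough illustration of that workflow (the tokenized documents below are a stand-in; in practice gensim accepts any iterable that yields bag-of-words documents, so the corpus never has to sit in RAM):

from gensim import corpora, models

tokenized_docs = [["sparse", "matrix", "on", "disk"],
                  ["reduce", "matrix", "dimension"]]

dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Streamed dimensionality reduction (LSI) without building the full matrix in memory.
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)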

I'm assuming that all your data can fit in memory using a more memory-friendly sparse matrix format such as COO. If it does not, there is almost no hope you will be able to proceed with sklearn, even by using mmap. Indeed, sklearn will likely create subsequent objects with memory requirements of the same order of magnitude as your input.

SciPy's dok_matrix is actually a subclass of the vanilla dict. It stores the data using individual Python objects and tons of pointers, so it is not memory efficient. The most compact representation is the coo_matrix format. You can incrementally build the data required to create a COO matrix by pre-allocating arrays for the coordinates (rows and cols) and the data, and grow these buffers if your initial guess was wrong.


import numpy
from scipy.sparse import coo_matrix


def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))


def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data

which you can test with randomly generated data like this:

def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)

Once you have your COO matrix, you may want to convert it to CSR using coo.tocsr(). CSR matrices are better optimized for common operations such as dot products. They require a bit more memory if some rows were originally empty, because CSR stores pointers for all rows, even empty ones.
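
For example, a quick check of the conversion using the coo matrix built above:

csr = coo.tocsr()
print(csr.indptr.shape)    # (n_rows + 1,): one pointer per row, even for empty rows
print(csr.nnz == coo.nnz)  # True: the stored values themselves are unchanged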

Here, at the end, he explains how to store a sparse matrix to an HDF5 file and read it back directly.
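
The linked explanation is not reproduced here, but the general idea is to store the three underlying CSR arrays as HDF5 datasets and rebuild the matrix when reading. A minimal sketch under that assumption (the dataset names are arbitrary):

import h5py
from scipy import sparse

def save_csr_to_hdf5(path, mtx):
    """Write the CSR arrays of `mtx` to an HDF5 file."""
    with h5py.File(path, "w") as f:
        f.attrs["shape"] = mtx.shape
        f.create_dataset("data", data=mtx.data)
        f.create_dataset("indices", data=mtx.indices)
        f.create_dataset("indptr", data=mtx.indptr)

def load_csr_from_hdf5(path):
    """Rebuild the CSR matrix from the stored arrays."""
    with h5py.File(path, "r") as f:
        return sparse.csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]),
            shape=tuple(f.attrs["shape"]))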
