Construct sparse matrix on disk on the fly in Python
I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~(2M, 5M). I'm constructing this matrix column by column while reading a corpus of 5M documents. For this purpose I use the sparse dok_matrix data structure from SciPy. However, on reaching the 500,000th document, my memory is full (approx. 30 GB is used) and the program crashes. What I eventually want to do is run a dimensionality reduction algorithm on the matrix using sklearn but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, as sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I did not succeed.
It is impossible for me to save the entire matrix in a dense format, since this would require 40 TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).
My question is now: how can I construct a sparse matrix on the fly, writing directly to disk instead of memory, such that I can use it afterwards in sklearn?
Thanks!
We've come across similar problems in single-cell genomics, where we deal with large sparse datasets on disk. I'll show you a small, simple example of how I would deal with this. My assumptions are that you're very memory constrained and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.
I would construct an on-disk sparse CSC matrix column by column. A sparse CSC matrix uses 3 underlying arrays:
data: the values stored in the matrix
indices: the row index for each value in the matrix
indptr: an array of length n_cols + 1, which divides indices and data by the column they belong to
As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found with indices[indptr[i]:indptr[i+1]].
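To make the layout concrete, here is a minimal sketch on a tiny 3x3 matrix (the matrix itself is made up for illustration) that prints the three arrays SciPy stores for a CSC matrix and slices out one column the way described above.

```python
import numpy as np
from scipy import sparse

# A small dense matrix, used only to illustrate the CSC layout.
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]], dtype=np.float32)
csc = sparse.csc_matrix(dense)

print(csc.data)     # [1. 4. 5. 2. 3.]  values, stored column after column
print(csc.indices)  # [0 2 2 0 1]       row index of each value
print(csc.indptr)   # [0 2 3 5]         column i spans indptr[i]:indptr[i+1]

# Values and row indices of column 2
i = 2
col_values = csc.data[csc.indptr[i]:csc.indptr[i + 1]]
col_rows = csc.indices[csc.indptr[i]:csc.indptr[i + 1]]
print(col_rows, col_values)  # [0 1] [2. 3.]
```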
To simulate your data generating process (parsing a document, I assume) I'll define a function process_document which returns the values of indices and data for the relevant document.
import numpy as np
import h5py
from scipy import sparse
from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation

def process_document():
    """
    Simulate processing a document. Results in sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()
Now I'll create a group in an HDF5 file which will store the constituent arrays of a sparse matrix.
def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int)  # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g
And finally a function for reading this group as a sparse matrix (this one is pretty simple).
def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])
Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).
N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    # Create the group under the name that was passed in
    group = make_sparse_csc_group(f, groupname, shape)
    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        # Grow the on-disk arrays, then write the new column at the end
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num+1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(N_COLS), (2_000_000, N_COLS))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
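If even one full copy does not fit in memory, the same layout lets you load only a block of columns at a time. The following is a minimal sketch, not part of the original answer: read_csc_columns is a hypothetical helper that works on any three sliceable 1-D arrays. It is demonstrated on plain NumPy arrays; the assumption is that passing g["data"], g["indices"] and g["indptr"] from the HDF5 group above behaves the same way, since h5py datasets support slicing.

```python
import numpy as np
from scipy import sparse

def read_csc_columns(data, indices, indptr, n_rows, start, stop):
    """Load columns [start, stop) of a CSC matrix stored as three 1-D arrays.

    Hypothetical helper: only the slices that belong to the requested
    columns are read, so the rest of the matrix can stay on disk.
    """
    ptr = np.asarray(indptr[start:stop + 1])
    lo, hi = int(ptr[0]), int(ptr[-1])
    return sparse.csc_matrix(
        (np.asarray(data[lo:hi]), np.asarray(indices[lo:hi]), ptr - lo),
        shape=(n_rows, stop - start),
    )

# Demo on an in-memory matrix standing in for the on-disk arrays
full = sparse.random(50, 8, density=0.2, format="csc",
                     dtype=np.float32, random_state=0)
block = read_csc_columns(full.data, full.indices, full.indptr,
                         full.shape[0], 2, 5)
print(block.shape)  # (50, 3)
```

Processing the matrix in such column blocks only helps if the downstream algorithm can consume the data incrementally.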
Again, this assumes a very memory-constrained situation, where you might not be able to fit the entire sparse matrix in memory when creating it. If you can hold the entire sparse matrix plus at least one copy, a much faster way is to not bother with on-disk storage at all (similar to other suggestions). However, a slight modification of this code should give you better performance:
def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1]+1, dtype=int)
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num+1] = n_total
    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)
    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))
This should be fairly fast, since it only makes a copy of the data once, when you concatenate the arrays. The other solutions posted so far reallocate the arrays as you process, making many copies of large arrays.
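To decide between the on-disk and in-memory routes, it can help to estimate the size of the three CSC arrays up front. The sketch below is only back-of-the-envelope arithmetic: csc_nbytes is a hypothetical helper, the 200 non-zeros per document is an assumed figure (not taken from the question), and the byte sizes match the float32 values and int64 indices used above.

```python
def csc_nbytes(nnz, n_cols, value_bytes=4, index_bytes=8):
    """Bytes needed for the data, indices and indptr arrays of a CSC matrix."""
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes

# Hypothetical corpus: 5M documents with 200 non-zero terms each (an assumption).
n_docs = 5_000_000
nnz = n_docs * 200
total = csc_nbytes(nnz, n_docs)
print(f"{total / 1e9:.2f} GB")  # 12.04 GB
```

The in-memory version needs roughly twice this amount at its peak, because np.concatenate copies the per-document arrays into the final ones.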
It would be great if you could provide minimal working code. I can't tell whether your matrix gets too big because of how it is constructed (1) or simply because you have too much data (2). If you don't really care about building this matrix yourself, you can go straight to remark 2.
For problem (1), in the example code below, I made a wrapper class that builds a csr_matrix chunk by chunk. The idea is to keep appending (row, column, data) arrays to lists until a buffer limit (see remark 1) is reached, and only then actually update the matrix. When the limit is reached, the amount of data in memory shrinks, since the csr_matrix constructor sums entries that share the same (row, column) pair. This part only lets you construct the sparse matrix quickly (much faster than creating a sparse matrix for each row) and avoids memory errors caused by redundant (row, column) pairs when a word appears several times in a document.
import numpy as np
import scipy.sparse

class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            self._flush()

    def _flush(self):
        # Merge the buffered triplets into the matrix; nothing to do if the buffer is empty
        if not self.data_temp:
            return
        # csr_matrix expects (data, (row_ind, col_ind)); duplicate entries are summed
        self.sparse_matrix += scipy.sparse.csr_matrix(
            (np.concatenate(self.data_temp),
             (np.concatenate(self.row_indices_temp),
              np.concatenate(self.col_indices_temp))),
            shape=self.shape
        )
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def get_matrix(self):
        self._flush()
        return self.sparse_matrix
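The builder relies on the csr_matrix constructor summing entries that share a (row, column) pair. A minimal sketch with made-up values shows that behaviour in isolation:

```python
import numpy as np
from scipy import sparse

# Three of the four triplets point at the same cell (0, 2).
rows = np.array([0, 0, 1, 0])
cols = np.array([2, 2, 1, 2])
vals = np.array([1.0, 2.0, 5.0, 4.0], dtype=np.float32)

m = sparse.csr_matrix((vals, (rows, cols)), shape=(2, 3))
print(m.nnz)        # 2, the duplicates collapsed into one stored entry
print(m.toarray())  # cell (0, 2) holds 1 + 2 + 4 = 7, cell (1, 1) holds 5
```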
For problem (2), you can easily extend this class by adding a save method that stores the matrix on disk once the limit (or a second limit) is reached. You'll then end up with multiple sparse matrix chunks on disk, and you'll need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).
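One way such a save step could look is sketched below, using scipy.sparse.save_npz and load_npz. The helper names (save_chunk, iter_chunks), the file naming scheme and the choice to stack chunks row-wise are all assumptions for illustration, not something the class above already provides.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

def save_chunk(directory, chunk_id, matrix):
    """Write one sparse chunk to <directory>/chunk_<id>.npz and return the path."""
    path = os.path.join(directory, f"chunk_{chunk_id:05d}.npz")
    sparse.save_npz(path, matrix.tocsr())
    return path

def iter_chunks(paths):
    """Yield the chunks one at a time, so only one is in memory at once."""
    for path in paths:
        yield sparse.load_npz(path)

# Demo with three small random chunks standing in for real data
tmpdir = tempfile.mkdtemp()
chunks = [sparse.random(4, 6, density=0.3, format="csr",
                        dtype=np.float32, random_state=i) for i in range(3)]
paths = [save_chunk(tmpdir, i, c) for i, c in enumerate(chunks)]

# Stacking everything again is only for checking; it defeats the purpose on real data
restored = sparse.vstack(list(iter_chunks(paths)), format="csr")
print(restored.shape)  # (12, 6)
```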
remark 1: the buffer limit here is not really well defined. It would be better to compare the actual size of the numpy arrays in data_temp, col_indices_temp and row_indices_temp with the RAM available on the machine (which is quite easy to automate in Python).
remark 2: gensim is a Python library whose great advantage is that it uses chunked files for building NLP models. So you could build a dictionary, construct a sparse matrix and reduce its dimension with that library, without needing much RAM.
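The size check suggested in remark 1 could be sketched as follows. Both helpers are hypothetical, and the byte budget is a number you would pick yourself (for example a fraction of the machine's RAM); the sketch does not query the available memory.

```python
import numpy as np

def buffered_nbytes(*array_lists):
    """Total bytes held by all numpy arrays in the given lists."""
    return sum(arr.nbytes for arrays in array_lists for arr in arrays)

def should_flush(byte_budget, *array_lists):
    """True once the buffered arrays reach the chosen byte budget."""
    return buffered_nbytes(*array_lists) >= byte_budget

# Buffers shaped like the ones kept by the builder class
data_temp = [np.zeros(1000, dtype=np.float32)]
col_indices_temp = [np.zeros(1000, dtype=np.int64)]
row_indices_temp = [np.zeros(1000, dtype=np.int64)]

print(buffered_nbytes(data_temp, col_indices_temp, row_indices_temp))  # 20000
print(should_flush(16_000, data_temp, col_indices_temp, row_indices_temp))  # True
```

In the builder, a check like this could replace the len(self.data_temp) == self.build_size_limit test.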
I'm assuming that all your data can fit in memory using a more memory-friendly sparse matrix format such as COO. If it does not, there is almost no hope that you will be able to proceed with sklearn, even by using mmap. Indeed, sklearn will likely create subsequent objects with memory requirements of the same order of magnitude as your input.
SciPy's dok_matrix is actually a subclass of the vanilla dict. It stores the data using individual Python objects and tons of pointers, so it is not memory efficient. The most compact representation is the coo_matrix format. You can incrementally build the data required to create a COO matrix by pre-allocating arrays for the coordinates (rows and cols) and the data, and growing these buffers if your initial guess was wrong.
import numpy
from scipy.sparse import coo_matrix

def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            # Buffers are full: double their capacity
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    # Trim the unused tail of the buffers
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    # The shape is inferred from the largest row and column index seen
    return coo_matrix((data, (rows, cols)))

def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data
which you can test with randomly-generated data like this:
def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)
Once you have your COO matrix, you may want to convert it to CSR using coo.tocsr(). CSR matrices are better optimized for common operations such as the dot product. CSR requires a bit more memory when some rows were originally empty, because it stores pointers for all rows, even empty ones.
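The point about empty rows can be seen on a tiny example. This is a minimal sketch with made-up values: only two of the five rows hold data, yet the CSR row-pointer array still has one entry per row plus one.

```python
import numpy as np
from scipy import sparse

# A 5x4 matrix with entries in rows 0 and 3 only
rows = np.array([0, 3])
cols = np.array([1, 2])
vals = np.array([1.5, 2.5], dtype=np.float32)

coo = sparse.coo_matrix((vals, (rows, cols)), shape=(5, 4))
csr = coo.tocsr()

print(csr.indptr)       # [0 1 1 1 2 2]
print(len(csr.indptr))  # 6, i.e. n_rows + 1 regardless of how many rows are empty
```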
See here; at the end, he explains how to store a sparse matrix in an HDF5 file and read it back directly.