
Huge sparse matrix in Python

I need to iteratively construct a huge sparse matrix in numpy/scipy. The initialization is done within a loop:

from scipy.sparse import dok_matrix

def foo(*args):
    dim_x = 256*256*1024
    dim_y = 128*128*512
    matrix = dok_matrix((dim_x, dim_y))    

    for i in range(dim_x):
        # compute stuff in order to get j
        matrix[i, j] = 1.
    return matrix.tocsr()

Then I need to convert it to a csr_matrix for further computations like:

matrix = foo(...)
result = matrix.T.dot(x)
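On a toy scale (a hypothetical 4×3 matrix with one entry per row, standing in for the real data), the transpose-dot pattern behaves like:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for the real matrix: 4 x 3, one entry per row.
row = [0, 1, 2, 3]
col = [0, 2, 1, 2]
data = [1.0, 1.0, 1.0, 1.0]
m = csr_matrix((data, (row, col)), shape=(4, 3))

x = np.array([1.0, 2.0, 3.0, 4.0])  # length matches dim_x
result = m.T.dot(x)                 # length dim_y
print(result)  # [1. 3. 6.]
```

Note that `m.T.dot(x)` sums, for each column `j`, the `x[i]` of every row `i` that has an entry in that column.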

At the beginning this was working fine, but my matrices are getting bigger and bigger and my computer starts to crash. Is there a more elegant way of storing the matrix?

Basically I have the following requirements:

  • The matrix needs to store float values from 0.0 to 1.0
  • I need to compute the transpose of the matrix
  • I need to compute the dot product with an x-dimensional vector
  • The matrix dimensions can be around 1*10^9 x 1*10^8

I'm running out of RAM. I was reading several posts on Stack Overflow and the rest of the internet ;) I found PyTables, which isn't really made for matrix computations... etc. Is there a better way?

You may have hit the limits of what Python can do for you, or you may be able to do a little more. Try setting a datatype of np.float32; if you're on a 64-bit machine, this reduced precision may reduce your memory consumption. np.float16 may help you on memory even further, but your calculations may slow down (I've seen examples where processing may take 10x the amount of time):

    matrix = dok_matrix((dim_x, dim_y), dtype=np.float32)    

or possibly much slower, but with even less memory consumption:

    matrix = dok_matrix((dim_x, dim_y), dtype=np.float16)    
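The per-value savings of the narrower dtypes can be checked directly via `itemsize`:

```python
import numpy as np

# Bytes per stored value for each candidate dtype.
for dt in (np.float64, np.float32, np.float16):
    print(np.dtype(dt).name, np.dtype(dt).itemsize, "bytes")
# float64 8 bytes
# float32 4 bytes
# float16 2 bytes
```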

Another option: buy more system memory.


Finally, if you can avoid creating your matrix with dok_matrix and can instead create it with csr_matrix (I don't know if this is possible for your calculations), you may save a little overhead on the dict that dok_matrix uses.

For your case I would recommend using the data type np.int8 (or np.uint8), which requires only one byte per element:

matrix = dok_matrix((dim_x, dim_y), dtype=np.int8)
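As a minimal sketch (tiny dimensions and hypothetical entries, not the real data), an int8 dok_matrix is created and filled like this:

```python
import numpy as np
from scipy.sparse import dok_matrix

# Tiny stand-in matrix; the real dims come from the question.
m = dok_matrix((5, 5), dtype=np.int8)
m[0, 1] = 1
m[3, 4] = 1
print(m.dtype, m.nnz)  # dtype and number of stored entries
```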

Directly constructing the csr_matrix will also allow you to push the maximum matrix size further:

import numpy as np
from scipy.sparse import csr_matrix

def foo(*args):
    dim_x = 256*256*1024
    dim_y = 128*128*512
    row = []
    col = []

    for i in range(dim_x):
        # compute stuff in order to get j
        row.append(i)
        col.append(j)
    data = np.ones_like(row, dtype=np.int8)

    return csr_matrix((data, (row, col)), shape=(dim_x, dim_y), dtype=np.int8)
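A scaled-down version of the same construction (tiny dims, with a stand-in rule `j = i % dim_y` in place of the question's real computation of `j`) shows the whole pipeline end to end:

```python
import numpy as np
from scipy.sparse import csr_matrix

def foo_small(dim_x=6, dim_y=3):
    row, col = [], []
    for i in range(dim_x):
        j = i % dim_y        # stand-in for the real computation of j
        row.append(i)
        col.append(j)
    data = np.ones_like(row, dtype=np.int8)
    return csr_matrix((data, (row, col)), shape=(dim_x, dim_y), dtype=np.int8)

m = foo_small()
x = np.ones(6)
print(m.T.dot(x))  # [2. 2. 2.] -- each column collects two ones
```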
