Efficient slicing of symmetric sparse matrices

Question

I have a list of sparse symmetric matrices sigma such that

len(sigma) = N

and for all i, j, k,

sigma[i].shape[0] == sigma[i].shape[1] == m  # Square
sigma[i][j,k] == sigma[i][k,j]  # Symmetric

I have an indexing array P such that

P.shape[0] = N
P.shape[1] = k

My objective is to extract the kxk dense submatrices of sigma[i] using the indexing given by P[i,:]. This can be done as follows:

sub_matrices = np.empty([N,k,k])
for i in range(N):
    sub_matrices[i,:,:] = sigma[i][np.ix_(P[i,:], P[i,:])].todense()

Note however that while k is small, N (and m) are very large. If the sparse symmetric matrices are stored in CSR format, this takes a very long time. I feel there must be a better solution. For example, is there a sparse format that lends itself well to symmetric matrices that need to be sliced on both dimensions?

I am using Python but would be open to any C library suggestions that I could interface with using Cython.

EXTRA

Note that my current Cython approach is as follows:

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False) # turn off bounds-checking for entire function
cpdef sparse_slice_fast_cy(sigma,
                           long[:,:] P,
                           double[:,:,:] sub_matrices):
    """
    Inputs:
        sigma: A list (N,) of sparse sp.csr_matrix (m x m)
        P: A 2D array of integers (N, k)
        sub_matrices: A 3D array of doubles (N, k, k) containing the slicing
    """
    # Create variables for keeping code tidy
    cdef long N = P.shape[0]
    cdef long k = P.shape[1]

    cdef long i
    cdef long j
    cdef long index_pointer 
    cdef long sparse_row_pointer

    # Create objects for holding sparse matrix data
    cdef double[:] data
    cdef long[:] indices
    cdef long[:] indptr

    # Object for the ordered P
    cdef long[:] perm

    # Make sure sub_matrices is all 0
    sub_matrices[:] = 0

    for i in range(N):
        # Sort the P
        perm = np.argsort(P[i,:])

        # Get the sparse matrix values
        data     = sigma[i].data
        indices  = sigma[i].indices.astype(long)
        indptr   = sigma[i].indptr.astype(long)

        for j in range(k):
            # Walk along row P[i, perm[j]] of sigma[i], searching for the
            # columns listed in P[i, :], i.e. compare
            #     sigma[i][P[i, perm[j]], :]
            # against
            #     P[i, :]

            # To do this we need our sparse row vector with columns
            #     indices[indptr[P[i, perm[j]]]:indptr[P[i, perm[j]] + 1]]
            # and data/values
            #     data[indptr[P[i, perm[j]]]:indptr[P[i, perm[j]] + 1]]
            # which comes from the csr matrix format.
            # We also need our sorted indexing vector
            #     P[i, perm[:]]

            # We begin by pointing at the top of both
            # our vectors and gradually move down them. In the event of 
            # an equality we add the data to sub_matrices[i,:,:] and 
            # increment the INDEXING VECTOR pointer, not the sparse
            # row vector pointer, as there can be multiple values that 
            # are the same in the indexing vector but not the sparse row
            # column vector (only 1 column can appear in 1 row!).
            index_pointer = 0
            sparse_row_pointer = indptr[P[i, perm[j]]]

            while ((index_pointer < k) and (sparse_row_pointer < indptr[P[i, perm[j]] + 1])):
                if indices[sparse_row_pointer] == P[i, perm[index_pointer]]:
                    # We can add data to sub_matrices
                    sub_matrices[i, perm[j], perm[index_pointer]] = \
                           data[sparse_row_pointer]

                    # Only increment the index pointer
                    index_pointer += 1
                elif indices[sparse_row_pointer] > P[i, perm[index_pointer]]:
                    # Need to increment index pointer
                    index_pointer += 1
                else:
                    # Need to increment sparse row pointer
                    sparse_row_pointer += 1

I believe that np.argsort may be inefficient when called often on relatively small vectors, and I would like to swap it for a C implementation. I also don't take advantage of parallel processing, which could potentially speed things up over the N sparse matrices. Unfortunately, as there are Python coercions inside the outer loop, I don't know how I can use prange.

Another point to note is that the Cython approach seems to use a HUGE amount of memory, but I have no idea where it's getting allocated.
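For checking the Cython routine on small inputs, the two-pointer merge described in its comments can be sketched in plain Python on raw CSR arrays. This is only a reference sketch: the name slice_csr_rows and the tiny 4x4 example are made up for illustration, and it assumes the column indices within each CSR row are sorted.

```python
def slice_csr_rows(data, indices, indptr, p):
    """Dense k x k block sigma[p, :][:, p] taken from raw CSR arrays.

    Assumes column indices are sorted within each row; p may contain repeats.
    """
    k = len(p)
    perm = sorted(range(k), key=lambda t: p[t])  # argsort of p
    out = [[0.0] * k for _ in range(k)]
    for j in range(k):
        row = p[j]
        ip = 0                 # position in the sorted indexing vector
        sp = indptr[row]       # position in the sparse row
        end = indptr[row + 1]
        while ip < k and sp < end:
            col = p[perm[ip]]
            if indices[sp] == col:
                out[j][perm[ip]] = data[sp]
                ip += 1        # keep sp: p may repeat this column
            elif indices[sp] > col:
                ip += 1        # column col is not stored in this row
            else:
                sp += 1        # stored column is not requested
    return out

# Hypothetical 4x4 symmetric matrix and its CSR arrays.
dense = [[1, 0, 2, 0],
         [0, 0, 3, 0],
         [2, 3, 0, 4],
         [0, 0, 4, 5]]
data = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 4.0, 5.0]
indices = [0, 2, 2, 0, 1, 3, 2, 3]
indptr = [0, 2, 3, 6, 8]

p = [2, 0, 2]  # repeated index on purpose
block = slice_csr_rows(data, indices, indptr, p)
```

The result can be compared entry by entry against dense[p[a]][p[b]], which is what the slow np.ix_ version computes.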

Latest Version

As per the suggestions of ead, below is the latest version of the Cython code:

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False) # turn off bounds-checking for entire function
cpdef sparse_slice_fast_cy(sigma,
                           np.ndarray[np.int32_t, ndim=2] P,
                           np.float64_t[:,:,:] sub_matrices,
                           int symmetric):
    """
    Inputs:
        sigma: A list (N,) of sparse sp.csr_matrix (m x m)
        P: A 2D array of integers (N, k)
        sub_matrices: A 3D array of doubles (N, k, k) containing the slicing
        symmetric: 1 if the sigma matrices are symmetric
    """
    # Create variables for keeping code tidy
    cdef np.int32_t N = P.shape[0]
    cdef np.int32_t k = P.shape[1]

    cdef np.int32_t i
    cdef np.int32_t j
    cdef np.int32_t index_pointer 
    cdef np.int32_t sparse_row_pointer

    # Create objects for holding sparse matrix data
    cdef np.float64_t[:] data
    cdef np.int32_t[:] indices

    cdef np.int32_t[:] indptr

    # Object for the ordered P
    cdef np.int32_t[:,:] perm = np.argsort(P, axis=1).astype(np.int32)

    # Make sure sub_matrices is all 0
    sub_matrices[:] = 0

    for i in range(N):
        # Get the sparse matrix values
        data     = sigma[i].data
        indices  = sigma[i].indices
        indptr   = sigma[i].indptr

        for j in range(k):
            # Walk along row P[i, perm[i, j]] of sigma[i], searching for the
            # columns listed in P[i, :], i.e. compare
            #     sigma[i][P[i, perm[i, j]], :]
            # against
            #     P[i, :]

            # To do this we need our sparse row vector with columns
            #     indices[indptr[P[i, perm[i, j]]]:indptr[P[i, perm[i, j]] + 1]]
            # and data/values
            #     data[indptr[P[i, perm[i, j]]]:indptr[P[i, perm[i, j]] + 1]]
            # which comes from the csr matrix format.
            # We also need our sorted indexing vector
            #     P[i, perm[i, :]]

            # We begin by pointing at the top of both
            # our vectors and gradually move down them. In the event of 
            # an equality we add the data to sub_matrices[i,:,:] and 
            # increment the INDEXING VECTOR pointer, not the sparse
            # row vector pointer, as there can be multiple values that 
            # are the same in the indexing vector but not the sparse row
            # column vector (only 1 column can appear in 1 row!).

            if symmetric:
                index_pointer = j  # Only search upper triangular
            else:
                index_pointer = 0
            sparse_row_pointer = indptr[P[i, perm[i, j]]]

            while ((index_pointer < k) and (sparse_row_pointer < indptr[P[i, perm[i, j]] + 1])):
                if indices[sparse_row_pointer] == P[i, perm[i, index_pointer]]:
                    # We can add data to sub_matrices
                    sub_matrices[i, perm[i, j], perm[i, index_pointer]] = \
                           data[sparse_row_pointer]

                    if symmetric:
                        sub_matrices[i, perm[i, index_pointer], perm[i, j]] = \
                               data[sparse_row_pointer]

                    # Only increment the index pointer
                    index_pointer += 1
                elif indices[sparse_row_pointer] > P[i, perm[i, index_pointer]]:
                    # Need to increment index pointer
                    index_pointer += 1
                else:
                    # Need to increment sparse row pointer
                    sparse_row_pointer += 1

Parallel Version

Below is a parallel version, although it doesn't seem to provide any speedup and the code no longer looks as nice:

# See https://stackoverflow.com/questions/48805636/efficient-slicing-of-symmetric-sparse-matrices
cimport cython
import numpy as np
cimport numpy as np
from libc.stdlib cimport malloc, free
from cython.parallel import prange

@cython.boundscheck(False) # turn off bounds-checking for entire function
cpdef sparse_slice_fast_cy(sigma,
                           np.ndarray[np.int32_t, ndim=2] P,
                           np.float64_t[:,:,:] sub_matrices,
                           int symmetric):
    """
    Inputs:
        sigma: A list (N,) of sparse sp.csr_matrix (m x m)
        P: A 2D array of integers (N, k)
        sub_matrices: A 3D array of doubles (N, k, k) containing the slicing
        symmetric: 1 if the sigma matrices are symmetric
    """
    # Create variables for keeping code tidy
    cdef np.int32_t N = P.shape[0]
    cdef np.int32_t k = P.shape[1]

    cdef np.int32_t i
    cdef np.int32_t j
    cdef np.int32_t index_pointer 
    cdef np.int32_t sparse_row_pointer

    # Create objects for holding sparse matrix data
    cdef np.float64_t[:] data_mem_view
    cdef np.int32_t[:] indices_mem_view
    cdef np.int32_t[:] indptr_mem_view

    cdef np.float64_t **data = <np.float64_t **> malloc(N * sizeof(np.float64_t *))
    cdef np.int32_t **indices = <np.int32_t **> malloc(N * sizeof(np.int32_t *))
    cdef np.int32_t **indptr = <np.int32_t **> malloc(N * sizeof(np.int32_t *))

    for i in range(N):
        data_mem_view = sigma[i].data
        data[i] = &(data_mem_view[0])

        indices_mem_view = sigma[i].indices
        indices[i] = &(indices_mem_view[0])

        indptr_mem_view = sigma[i].indptr
        indptr[i] = &(indptr_mem_view[0])

    # Object for the ordered P
    cdef np.int32_t[:,:] perm = np.argsort(P, axis=1).astype(np.int32)

    # Make sure sub_matrices is all 0
    sub_matrices[:] = 0

    for i in prange(N, nogil=True):
        for j in range(k):
            # Walk along row P[i, perm[i, j]] of sigma[i], searching for the
            # columns listed in P[i, :], i.e. compare
            #     sigma[i][P[i, perm[i, j]], :]
            # against
            #     P[i, :]
            # To do this we need our sparse row vector with columns
            #     indices[i][indptr[i][P[i, perm[i, j]]]:indptr[i][P[i, perm[i, j]] + 1]]
            # and data/values
            #     data[i][indptr[i][P[i, perm[i, j]]]:indptr[i][P[i, perm[i, j]] + 1]]
            # which comes from the csr matrix format.
            # We also need our sorted indexing vector
            #     P[i, perm[i, :]]

            # We begin by pointing at the top of both
            # our vectors and gradually move down them. In the event of 
            # an equality we add the data to sub_matrices[i,:,:] and 
            # increment the INDEXING VECTOR pointer, not the sparse
            # row vector pointer, as there can be multiple values that 
            # are the same in the indexing vector but not the sparse row
            # column vector (only 1 column can appear in 1 row!).

            if symmetric:
                index_pointer = j  # Only search upper triangular
            else:
                index_pointer = 0
            sparse_row_pointer = indptr[i][P[i, perm[i, j]]]

            while ((index_pointer < k) and 
                   (sparse_row_pointer < indptr[i][P[i, perm[i, j]] + 1])):
                if indices[i][sparse_row_pointer] == P[i, perm[i, index_pointer]]:
                    # We can add data to sub_matrices
                    sub_matrices[i, perm[i, j], perm[i, index_pointer]] = \
                           data[i][sparse_row_pointer]

                    if symmetric:
                        sub_matrices[i, perm[i, index_pointer], perm[i, j]] = \
                               data[i][sparse_row_pointer]

                    # Only increment the index pointer
                    index_pointer = index_pointer + 1
                elif indices[i][sparse_row_pointer] > P[i, perm[i, index_pointer]]:
                    # Need to increment index pointer
                    index_pointer = index_pointer + 1
                else:
                    # Need to increment sparse row pointer
                    sparse_row_pointer = sparse_row_pointer + 1

    # Free malloc'd data
    free(data)
    free(indices)
    free(indptr)

Test

To test the code, run

cythonize -i sparse_slice.pyx

where sparse_slice.pyx is the filename. Then you can use this script:

import time
import numpy as np
import scipy as sp
import scipy.sparse
from sparse_slice import sparse_slice_fast_cy

k = 100
N = 20000
m = 10000
samples = 20

# Create sigma matrices
## The sampling of random sparse takes a while so just do a few and 
## then populate with these.
now = time.time()
sigma_samples = []
for i in range(samples):
    sigma_samples.append(sp.sparse.rand(m, m, density=0.001, format='csr'))
    sigma_samples[-1] = sigma_samples[-1] + sigma_samples[-1].T  # Symmetric

## Now make the sigma list from these.
sigma = []
for i in range(N):
    j = np.random.randint(samples)
    sigma.append(sigma_samples[j])
print('Time to make sigma: {}'.format(time.time() - now))

# Create indexer
now = time.time()
P = np.empty([N, k]).astype(int)
for i in range(N):
    P[i, :] = np.random.choice(np.arange(m), k, replace=True)
print('Time to make P: {}'.format(time.time() - now))

# Create objects for holding the slices
sub_matrices_slow = np.empty([N, k, k])
sub_matrices_fast = np.empty([N, k, k])

# Run both slicings
## Slow
now = time.time()
for i in range(N):
    sub_matrices_slow[i,:,:] = sigma[i][np.ix_(P[i,:], P[i,:])].todense()
print('Time to make sub_matrices_slow: {}'.format(time.time() - now))

## Fast
symmetric = 1
now = time.time()
sparse_slice_fast_cy(sigma, P.astype(np.int32), sub_matrices_fast, symmetric)
print('Time to make sub_matrices_fast: {}'.format(time.time() - now))

assert(np.all((sub_matrices_slow - sub_matrices_fast)**2 < 1e-6))

Answer 1

Cannot test right now, but there are two suggestions:

A) sort all rows at once outside of the i-loop:

# Object for the ordered P
cdef long[:,:] perm = np.argsort(P, axis=1)

Maybe you will need to pass P as np.ndarray[np.int64_t, ndim=2] P (or whatever type it is) to avoid copying. You will have to access the data via perm[i,X] instead of perm[X].

B) define
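As a small illustration of suggestion A, the following sketch compares one batched np.argsort call with the per-row calls it replaces. The array sizes and the seed are arbitrary, and kind="stable" is used here only so that the two results are guaranteed to agree when a row contains repeated values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-in for the (N, k) indexing array P.
P = rng.integers(0, 1000, size=(5, 8)).astype(np.int32)

# One argsort call per row, as in the original i-loop.
perm_rowwise = np.stack(
    [np.argsort(P[i, :], kind="stable") for i in range(P.shape[0])]
)

# A single call that sorts every row at once.
perm_batched = np.argsort(P, axis=1, kind="stable")

# Row i of P in sorted order is P[i, perm_batched[i, :]].
P_sorted = np.take_along_axis(P, perm_batched, axis=1)
```

Note that np.argsort returns platform-sized integer indices regardless of the dtype of P, which is why the dtype used to declare perm matters for avoiding an extra conversion.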

cdef np.int32_t[:] indices
cdef np.int32_t[:] indptr

so you don't need to copy the data via .astype, i.e.

for i in range(N):
    data     = sigma[i].data
    indices  = sigma[i].indices
    indptr   = sigma[i].indptr

I think that, because sigma[i] has O(m) elements, the copying is the bottleneck of your function: you get a running time of O(N*(m+k^2)) instead of O(N*k^2), so it is good to avoid it.

Otherwise the function doesn't look too bad.

To get prange to work with the i-loop, you should move the accesses to sigma[i] outside of the loop by creating arrays of pointers to the first elements of data, indices and indptr and populating them in a cheap preprocessing step. One can make it work, but the question is how much is gained from the parallelization. It might well be that the problem is memory-bound; one has to see timings.

You could also use the symmetry by processing only the upper triangle of the matrix:
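To make the copying cost concrete, here is a minimal NumPy-only sketch. The small int32 array is just a stand-in for sigma[i].indices, on the assumption that the index arrays are int32, as the np.int32_t declarations above imply.

```python
import numpy as np

# Stand-in for sigma[i].indices (assumed int32).
indices = np.arange(6, dtype=np.int32)

# Converting to a different dtype allocates a new buffer: O(len) work per call.
as_int64 = indices.astype(np.int64)
copies = not np.shares_memory(indices, as_int64)

# When the dtype already matches, copy=False hands back the same buffer.
same_type = indices.astype(np.int32, copy=False)
no_copy = np.shares_memory(indices, same_type)
```

Declaring the memoryviews with the dtype the arrays already have avoids the conversion entirely, which is the point of suggestion B.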

  ...
  index_pointer = j #only upper triangle!
  ....
  ....
     # We can add data to sub_matrices
     #upper triangle sub-matrix:
     sub_matrices[i, perm[j], perm[index_pointer]] = \
                       data[sparse_row_pointer]
     #lower triangle sub-matrix:
     sub_matrices[i, perm[index_pointer], perm[j]] = \
                       data[sparse_row_pointer]
  ....

I would start with B) and see how it works out...

Edit:

On memory usage: one can measure the peak memory usage via
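As a rough illustration of the upper-triangle idea, independent of the sparse format: the sketch below fills a k x k block by looking up only the entries with b >= a and mirroring each one. The function fill_symmetric and the entry callback are hypothetical names used only for this example.

```python
def fill_symmetric(k, entry):
    """Fill a k x k block from its upper triangle only.

    entry(a, b) is a hypothetical lookup returning the value at (a, b).
    """
    out = [[0.0] * k for _ in range(k)]
    lookups = 0
    for a in range(k):
        for b in range(a, k):      # b >= a: upper triangle only
            value = entry(a, b)
            lookups += 1
            out[a][b] = value      # upper triangle
            out[b][a] = value      # mirrored lower triangle
    return out, lookups

dense = [[1.0, 2.0, 0.0],
         [2.0, 0.0, 3.0],
         [0.0, 3.0, 4.0]]
block, lookups = fill_symmetric(3, lambda a, b: dense[a][b])
```

This needs k*(k+1)/2 lookups instead of k*k, so at best it roughly halves the search work; the number of writes into sub_matrices stays about the same.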

 /usr/bin/time -f "peak_used_memory:%M(in Kb)" python test.py

I ran my tests with N=2000 and got (python3.6+cython0.27.1):

                             peak memory usage
only slow                       245Mb
only fast                       245Mb
slow+fast no check              402Mb
slow+fast+assert                576Mb

So there is 50Mb overhead, 200Mb used by either function and an additional 176Mb for evaluating the assert. I see the same behavior for other values of N.

So I would say there is no huge memory usage by cython.

This task is very probably (at least partly) memory bound, so the parallelization will not help much. You should reduce the amount of memory loaded into the cache.

One possibility is not to use perm; after all, it also needs to be loaded into the cache. You could do it if

you can live with any row/col permutation in matrix sigma; then just sort P and use it.
there are very few elements per row, so a linear search for every element would be OK.
you do a binary search for every element.

I guess you could win about 20-30% in the best case.

Sometimes cython produces code which is not easy for the C compiler to optimize, and one often achieves better results by writing directly in C and then wrapping it with Python.

But I would do all that only if this operation is really, really the bottleneck of your program.

By the way, declaring
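The binary-search variant can be sketched in plain Python as follows. It needs no perm at all, at the cost of one search per requested entry. The function name slice_csr_bisect and the small 4x4 matrix are illustrative only, and the sketch assumes sorted column indices within each CSR row.

```python
from bisect import bisect_left

def slice_csr_bisect(data, indices, indptr, p):
    """Dense block sigma[p, :][:, p], using one binary search per entry.

    Assumes column indices are sorted within each CSR row.
    """
    k = len(p)
    out = [[0.0] * k for _ in range(k)]
    for a in range(k):
        lo, hi = indptr[p[a]], indptr[p[a] + 1]   # extent of row p[a]
        for b in range(k):
            pos = bisect_left(indices, p[b], lo, hi)
            if pos < hi and indices[pos] == p[b]:
                out[a][b] = data[pos]
    return out

# Hypothetical 4x4 symmetric matrix and its CSR arrays.
dense = [[1, 0, 2, 0],
         [0, 0, 3, 0],
         [2, 3, 0, 4],
         [0, 0, 4, 5]]
data = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 4.0, 5.0]
indices = [0, 2, 2, 0, 1, 3, 2, 3]
indptr = [0, 2, 3, 6, 8]

p = [3, 1, 2]  # unsorted on purpose: no perm is required
block = slice_csr_bisect(data, indices, indptr, p)
```

Each lookup costs a logarithmic number of comparisons in the number of stored entries of the row, so whether this beats the merge depends on k and on how full the rows are; that would have to be measured.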

cdef np.int64_t[:,:] perm = np.argsort(P, axis=1)

you will not need additional copying.
