
Compute sparse transitive closure of scipy sparse matrix

I want to compute the transitive closure of a sparse matrix in Python. Currently I am using scipy sparse matrices.

The matrix power (`**12` in my case) works well on very sparse matrices, no matter how large they are, but for directed, not-so-sparse cases I would like to use a smarter algorithm.
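For reference, the fixed-exponent matrix power can be replaced by squaring to a fixpoint, which needs only O(log n) sparse multiplications instead of n. A minimal sketch, assuming the relation is reflexive (a full diagonal makes each squaring double the covered path length); the function name is mine:

```python
import numpy as np
import scipy.sparse as sparse

def closure_by_squaring(A):
    """Transitive closure of a reflexive boolean relation by repeated squaring."""
    R = (A != 0).astype(np.int64)  # 0/1 pattern; wide dtype avoids overflow in products
    while True:
        # entry (i, j) of R @ R is nonzero iff some intermediate k links i to j
        R2 = ((R @ R) != 0).astype(np.int64)
        if (R2 != R).nnz == 0:     # no new pairs appeared: fixpoint reached
            return R2.astype(bool)
        R = R2

# the reflexive 4-by-4 example from below
A = sparse.csr_matrix(np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1]], dtype=np.uint8))
C = closure_by_squaring(A)
```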

I have found the Floyd-Warshall algorithm (the German Wikipedia page has better pseudocode) in scipy.sparse.csgraph, which does a bit more than it should: there is no function for Warshall's algorithm alone - that is one thing.

The main problem is that while I can pass a sparse matrix to the function, doing so is utterly pointless: the function always returns a dense matrix, because what should be 0 in the transitive closure is instead a path of infinite length, and someone felt this needs to be stored explicitly.

So my question is: is there any Python module that allows computing the transitive closure of a sparse matrix and keeps it sparse?

I am not 100% sure that he works with the same kind of matrices, but Gerald Penn shows impressive speed-ups in his comparison paper, which suggests that the problem is solvable.


EDIT: Since there was some confusion, let me point out the theoretical background:

I am looking for the transitive closure (not the reflexive or symmetric closure).

I will make sure that the relation encoded in my boolean matrix has the properties that are required, i.e. symmetry or reflexivity.

I have two cases of the relation:

  1. reflexive
  2. reflexive and symmetric


I want to apply the transitive closure to those two relations. This works perfectly well with the matrix power (except that in certain cases it is too expensive):

>>> reflexive
matrix([[ True,  True, False,  True],
        [False,  True,  True, False],
        [False, False,  True, False],
        [False, False, False,  True]])
>>> reflexive**4
matrix([[ True,  True,  True,  True],
        [False,  True,  True, False],
        [False, False,  True, False],
        [False, False, False,  True]])
>>> reflexive_symmetric
matrix([[ True,  True, False,  True],
        [ True,  True,  True, False],
        [False,  True,  True, False],
        [ True, False, False,  True]])
>>> reflexive_symmetric**4
matrix([[ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True]])

So in the first case we get all the descendants of a node (including itself), and in the second we get all the components, that is, all the nodes that are in the same component.


This was brought up on the SciPy issue tracker. The problem is not so much the output format; the implementation of Floyd-Warshall begins with a matrix full of infinities and then inserts finite values as paths are found. Sparsity is lost immediately.

The networkx library offers an alternative with its all_pairs_shortest_path_length. Its output is an iterator which yields tuples of the form

(source, dictionary of reachable targets) 

which takes a little work to convert into a SciPy sparse matrix (csr format is natural here). A complete example:

import numpy as np
import networkx as nx
import scipy.stats as stats
import scipy.sparse as sparse

A = sparse.random(6, 6, density=0.2, format='csr',
                  data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
G = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)  # directed because A need not be symmetric
paths = nx.all_pairs_shortest_path_length(G)
indices = []
indptr = [0]
for row in paths:
  # keep only targets reached by a path of length > 0 (drop the trivial self-path)
  reachable = [v for v in row[1] if row[1][v] > 0]
  indices.extend(reachable)
  indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = A + sparse.csr_matrix((data, indices, indptr), shape=A.shape)
print(A, "\n\n", A_trans)

The reason for adding A back is as follows. The networkx output includes paths of length 0, which would immediately fill the diagonal. We don't want that to happen (you asked for the transitive closure, not the reflexive-and-transitive closure); hence the line reachable = [v for v in row[1] if row[1][v] > 0]. But then we don't get any diagonal entries at all, even where A had them (the 0-length empty path beats the 1-length path formed by a self-loop). So I add A back to the result. Its entries are now 1 or 2, but only the fact that they are nonzero is of significance.

An example of running the above (I pick 6 by 6 size for readability of the output). Original matrix:

  (0, 3)    1
  (3, 2)    1
  (4, 3)    1
  (5, 1)    1
  (5, 3)    1
  (5, 4)    1
  (5, 5)    1 

Transitive closure:

  (0, 2)    1
  (0, 3)    2
  (3, 2)    2
  (4, 2)    1
  (4, 3)    2
  (5, 1)    2
  (5, 2)    1
  (5, 3)    2
  (5, 4)    2
  (5, 5)    1

You can see that this worked correctly: the added entries are (0, 2), (4, 2), and (5, 2), all acquired via the path (3, 2).
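If the mixed 1/2 entries in the closure bother you, the result can be binarized afterwards. A small sketch; the 2-by-2 matrix here is just a stand-in for A_trans:

```python
import numpy as np
import scipy.sparse as sparse

# stand-in for A_trans: entries 1 and 2, only the nonzero pattern matters
A_trans = sparse.csr_matrix(np.array([[2, 1], [0, 2]], dtype=np.uint8))
A_bool = (A_trans != 0).astype(np.uint8)  # collapse every nonzero to 1, keep sparsity
```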


By the way, networkx also has a floyd_warshall method, but its documentation says

This algorithm is most appropriate for dense graphs. The running time is O(n^3), and running space is O(n^2) where n is the number of nodes in G.

The output is dense again; I get the impression this algorithm is simply dense by nature. In contrast, all_pairs_shortest_path_length amounts to a breadth-first search from each source (equivalently, Dijkstra's algorithm with unit edge weights), which is what lets it respect sparsity.
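Since per-source BFS is all that is needed, you can also stay entirely within SciPy using csgraph's breadth_first_order, along the same lines as the networkx version above. A sketch; the function name is mine:

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.csgraph import breadth_first_order

def transitive_closure_bfs(A):
    """Transitive closure of a sparse adjacency matrix via one BFS per node."""
    n = A.shape[0]
    indices = []
    indptr = [0]
    for i in range(n):
        # nodes reachable from i; the BFS start node i itself is included
        order = breadth_first_order(A, i, directed=True,
                                    return_predecessors=False)
        indices.extend(j for j in order if j != i)  # drop the trivial 0-length path
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.uint8)
    C = sparse.csr_matrix((data, indices, indptr), shape=A.shape)
    return C + A  # restore diagonal entries that A itself had (self-loops)

# the 6-by-6 example matrix from above
rows = [0, 3, 4, 5, 5, 5, 5]
cols = [3, 2, 3, 1, 3, 4, 5]
A = sparse.csr_matrix((np.ones(7, dtype=np.uint8), (rows, cols)), shape=(6, 6))
C = transitive_closure_bfs(A)
```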

Transitive and Reflexive

If instead of transitive closure (which is the smallest transitive relation containing the given one) you wanted transitive and reflexive closure (the smallest transitive and reflexive relation containing the given one) , the code simplifies as we no longer worry about 0-length paths.

paths = nx.all_pairs_shortest_path_length(G)  # fresh iterator; the one above is exhausted
indices = []
indptr = [0]
for row in paths:
  indices.extend(row[1])
  indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = sparse.csr_matrix((data, indices, indptr), shape=A.shape)

Transitive, Reflexive, and Symmetric

This means finding the smallest equivalence relation containing the given one. Equivalently, dividing the vertices into connected components. For this you don't need to go to networkx, there is connected_components method of SciPy. Set directed=False there. Example:

import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
from scipy.sparse.csgraph import connected_components
import itertools

A = sparse.random(20, 20, density=0.02, format='csr',
                  data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
n_components, labels = connected_components(A, directed=False)
nonzeros = []
for k in range(n_components):
  idx = np.where(labels == k)[0]                # vertices in component k
  nonzeros.extend(itertools.product(idx, idx))  # all pairs within the component
row = tuple(r for r, c in nonzeros)
col = tuple(c for r, c in nonzeros)
data = np.ones_like(row)
B = sparse.coo_matrix((data, (row, col)), shape=A.shape)

This is what the output of print(B.toarray()) looks like for a random 20 by 20 example:

[[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]
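As an aside, the per-component loop can be avoided with a purely sparse construction: build an n-by-n_components indicator matrix P with P[i, label_of_i] = 1; then P @ P.T has a nonzero exactly where two vertices share a component label. A sketch; the function name is mine:

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.csgraph import connected_components

def equivalence_closure(A):
    """Smallest equivalence relation containing A, via component labels."""
    n = A.shape[0]
    n_comp, labels = connected_components(A, directed=False)
    # indicator matrix: row i has a single 1 in column labels[i]
    P = sparse.csr_matrix((np.ones(n, dtype=np.uint8),
                           (np.arange(n), labels)),
                          shape=(n, n_comp))
    return P @ P.T  # 1 exactly where two vertices share a component

# small example: edges 0-1 and 2-3 among 5 vertices
A = sparse.csr_matrix((np.ones(2, dtype=np.uint8), ([0, 2], [1, 3])),
                      shape=(5, 5))
B = equivalence_closure(A)
```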
