Speed up operations on a for loop with `coo` matrix and a numpy array in python

Question

I have a numpy array and a coo matrix. I need to update the numpy array based on elements in the coo matrix. Both the numpy array and the matrix are very large. Here is what they look like:

graph_array = [[  1.0   1.0   5.0   9.0]
 [  2.0   5.0   6.0   5.0]
 [  3.0   5.0   7.0   6.0]]

matrix_coo = (1, 5) 0.5
(2, 8)  0.4
(5, 7)  0.8

What I need to do is as follows:

If the second and third elements of a list within the array, i.e. list_graph[i][1] and list_graph[i][2] (which could be (1, 5), (5, 6) or (5, 7)), are equal to a row and column pair in the coo matrix such as (1, 5), (2, 8) or (5, 7), then the value associated with that pair (for (1, 5) this equals 0.5) must replace the fourth element of that list.

My expected output would thus be:

output_array = [[  1.0   1.0   5.0  0.5]
[  2.0   5.0   6.0   5.0]
[  3.0   5.0   7.0   0.8]]

The current code I am using is as follows:

row_idx = list(matrix_coo.row)
col_idx = list(matrix_coo.col)
data_idx = list(matrix_coo.data)

x = 0
while x < len(row_cost_idx):
    for m in graph_array:
        if m[1] == row_idx[x]:
            if m[2] == col_idx[x]:
                m[3] = data_idx[x]
    x += 1

It does give me the correct output, but because the array has 21596 items and the matrix has 21596 rows, it takes a very long time.

Is there a faster way of doing this?

Answer 1

Your iteration is a pure Python list operation. The fact that row_idx originated as an attribute of a coo_matrix doesn't matter.

What is row_cost_idx? If it is the same as row_idx, the loop could be cleaned up a bit with:

for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
    for m in graph_array: # not list_graph?
        if m[1:3]==[r,c]:   # second and third elements
            m[3] = d

I think the iteration is the same, but haven't tested it. I'm not sure about speed either.

The double iteration, over nonzero elements of matrix_coo and sublists of graph_array is bound to be slow, simply because you are doing very many iterations.
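One way to avoid the inner loop entirely, sketched here as an assumption rather than something taken from the question, is to build a dictionary keyed by the (row, col) pairs once and then make a single pass over graph_array. The function name `update_with_dict` and its parameters are hypothetical; `rows`, `cols` and `data` stand for `matrix_coo.row`, `matrix_coo.col` and `matrix_coo.data`.

```python
def update_with_dict(graph_rows, rows, cols, data):
    """Replace element 3 of each row whose (element 1, element 2) pair
    appears among the (rows, cols) pairs. Hypothetical helper."""
    # Build the lookup once. If a pair occurs more than once, the last
    # value wins, which matches the behaviour of the original loop.
    lookup = {(int(r), int(c)): float(d) for r, c, d in zip(rows, cols, data)}

    for m in graph_rows:
        # Equal numbers hash equally in Python, so the float pair
        # (1.0, 5.0) finds the integer key (1, 5).
        key = (m[1], m[2])
        if key in lookup:
            m[3] = lookup[key]
    return graph_rows


graph_list = [[1.0, 1.0, 5.0, 9.0],
              [2.0, 5.0, 6.0, 5.0],
              [3.0, 5.0, 7.0, 6.0]]
print(update_with_dict(graph_list, [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8]))
```

This does one dictionary lookup per row instead of one comparison per (row, nonzero element) pair, so the work grows with the sum of the two sizes rather than their product.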

If graph_array were a numpy array, we could test all rows at once, with something like

mask = (graph_array[:, 1:3]==[r,c]).all(axis=1)
graph_array[mask,3] = d

where mask would be True for the rows of graph_array that have the right indexes. (Again, this isn't tested.)

To get more speed I'd cast both graph_array and matrix_coo as 2d numpy (dense) arrays, and see if I can solve the problem with a few array operations. Insights from that might help me replace the matrix_coo iteration.
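A minimal sketch of that dense idea follows, under stated assumptions: the matrix shape is small enough that a dense array fits in memory, and the index columns of graph_array hold non-negative integer values stored as floats. The function name `dense_lookup` and its parameters are hypothetical; `row`, `col`, `data` and `shape` stand for the corresponding attributes of matrix_coo.

```python
import numpy as np


def dense_lookup(graph_array, row, col, data, shape):
    """Update column 3 of graph_array from a dense copy of the sparse
    entries. Hypothetical helper; memory use is shape[0] * shape[1]."""
    out = np.array(graph_array, dtype=float)  # work on a copy

    # Dense table of values plus a boolean table marking stored entries,
    # so that explicitly stored zeros are still treated as matches.
    values = np.zeros(shape, dtype=float)
    stored = np.zeros(shape, dtype=bool)
    values[row, col] = data
    stored[row, col] = True

    rows = out[:, 1].astype(np.int64)
    cols = out[:, 2].astype(np.int64)

    # Only integer-valued pairs inside the matrix bounds can match.
    valid = ((out[:, 1] == rows) & (out[:, 2] == cols)
             & (rows >= 0) & (rows < shape[0])
             & (cols >= 0) & (cols < shape[1]))

    hit = np.zeros(len(out), dtype=bool)
    hit[valid] = stored[rows[valid], cols[valid]]
    out[hit, 3] = values[rows[hit], cols[hit]]
    return out


graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])
print(dense_lookup(graph_array, [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8], (6, 9)))
```

For a matrix with very large dimensions the dense tables may not fit in memory, in which case this sketch is not suitable.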

=========================

Tested code

import numpy as np
from scipy import sparse

graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])

r,c,d = [1,2,5], [5,8,7],[0.5,0.4,0.8]
matrix_coo = sparse.coo_matrix((d,(r,c)))

def org(graph_array, matrix_coo):
    row_idx = list(matrix_coo.row)
    col_idx = list(matrix_coo.col)
    data_idx = list(matrix_coo.data)

    x = 0
    while x < len(row_idx):
        for m in graph_array:
            if m[1] == row_idx[x]:
                if m[2] == col_idx[x]:
                    m[3] = data_idx[x]
        x += 1
    return graph_array

new_array = org(graph_array.copy(), matrix_coo)    
print(graph_array)
print(new_array)

def alt(graph_array, matrix_coo):
    for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        for m in graph_array: 
            if (m[[1,2]]==[r,c]).all():  # array test
                m[3] = d
    return graph_array

new_array = alt(graph_array.copy(), matrix_coo)    
print(new_array)

def altlist(graph_array, matrix_coo):
    for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        for m in graph_array:
            if (m[1:3]==[r,c]):   # list test
                m[3] = d
    return graph_array

new_array = altlist(graph_array.tolist(), matrix_coo)    
print(new_array)

def altarr(graph_array, matrix_coo):
    for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        mask = (graph_array[:, 1:3]==[r,c]).all(axis=1)
        graph_array[mask,3] = d
    return graph_array

new_array = altarr(graph_array.copy(), matrix_coo)
print(new_array)

run

0909:~/mypy$ python3 stack3727173.py 
[[ 1.  1.  5.  9.]
 [ 2.  5.  6.  5.]
 [ 3.  5.  7.  6.]]
[[ 1.   1.   5.   0.5]
 [ 2.   5.   6.   5. ]
 [ 3.   5.   7.   0.8]]
[[ 1.   1.   5.   0.5]
 [ 2.   5.   6.   5. ]
 [ 3.   5.   7.   0.8]]
[[1.0, 1.0, 5.0, 0.5], [2.0, 5.0, 6.0, 5.0], [3.0, 5.0, 7.0, 0.80000000000000004]]
[[ 1.   1.   5.   0.5]
 [ 2.   5.   6.   5. ]
 [ 3.   5.   7.   0.8]]

For this small example, your function is fastest. It also works with both lists and arrays. For small inputs, list operations are often faster than array ones, so using array operations just to compare 2 numbers is not an improvement.

With graph_array replicated 1000x, the altarr version is 10x faster than your code, because it performs array operations on the largest dimension. I haven't tried increasing the size of matrix_coo.
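If matrix_coo is also large, the remaining Python loop over its nonzero elements could in principle be removed as well. The following is an untimed sketch, not part of the tested code above: it encodes each (row, col) pair as a single integer key and matches the keys with `np.searchsorted`. It assumes the index columns of graph_array hold non-negative integer values stored as floats; the function name `update_vectorized` and its parameters are hypothetical.

```python
import numpy as np


def update_vectorized(graph_array, row, col, data):
    """Update column 3 of graph_array without a Python-level loop.
    row, col, data stand for matrix_coo.row, .col and .data."""
    out = np.array(graph_array, dtype=float)  # work on a copy
    row = np.asarray(row, dtype=np.int64)
    col = np.asarray(col, dtype=np.int64)
    data = np.asarray(data, dtype=float)
    if row.size == 0 or out.shape[0] == 0:
        return out

    g_row = out[:, 1].astype(np.int64)
    g_col = out[:, 2].astype(np.int64)
    # Rows with negative or non-integer index values can never match.
    valid = ((out[:, 1] == g_row) & (out[:, 2] == g_col)
             & (g_row >= 0) & (g_col >= 0))

    # One integer key per pair; width exceeds every column index in use.
    width = int(max(col.max(), g_col.max())) + 1
    coo_keys = row * width + col
    g_keys = g_row * width + g_col

    # A stable sort keeps duplicate pairs in their original order, so
    # side="right" minus one selects the last duplicate, as the loop does.
    order = np.argsort(coo_keys, kind="stable")
    sorted_keys = coo_keys[order]
    pos = np.searchsorted(sorted_keys, g_keys, side="right") - 1
    pos = np.maximum(pos, 0)

    found = valid & (sorted_keys[pos] == g_keys)
    out[found, 3] = data[order[pos[found]]]
    return out


graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])
print(update_vectorized(graph_array, [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8]))
```

The cost is dominated by sorting the keys, so it should scale much better than the double loop when both inputs have tens of thousands of entries, though that would need to be timed on the real data.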

solution1
1 ACCEPTED 2016-05-17 18:24:24