Speed up operations on a for loop with `coo` matrix and a numpy array in python

I have a numpy array and a coo matrix. I need to update the numpy array based on elements in the coo matrix. Both the numpy array and the matrix are very large; here is what they look like:

 graph_array = [[  1.0   1.0   5.0  9.0]
 [  2.0   5.0   6.0   5.0]
 [  3.0   5.0   7.0   6.0]]

matrix_coo = (1, 5) 0.5
(2, 8)  0.4
(5, 7)  0.8

What I need to do is as follows:

If the second and third elements of each list within the array, i.e. list_graph[i][1] and list_graph[i][2] (which could be 1,5 , 5,6 or 5,7 ), are equal to a row and column pair in the coo matrix such as (1, 5), (2, 8) or (5, 7), then the value associated with that pair (for (1, 5) this equals 0.5 ) must replace the fourth element in that list.

My expected output would thus be:

output_array = [[  1.0   1.0   5.0  0.5]
[  2.0   5.0   6.0   5.0]
[  3.0   5.0   7.0   0.8]]

The current code I am using is as follows:

row_idx = list(matrix_coo.row)
col_idx = list(matrix_coo.col)
data_idx = list(matrix_coo.data)

x = 0
while x < len(row_cost_idx):
    for m in graph_array:
        if m[1] == row_idx[x]:
            if m[2] == col_idx[x]:
                m[3] = data_idx[x]
    x += 1

It does give me the correct output, but because the array has 21596 items and the matrix has 21596 rows, it takes a very long time.

Is there a faster way of doing this?

Your iteration is a pure Python list operation. The fact that row_idx originated as an attribute of a coo_matrix makes no difference.

What is row_cost_idx ? If it is the same as row_idx, the loop could be cleaned up a bit with:

for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
    for m in graph_array: # not list_graph?
        if m[1:3]==[r,c]:  # compare the second and third elements
            m[3] = d

I think the iteration is the same, but I haven't tested it. I'm not sure about speed either.

The double iteration, over the nonzero elements of matrix_coo and the sublists of graph_array, is bound to be slow, simply because you are doing very many iterations.

If graph_array were a numpy array, we could test all rows at once, with something like

mask = (graph_array[:, 1:3]==[r,c]).all(axis=1)
graph_array[mask,3] = d

where mask would be True for the rows of graph_array that have the right indexes (again, this isn't tested).
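To make the mask idea concrete, here is a minimal sketch for a single (r, c, d) triple, using the sample data from the question. It assumes graph_array is a float numpy array and that the pair to match sits in columns 1 and 2 (the second and third elements); the intermediate name pair_matches is only for illustration.

```python
import numpy as np

graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])

r, c, d = 5, 7, 0.8  # one (row, col, value) triple, as matrix_coo would yield

# Broadcast the comparison: every row's columns 1 and 2 against (r, c).
pair_matches = graph_array[:, 1:3] == [r, c]   # boolean array, one row per graph row
mask = pair_matches.all(axis=1)                # True only where both columns match

# Boolean indexing updates column 3 for all matching rows at once.
graph_array[mask, 3] = d

print(mask)
print(graph_array)
```

The inner Python loop over the rows of graph_array disappears; only the loop over the entries of matrix_coo would remain.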

To get more speed I'd cast both graph_array and matrix_coo as 2d numpy (dense) arrays, and see if I can solve the problem with a few array operations. Insights from that might help me replace the matrix_coo iteration.
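One way the dense idea might look is sketched below. This is not tested against the real data and rests on several assumptions: the values in columns 1 and 2 are whole numbers that can be used as indices, a dense copy of the matrix fits in memory (for a matrix with tens of thousands of rows and columns it may not), and a zero in the dense array means "no entry for this pair". The function name dense_lookup is hypothetical.

```python
import numpy as np
from scipy import sparse

graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])
matrix_coo = sparse.coo_matrix(([0.5, 0.4, 0.8], ([1, 2, 5], [5, 8, 7])))

def dense_lookup(graph_array, matrix_coo):
    """Hypothetical helper: replace column 3 with one indexed lookup, no Python loop."""
    dense = matrix_coo.toarray()              # assumption: small enough to hold in memory
    rows = graph_array[:, 1].astype(int)      # assumption: integer-valued floats
    cols = graph_array[:, 2].astype(int)

    # Pairs outside the matrix shape cannot match any entry.
    in_bounds = ((rows >= 0) & (rows < dense.shape[0]) &
                 (cols >= 0) & (cols < dense.shape[1]))

    found = np.zeros(len(graph_array))
    found[in_bounds] = dense[rows[in_bounds], cols[in_bounds]]

    # Assumption: zero means "no entry", so explicitly stored zeros are ignored.
    hit = found != 0
    result = graph_array.copy()
    result[hit, 3] = found[hit]
    return result

print(dense_lookup(graph_array, matrix_coo))
```

Note that toarray() sums duplicate (row, col) entries, whereas the loop versions keep the last one, so the results could differ if the matrix contains duplicates.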

=========================

Tested code

import numpy as np
from scipy import sparse

graph_array = np.array([[  1.0,   1.0,   5.0 , 9.0],
 [  2.0,   5.0 ,  6.0  , 5.0],
 [  3.0  , 5.0 ,  7.0 ,  6.0]])

r,c,d = [1,2,5], [5,8,7],[0.5,0.4,0.8]
matrix_coo = sparse.coo_matrix((d,(r,c)))

def org(graph_array, matrix_coo):
    row_idx = list(matrix_coo.row)
    col_idx = list(matrix_coo.col)
    data_idx = list(matrix_coo.data)

    x = 0
    while x < len(row_idx):
        for m in graph_array:
            if m[1] == row_idx[x]:
                if m[2] == col_idx[x]:
                    m[3] = data_idx[x]
        x += 1
    return graph_array

new_array = org(graph_array.copy(), matrix_coo)    
print(graph_array)
print(new_array)

def alt(graph_array, matrix_coo):
    for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        for m in graph_array: 
            if (m[[1,2]]==[r,c]).all():  # array test
                m[3] = d
    return graph_array

new_array = alt(graph_array.copy(), matrix_coo)    
print(new_array)

def altlist(graph_array, matrix_coo):
    for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        for m in graph_array:
            if (m[1:3]==[r,c]):   # list test
                m[3] = d
    return graph_array

new_array = altlist(graph_array.tolist(), matrix_coo)    
print(new_array)

def altarr(graph_array, matrix_coo):
    for r,c,d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        mask = (graph_array[:, 1:3]==[r,c]).all(axis=1)
        graph_array[mask,3] = d
    return graph_array

new_array = altarr(graph_array.copy(), matrix_coo)
print(new_array)

run

0909:~/mypy$ python3 stack3727173.py 
[[ 1.  1.  5.  9.]
 [ 2.  5.  6.  5.]
 [ 3.  5.  7.  6.]]
[[ 1.   1.   5.   0.5]
 [ 2.   5.   6.   5. ]
 [ 3.   5.   7.   0.8]]
[[ 1.   1.   5.   0.5]
 [ 2.   5.   6.   5. ]
 [ 3.   5.   7.   0.8]]
[[1.0, 1.0, 5.0, 0.5], [2.0, 5.0, 6.0, 5.0], [3.0, 5.0, 7.0, 0.80000000000000004]]
[[ 1.   1.   5.   0.5]
 [ 2.   5.   6.   5. ]
 [ 3.   5.   7.   0.8]]

For this small example, your function is fastest. It also works with both list and array. For small inputs, list operations are often faster than array ones, so using array operations just to compare 2 numbers is not an improvement.

Replicating graph_array 1000x, the altarr version is 10x faster than your code. It's performing array operations on the largest dimension. I haven't tried to increase the size of matrix_coo.
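A comparison along those lines could be set up roughly as follows. This is a sketch, not the script used for the numbers above: it assumes np.tile for the replication and timeit for the measurement, and the timings will vary by machine, so none are quoted here. The org function is written with zip rather than the original while loop, but does the same comparisons.

```python
import timeit
import numpy as np
from scipy import sparse

small = np.array([[1.0, 1.0, 5.0, 9.0],
                  [2.0, 5.0, 6.0, 5.0],
                  [3.0, 5.0, 7.0, 6.0]])
matrix_coo = sparse.coo_matrix(([0.5, 0.4, 0.8], ([1, 2, 5], [5, 8, 7])))

# Replicate the 3 sample rows 1000 times (3000 rows in total).
big = np.tile(small, (1000, 1))

def org(graph_array, matrix_coo):
    # Same comparisons as the question's double loop, one row at a time.
    for r, c, d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        for m in graph_array:
            if m[1] == r and m[2] == c:
                m[3] = d
    return graph_array

def altarr(graph_array, matrix_coo):
    # One whole-array mask per matrix entry.
    for r, c, d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        mask = (graph_array[:, 1:3] == [r, c]).all(axis=1)
        graph_array[mask, 3] = d
    return graph_array

# Copy inside the timed call so every run starts from the same data.
t_org = timeit.timeit(lambda: org(big.copy(), matrix_coo), number=3)
t_arr = timeit.timeit(lambda: altarr(big.copy(), matrix_coo), number=3)
print(f"org:    {t_org:.4f} s for 3 runs")
print(f"altarr: {t_arr:.4f} s for 3 runs")
```

Growing matrix_coo as well would show how the remaining Python loop over its entries scales, which is the part not measured above.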
