Speed up operations on a for loop with `coo` matrix and a numpy array in python
I have a numpy array and a `coo` matrix. I need to update the numpy array based on elements in the `coo` matrix. Both the numpy array and the matrix are very large; here is what they look like:

    graph_array = [[ 1.0 1.0 5.0 9.0]
                   [ 2.0 5.0 6.0 5.0]
                   [ 3.0 5.0 7.0 6.0]]

    matrix_coo = (1, 5) 0.5
                 (2, 8) 0.4
                 (5, 7) 0.8

What I need to do is as follows:
If the second and third elements of a list within the array, i.e. `list_graph[i][1]` and `list_graph[i][2]` (which could be `1, 5`, `5, 6` or `5, 7`), are equal to a row and column pair in the `coo` matrix, such as `(1, 5)`, `(2, 8)` or `(5, 7)`, then the value associated with that pair (for `(1, 5)` this equals `0.5`) must replace the fourth element of that list.
My expected output would thus be:

    output_array = [[ 1.0 1.0 5.0 0.5]
                    [ 2.0 5.0 6.0 5.0]
                    [ 3.0 5.0 7.0 0.8]]

The current code I am using is as follows:

    row_idx = list(matrix_coo.row)
    col_idx = list(matrix_coo.col)
    data_idx = list(matrix_coo.data)
    x = 0
    while x < len(row_cost_idx):
        for m in graph_array:
            if m[1] == row_idx[x]:
                if m[2] == col_idx[x]:
                    m[3] = data_idx[x]
        x += 1

It does give me the correct output, but because the array has 21596 items and the matrix has 21596 rows, it takes a very long time.
Is there a faster way of doing this?
Your iteration is a pure Python list operation. The fact that `row_idx` originated as an attribute of a `coo_matrix` is not relevant.
It could be cleaned up a bit. What is `row_cost_idx`? If it is the same as `row_idx`, we could do:

    for r, c, d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
        for m in graph_array:   # not list_graph?
            if m[1:3] == [r, c]:   # second and third elements; list comparison
                m[3] = d

I think the iteration is the same, but I haven't tested it. I'm not sure about speed either.
The double iteration, over the nonzero elements of `matrix_coo` and the sublists of `graph_array`, is bound to be slow, simply because you are doing very many iterations. If the matrix has about as many stored entries as the array has rows, that is on the order of 21596 × 21596, or roughly 4.7 × 10^8, inner steps.
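One way to avoid the double iteration altogether, sketched here as an assumption rather than something from the original answer, is to build a dictionary keyed by `(row, col)` once and then make a single pass over the rows. The function name `update_with_lookup` and its argument names are hypothetical; the three sequences stand in for `matrix_coo.row`, `matrix_coo.col` and `matrix_coo.data`.

```python
def update_with_lookup(rows, coo_rows, coo_cols, coo_data):
    """Replace row[3] when (row[1], row[2]) is a stored (row, col) pair."""
    # Build the lookup once. If a pair is stored more than once, the last
    # value wins, which matches the behaviour of the nested loops.
    lookup = {(int(r), int(c)): d
              for r, c, d in zip(coo_rows, coo_cols, coo_data)}
    # Single pass over the rows, one hash lookup per row.
    # Integer-valued floats such as 5.0 hash and compare equal to 5.
    for row in rows:
        key = (row[1], row[2])
        if key in lookup:
            row[3] = lookup[key]
    return rows


graph_list = [[1.0, 1.0, 5.0, 9.0],
              [2.0, 5.0, 6.0, 5.0],
              [3.0, 5.0, 7.0, 6.0]]
result = update_with_lookup(graph_list, [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8])
print(result)
# [[1.0, 1.0, 5.0, 0.5], [2.0, 5.0, 6.0, 5.0], [3.0, 5.0, 7.0, 0.8]]
```

This does work proportional to the number of rows plus the number of stored entries, instead of their product. It is still pure Python, so it says nothing about how a numpy-based version would compare.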
If `graph_array` were a numpy array, we could test all rows at once, with something like:

    mask = (graph_array[:, 1:3] == [r, c]).all(axis=1)   # columns 1 and 2 of every row
    graph_array[mask, 3] = d

where `mask` would be True for the rows of `graph_array` that have the right indexes (again, this isn't tested).
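To make the mask concrete, here is a small illustration on the three-row example from the question, for the single entry `(5, 7) 0.8`. The intermediate name `pairwise` is introduced only for this sketch.

```python
import numpy as np

graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])
r, c, d = 5, 7, 0.8  # one (row, col, value) entry of the matrix

# Compare columns 1 and 2 of every row with [r, c]; the list broadcasts over rows.
pairwise = graph_array[:, 1:3] == [r, c]
# A row matches only if both of its comparisons are True.
mask = pairwise.all(axis=1)
print(mask.tolist())               # [False, False, True]

# Boolean indexing writes d into column 3 of the matching rows only.
graph_array[mask, 3] = d
print(graph_array[:, 3].tolist())  # [9.0, 5.0, 0.8]
```

The second row matches on the row index (5) but not on the column index, so `all(axis=1)` correctly excludes it.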
To get more speed I'd cast both `graph_array` and `matrix_coo` as 2d numpy (dense) arrays, and see if I can solve the problem with a few array operations. Insights from that might help me replace the `matrix_coo` iteration.
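As one possible way of replacing the `matrix_coo` iteration with array operations, the sketch below encodes each `(row, col)` pair as a single integer key and looks all rows up at once with `np.searchsorted`. This is a different technique from the dense-array idea above, not something the answer tested. It assumes that columns 1 and 2 of `graph_array` hold non-negative, integer-valued floats, and the function name `update_vectorized` is hypothetical; in practice the last three arguments would be `matrix_coo.row`, `matrix_coo.col` and `matrix_coo.data`.

```python
import numpy as np


def update_vectorized(graph_array, coo_rows, coo_cols, coo_data):
    """Look up (graph_array[:, 1], graph_array[:, 2]) among the stored pairs."""
    coo_rows = np.asarray(coo_rows, dtype=np.int64)
    coo_cols = np.asarray(coo_cols, dtype=np.int64)
    coo_data = np.asarray(coo_data, dtype=float)
    if coo_rows.size == 0 or graph_array.shape[0] == 0:
        return graph_array

    # Assumption: these columns contain integer-valued, non-negative floats.
    g_rows = graph_array[:, 1].astype(np.int64)
    g_cols = graph_array[:, 2].astype(np.int64)

    # Encode each (row, col) pair as one integer so pairs can be compared directly.
    width = max(coo_cols.max(), g_cols.max()) + 1
    coo_keys = coo_rows * width + coo_cols
    g_keys = g_rows * width + g_cols

    # Sort the stored keys once; a stable sort keeps duplicates in stored order.
    order = np.argsort(coo_keys, kind="stable")
    sorted_keys = coo_keys[order]
    sorted_data = coo_data[order]

    # side="right" minus one points at the last stored occurrence of each key,
    # so duplicates resolve the same way as in the nested loops (last wins).
    pos = np.searchsorted(sorted_keys, g_keys, side="right") - 1
    pos = np.maximum(pos, 0)
    found = sorted_keys[pos] == g_keys

    graph_array[found, 3] = sorted_data[pos[found]]
    return graph_array


graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                        [2.0, 5.0, 6.0, 5.0],
                        [3.0, 5.0, 7.0, 6.0]])
result = update_vectorized(graph_array.copy(), [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8])
print(result[:, 3].tolist())  # [0.5, 5.0, 0.8]
```

The cost is one sort of the stored entries plus one binary search per row, with no Python-level loop. Whether this is actually faster on the real data has not been measured here.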
=========================
Tested code

    import numpy as np
    from scipy import sparse

    graph_array = np.array([[1.0, 1.0, 5.0, 9.0],
                            [2.0, 5.0, 6.0, 5.0],
                            [3.0, 5.0, 7.0, 6.0]])

    r, c, d = [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8]
    matrix_coo = sparse.coo_matrix((d, (r, c)))

    def org(graph_array, matrix_coo):
        # the original code, with row_cost_idx replaced by row_idx
        row_idx = list(matrix_coo.row)
        col_idx = list(matrix_coo.col)
        data_idx = list(matrix_coo.data)
        x = 0
        while x < len(row_idx):
            for m in graph_array:
                if m[1] == row_idx[x]:
                    if m[2] == col_idx[x]:
                        m[3] = data_idx[x]
            x += 1
        return graph_array

    new_array = org(graph_array.copy(), matrix_coo)
    print(graph_array)
    print(new_array)

    def alt(graph_array, matrix_coo):
        for r, c, d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
            for m in graph_array:
                if (m[[1, 2]] == [r, c]).all():   # array test
                    m[3] = d
        return graph_array

    new_array = alt(graph_array.copy(), matrix_coo)
    print(new_array)

    def altlist(graph_array, matrix_coo):
        for r, c, d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
            for m in graph_array:
                if m[1:3] == [r, c]:   # list test
                    m[3] = d
        return graph_array

    new_array = altlist(graph_array.tolist(), matrix_coo)
    print(new_array)

    def altarr(graph_array, matrix_coo):
        # one mask over all rows per nonzero element
        for r, c, d in zip(matrix_coo.row, matrix_coo.col, matrix_coo.data):
            mask = (graph_array[:, 1:3] == [r, c]).all(axis=1)
            graph_array[mask, 3] = d
        return graph_array

    new_array = altarr(graph_array.copy(), matrix_coo)
    print(new_array)

run

    0909:~/mypy$ python3 stack3727173.py
    [[ 1. 1. 5. 9.]
     [ 2. 5. 6. 5.]
     [ 3. 5. 7. 6.]]
    [[ 1. 1. 5. 0.5]
     [ 2. 5. 6. 5. ]
     [ 3. 5. 7. 0.8]]
    [[ 1. 1. 5. 0.5]
     [ 2. 5. 6. 5. ]
     [ 3. 5. 7. 0.8]]
    [[1.0, 1.0, 5.0, 0.5], [2.0, 5.0, 6.0, 5.0], [3.0, 5.0, 7.0, 0.80000000000000004]]
    [[ 1. 1. 5. 0.5]
     [ 2. 5. 6. 5. ]
     [ 3. 5. 7. 0.8]]

For this small example, your function is fastest. It also works with both a list and an array. For small inputs, list operations are often faster than array ones, so using array operations just to compare 2 numbers is not an improvement.
Replicating `graph_array` 1000x, the `altarr` version is 10x faster than your code. It's performing array operations on the largest dimension. I haven't tried to increase the size of `matrix_coo`.
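A comparison along these lines could be set up roughly as follows. This is only a sketch: the answer does not say how the array was replicated or timed, so the use of `np.tile` and `timeit`, and the names `loop_version` and `mask_version`, are assumptions. No timings are quoted because they depend on the machine.

```python
import timeit

import numpy as np


def loop_version(graph_array, rows, cols, data):
    # Same logic as the nested loops in the question.
    for r, c, d in zip(rows, cols, data):
        for m in graph_array:
            if m[1] == r and m[2] == c:
                m[3] = d
    return graph_array


def mask_version(graph_array, rows, cols, data):
    # One boolean mask over all rows per stored entry.
    for r, c, d in zip(rows, cols, data):
        mask = (graph_array[:, 1:3] == [r, c]).all(axis=1)
        graph_array[mask, 3] = d
    return graph_array


small = np.array([[1.0, 1.0, 5.0, 9.0],
                  [2.0, 5.0, 6.0, 5.0],
                  [3.0, 5.0, 7.0, 6.0]])
rows, cols, data = [1, 2, 5], [5, 8, 7], [0.5, 0.4, 0.8]

# Stack 1000 copies of the 3 rows to get a 3000-row array.
big = np.tile(small, (1000, 1))

# Check that both versions agree before timing them.
loop_result = loop_version(big.copy(), rows, cols, data)
mask_result = mask_version(big.copy(), rows, cols, data)
assert np.array_equal(loop_result, mask_result)

for fn in (loop_version, mask_version):
    seconds = timeit.timeit(lambda: fn(big.copy(), rows, cols, data), number=10)
    print(f"{fn.__name__}: {seconds / 10:.6f} s per call")
```

Each timed call works on a fresh copy so that earlier runs do not change the input. Growing `rows`, `cols` and `data` as well would show how both versions scale with the number of stored entries.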