Most efficient way to rearrange 2D numpy array from another 2D index array

In brief

In Python 3.6 and using Numpy, what would be the most efficient way to rearrange the elements of a 2D array according to the indices in a second 2D index array of the same shape?

Detailed

Suppose I have the following two 5 x 5 arrays, called A and B:

import numpy as np
A = np.array([[0.32, 0.35, 0.88, 0.63, 1.  ],
              [0.23, 0.69, 0.98, 0.22, 0.96],
              [0.7 , 0.51, 0.09, 0.58, 0.19],
              [0.98, 0.42, 0.62, 0.94, 0.46],
              [0.48, 0.59, 0.17, 0.23, 0.98]])

B = np.array([[4, 0, 3, 2, 1],
              [3, 2, 4, 1, 0],
              [4, 3, 0, 2, 1],
              [4, 2, 0, 3, 1],
              [0, 3, 1, 2, 4]])

I can successfully rearrange A, using B as an index array, with np.array(list(map(lambda i, j: j[i], B, A))):

array([[1.  , 0.32, 0.63, 0.88, 0.35],
       [0.22, 0.98, 0.96, 0.69, 0.23],
       [0.19, 0.58, 0.7 , 0.09, 0.51],
       [0.46, 0.62, 0.98, 0.94, 0.42],
       [0.48, 0.23, 0.59, 0.17, 0.98]])
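
To make the intended operation explicit: each row of B is applied as an index to the matching row of A, so the result satisfies X[i, j] = A[i, B[i, j]]. The sketch below spells that out with plain loops on the first two rows of the example and checks it against the map-based one-liner. The helper name rearrange_reference is hypothetical and only serves as a slow reference, not as a proposed solution.

```python
import numpy as np

# First two rows of the example arrays A and B from the question.
A = np.array([[0.32, 0.35, 0.88, 0.63, 1.  ],
              [0.23, 0.69, 0.98, 0.22, 0.96]])
B = np.array([[4, 0, 3, 2, 1],
              [3, 2, 4, 1, 0]])

def rearrange_reference(values, index):
    """Hypothetical slow reference: out[i, j] = values[i, index[i, j]]."""
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            out[i, j] = values[i, index[i, j]]
    return out

X_ref = rearrange_reference(A, B)

# The one-liner from the question, for comparison.
X_map = np.array(list(map(lambda i, j: j[i], B, A)))

print(np.array_equal(X_ref, X_map))
```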

However, when the dimensions of A and B increase, such a solution becomes really inefficient. If I am not mistaken, that is because:

  • using the lambda loops over all rows of A instead of relying on Numpy vectorization
  • mapping is slow
  • converting the list to an array eats precious time.

Since in my real use case those arrays can grow quite big, and I have to reorder many of them in a long loop, a lot of my current performance bottleneck (measured with a profiler) comes from that single line of code above.

My question: what would be the most efficient, more Numpy-smart way of achieving the above?

Toy code to test this on general arrays and time the process could be:

import numpy as np
nRows = 20000
nCols = 10000
A = np.round(np.random.uniform(0, 1, (nRows, nCols)), 2)
B = np.full((nRows, nCols), range(nCols))
for r in range(nRows):
    np.random.shuffle(B[r])
%time X = np.array(list(map(lambda i, j: j[i], B, A)))
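
As a side note on the test setup, the row-by-row shuffle is itself a Python loop and can take a while for large nRows. One possible loop-free way to build an index array whose rows are independent random permutations is to argsort random keys along each row. This is only a sketch: the small sizes and the seed are assumptions chosen for illustration, and the temporary float array needs as much memory as a float array of B's shape.

```python
import numpy as np

# Hypothetical small sizes for illustration; the question uses 20000 x 10000.
n_rows, n_cols = 200, 50

rng = np.random.default_rng(0)  # the seed is an arbitrary assumption

# Sorting the positions of random keys yields one random permutation
# of 0..n_cols-1 per row, without looping over rows in Python.
B = np.argsort(rng.random((n_rows, n_cols)), axis=1)

# Sanity check: every row contains each column index exactly once.
expected = np.broadcast_to(np.arange(n_cols), B.shape)
print(np.array_equal(np.sort(B, axis=1), expected))
```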

A comparison with three other possibilities:

import numpy as np
import time

# Input
nRows = 20000
nCols = 10000
A = np.round(np.random.uniform(0, 1, (nRows, nCols)), 2)
B = np.full((nRows, nCols), range(nCols))
for r in range(nRows):
  np.random.shuffle(B[r])

# Original
t_start = time.time()
X = np.array(list(map(lambda i, j: j[i], B, A)))
print('Timer 1:', time.time()-t_start, 's')

# FOR loop
t_start = time.time()
X = np.zeros((nRows, nCols))
for i in range(nRows):
  X[i] = A[i][B[i]]
print('Timer 2:', time.time()-t_start, 's')

# take_along_axis
t_start = time.time()
X = np.take_along_axis(A,B,1)
print('Timer 3:', time.time()-t_start, 's')

# Indexing
t_start = time.time()
X = A[ np.arange(nRows)[:,None],B]
print('Timer 4:', time.time()-t_start, 's')

Output:

% python3 script.py
Timer 1: 2.191567897796631 s
Timer 2: 1.3516249656677246 s
Timer 3: 1.675267219543457 s
Timer 4: 1.646852970123291 s

For a low number of columns, (nRows, nCols) = (200000, 10), the results are completely different, however:

% python3 script.py
Timer 1: 0.2729799747467041 s
Timer 2: 0.22678399085998535 s
Timer 3: 0.016162633895874023 s
Timer 4: 0.014748811721801758 s
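
Since each timer above is a single measurement, the numbers may vary from run to run. A possible way to make the comparison more repeatable is to wrap each variant in a function, confirm that all of them return the same array, and take the best of several timeit runs. The function names, array sizes, seed, and repeat counts below are assumptions made for this sketch, and it prints whatever timings your machine produces.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, an assumption
n_rows, n_cols = 2000, 10           # hypothetical small test size
A = rng.random((n_rows, n_cols)).round(2)
B = np.argsort(rng.random((n_rows, n_cols)), axis=1)  # random row permutations

def via_map(A, B):
    # Original one-liner from the question.
    return np.array(list(map(lambda i, j: j[i], B, A)))

def via_loop(A, B):
    # Explicit loop over rows, indexing one row at a time.
    X = np.empty_like(A)
    for i in range(A.shape[0]):
        X[i] = A[i][B[i]]
    return X

def via_take_along_axis(A, B):
    return np.take_along_axis(A, B, axis=1)

def via_indexing(A, B):
    # Broadcast a column of row numbers against the index array.
    return A[np.arange(A.shape[0])[:, None], B]

candidates = [via_map, via_loop, via_take_along_axis, via_indexing]
reference = via_map(A, B)
results = {}
for fn in candidates:
    # All variants should produce exactly the same array.
    assert np.array_equal(fn(A, B), reference)
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(A, B), number=5, repeat=3)) / 5
    results[fn.__name__] = best
    print(f"{fn.__name__:>20}: {best:.6f} s per call")
```

Changing n_rows and n_cols to the two shapes used above should show whether the crossover between the loop and the vectorized variants is reproducible.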
