
Bad performance with partitioning of numpy arrays

I am new to numpy arrays and am running into a performance issue.

Processing 3M rows takes around 8 minutes, and I am wondering whether partitioning the numpy array as shown below is the best way to process its contents:

   import re, math, time
   import numpy as np
   from tqdm import tqdm

   hdf5_array=np.random.rand(3000000, 3, 4, 8, 1, 1, 1, 2)
   ndarray = np.squeeze(hdf5_array)
   print (hdf5_array.shape, ndarray.shape)
   num_elm = ndarray.shape[0]
   num_iter = ndarray.shape[2]
   num_int_points = ndarray.shape[3]
   res_array = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
   for i, row in enumerate(tqdm(ndarray)):
       for xyz in range(3):
           xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
           for iter in range(num_iter):
               iter_row = np.squeeze(np.take(xyz_array,[iter],axis=0), axis=0)
               mean_list = np.mean(iter_row, axis=0)
   print (type(res_array), res_array.ndim, res_array.dtype, res_array.shape)

Finally, a mean value of the results should be computed and saved into a new array. Maybe the nested loops are the problem, but I assume they cannot be avoided?

Maybe someone has a good hint about which direction I should go to improve the performance?

The basic idea is that an array read from an HDF5 file should be processed to get the average of 8 different values in that array.

So in the end I want an array of shape (4, 3000000, 3, 2) that contains the average of the 8 values in the original array; everything else should stay the same.

To touch all 8 values that need to be averaged, I go into the loops and separate them.

Skipping the last step, avoiding np.mean, and looping over the [8,2] array instead gives a small speed-up, but only a small one:

    sum_r = 0.0 
    sum_i = 0.0
    for p in range(num_int_points):
        sum_r = sum_r + iter_row[p][0]
        sum_i = sum_i + iter_row[p][1]
    res_array[iter, i, xyz, 0:2] = [sum_r / float(num_int_points), sum_i / float(num_int_points)]

The nested loops are certainly killing your performance.

We can perform this computation directly with:

%%time

res_array_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)

with timing:

CPU times: total: 6.86 s
Wall time: 6.84 s

This is far faster than the nested loops because it takes full advantage of NumPy's compiled C implementation. Once you introduce the nested loops, each iteration and each small operation runs through the Python interpreter, which is far less efficient.
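As a sanity check of the vectorized expression, here is a minimal sketch on a deliberately tiny array. The array sizes and the helper name `mean_over_points` are assumptions for illustration and do not come from the original post. It also uses a single `transpose` call, which should produce the same axis order as the two `swapaxes` calls.

```python
import numpy as np

def mean_over_points(arr):
    """Average over the integration-point axis (axis 3) and reorder
    (elm, xyz, iter, comp) -> (iter, elm, xyz, comp)."""
    return np.mean(arr, axis=3).transpose(2, 0, 1, 3)

rng = np.random.default_rng(0)
small = rng.random((5, 3, 4, 8, 2))  # tiny stand-in for the squeezed array

direct = mean_over_points(small)
# Same computation written with the two swapaxes calls
swapped = np.swapaxes(np.swapaxes(np.mean(small, axis=3), 0, 1), 0, 2)

# Reference result from explicit Python loops
looped = np.zeros((4, 5, 3, 2))
for i in range(5):
    for xyz in range(3):
        for it in range(4):
            looped[it, i, xyz, :] = small[i, xyz, it].mean(axis=0)

print(direct.shape)
```

Note that the vectorized result keeps the input's float64 dtype; if float32 is wanted, as in the preallocated `res_array`, an `.astype(np.float32)` on the result would be one way to get it.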

Summarizing the timing:

Direct : 6.84 s
1 Loop : 39.9 s
2 Loops: 124 s = 2 min 4 s
3 Loops: 473 s = 7 min 53 s
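The `%%time` lines used here are IPython/Jupyter cell magics. Outside a notebook, a comparable measurement can be sketched with `time.perf_counter`. The helper `time_call` and the reduced array size are assumptions chosen so the example runs quickly; the absolute numbers will differ from the ones reported here.

```python
import time
import numpy as np

def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def direct(arr):
    # Fully vectorized: mean over axis 3, then reorder the axes
    return np.mean(arr, axis=3).transpose(2, 0, 1, 3)

def one_loop(arr):
    # One Python-level loop over the first axis
    out = np.zeros((arr.shape[2], arr.shape[0], 3, 2))
    for i, row in enumerate(arr):
        out[:, i, :, :] = np.swapaxes(np.mean(row, axis=2), 0, 1)
    return out

arr = np.random.rand(20_000, 3, 4, 8, 2)  # reduced size for a quick run
res_d, t_d = time_call(direct, arr)
res_1, t_1 = time_call(one_loop, arr)
print(f"direct: {t_d:.3f} s, one loop: {t_1:.3f} s")
```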

Details below:

We can see the progressive effect of the loops. Let's add one loop back in:

%%time

res_array_1 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
for i, row in enumerate(tqdm(ndarray)):
    res_array_1[:, i, :, :] = np.swapaxes(np.mean(row, axis=2), 0, 1)

print(np.allclose(res_array_direct, res_array_1))

Going from the vectorized version to this single manual loop takes us from ~7 s to ~40 s:

100%|██████████| 3000000/3000000 [00:38<00:00, 77730.88it/s]
True
CPU times: total: 39.9 s
Wall time: 39.6 s
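If a loop over the first axis is needed anyway (for example because the data is read from the HDF5 file piece by piece, which is an assumption here and not something stated in the question), one option is to loop over blocks of rows instead of single rows, so that the Python overhead is paid once per block. The sketch below is untimed; `mean_in_chunks` and its `chunk` parameter are hypothetical names.

```python
import numpy as np

def mean_in_chunks(arr, chunk=100_000):
    """Average over axis 3 block by block along axis 0.

    `chunk` is a hypothetical tuning parameter, not from the original post.
    """
    num_elm, num_xyz, num_iter, _, num_comp = arr.shape
    out = np.empty((num_iter, num_elm, num_xyz, num_comp), dtype=np.float32)
    for start in range(0, num_elm, chunk):
        stop = min(start + chunk, num_elm)
        block = np.mean(arr[start:stop], axis=3)          # (n, xyz, iter, comp)
        out[:, start:stop] = block.transpose(2, 0, 1, 3)  # (iter, n, xyz, comp)
    return out

demo = np.random.rand(1_000, 3, 4, 8, 2)  # small stand-in array
chunked = mean_in_chunks(demo, chunk=256)
print(chunked.shape, chunked.dtype)
```

With a chunk size of 1 this degenerates into the single-row loop, and with a chunk size equal to the number of rows it is the fully vectorized version, so the parameter trades memory for loop overhead.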

With the second manual loop, the code is:

%%time

res_array_2 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
for i, row in enumerate(tqdm(ndarray)):
    for xyz in range(3):
        xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
        res_array_2[:, i, xyz, :] = np.mean(xyz_array, axis=1)

print(np.allclose(res_array_direct, res_array_2))

and output:

100%|██████████| 3000000/3000000 [02:03<00:00, 24387.97it/s]
True
CPU times: total: 2min 4s
Wall time: 2min 4s
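As a side observation that is not part of the timings in this answer: the `np.squeeze(np.take(row, [xyz], axis=0), axis=0)` pattern selects the same data as plain indexing with `row[xyz]`. The sketch below checks this on one small example row. Basic indexing returns a view, whereas `np.take` with an index list returns a copy; how much that helps inside the loop was not measured here.

```python
import numpy as np

row = np.random.rand(3, 4, 8, 2)  # one "row" of the squeezed array
xyz = 1

via_take = np.squeeze(np.take(row, [xyz], axis=0), axis=0)
via_index = row[xyz]

same_values = np.array_equal(via_take, via_index)
index_is_view = np.shares_memory(via_index, row)     # basic indexing: view
take_is_copy = not np.shares_memory(via_take, row)   # np.take: new array
print(same_values, index_is_view, take_is_copy)
```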

That brings us up to about 2 minutes. Finally, with all 3 loops from your code, we get:

%%time

res_array_3 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
for i, row in enumerate(tqdm(ndarray)):
    for xyz in range(3):
        xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
        for iter in range(num_iter):
            iter_row = np.squeeze(np.take(xyz_array,[iter],axis=0), axis=0)
            mean_list = np.mean(iter_row, axis=0)
            res_array_3[iter, i, xyz, :] = mean_list

print(np.allclose(res_array_direct, res_array_3))

and output:

100%|██████████| 3000000/3000000 [07:52<00:00, 6348.42it/s]
True
CPU times: total: 7min 57s
Wall time: 7min 53s
