
bad performance with partitioning of numpy arrays

I am new to numpy arrays and am running into a performance issue.

Processing 3 million rows takes around 8 minutes, and I wonder whether partitioning the numpy array as shown below is the best way to process its results.

    import re, math, time
    import numpy as np
    from tqdm import tqdm

    # stand-in for the array read from the hdf5 file
    hdf5_array = np.random.rand(3000000, 3, 4, 8, 1, 1, 1, 2)
    ndarray = np.squeeze(hdf5_array)                 # drop the size-1 axes -> (3000000, 3, 4, 8, 2)
    print(hdf5_array.shape, ndarray.shape)
    num_elm = ndarray.shape[0]                       # 3000000 elements
    num_iter = ndarray.shape[2]                      # 4 iterations
    num_int_points = ndarray.shape[3]                # 8 integration points
    res_array = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):          # row: (3, 4, 8, 2)
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row, [xyz], axis=0), axis=0)            # (4, 8, 2)
            for iter in range(num_iter):
                iter_row = np.squeeze(np.take(xyz_array, [iter], axis=0), axis=0)  # (8, 2)
                mean_list = np.mean(iter_row, axis=0)                              # (2,)
    print(type(res_array), res_array.ndim, res_array.dtype, res_array.shape)

Finally, a mean value of the results should be created and saved into a new array. Maybe the nested loops are also part of the problem, but I assume they cannot be avoided?

Maybe someone has a good hint about which direction I should go to improve the performance?

The basic idea is that an array from an hdf5 file should be processed to get the average of 8 different values in that array.

So in the end I want an array of size (4, 3000000, 3, 2) as the result, containing the average of those 8 values from the original array; the rest should stay the same.

But to touch all 8 values that need to be averaged, I go into the loops and separate them.

Avoiding the last step, i.e. replacing np.mean with a loop over the [8, 2] array, gives a little bit of speed-up, but only a little...

    # replaces the np.mean call in the innermost loop
    sum_r = 0.0
    sum_i = 0.0
    for p in range(num_int_points):
        sum_r = sum_r + iter_row[p][0]
        sum_i = sum_i + iter_row[p][1]
    res_array[iter, i, xyz, 0:2] = [sum_r / float(num_int_points), sum_i / float(num_int_points)]
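
To make the intended mapping concrete, here is a small check on a reduced array (an illustrative example only: small and want are made-up names, and only the sizes are shrunk so it runs instantly; the axis layout is the same as above):

    import numpy as np

    small = np.squeeze(np.random.rand(10, 3, 4, 8, 1, 1, 1, 2))    # (10, 3, 4, 8, 2)
    want = np.zeros([4, 10, 3, 2])
    for i in range(10):                      # elements
        for xyz in range(3):                 # x, y, z
            for it in range(4):              # iterations
                # average the 8 integration points, keep the 2 components
                want[it, i, xyz, :] = small[i, xyz, it, :, :].mean(axis=0)
    print(want.shape)                        # (4, 10, 3, 2)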

The nested loops are certainly killing your performance.

We can directly perform this computation with:

    %%time
    res_array_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)

with timing:

    CPU times: total: 6.86 s
    Wall time: 6.84 s
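
The two swapaxes calls only move the iteration axis to the front of the result. If the axis bookkeeping is hard to read, the same reshuffle can be written with a single np.moveaxis (res_array_alt is just an illustrative name; this is an equivalent spelling, not a further speed-up):

    # np.mean(ndarray, axis=3) has shape (num_elm, 3, num_iter, 2);
    # moving axis 2 (num_iter) to the front gives (num_iter, num_elm, 3, 2)
    res_array_alt = np.moveaxis(np.mean(ndarray, axis=3), 2, 0)
    print(np.allclose(res_array_direct, res_array_alt))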

This is incredibly fast compared to the nested loops because the whole reduction runs inside NumPy's compiled C code. Once you introduce the nested loops, you are performing Python-level loops and many small per-slice operations, which is far less efficient.
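
To see where the time goes, you can compare one reduction over the whole array against many tiny per-slice reductions. This is a minimal sketch on a scaled-down array (the names a, t_vec and t_loop are made up for illustration, and the exact numbers depend on your machine), but it shows that the per-call Python and dispatch overhead dominates:

    import timeit
    import numpy as np

    a = np.random.rand(100_000, 8, 2)   # scaled-down stand-in: 100k slices of shape (8, 2)

    # one NumPy call over the whole array
    t_vec = timeit.timeit(lambda: a.mean(axis=1), number=10) / 10

    # one NumPy call per (8, 2) slice, driven by a Python loop
    t_loop = timeit.timeit(lambda: [a[i].mean(axis=0) for i in range(a.shape[0])], number=1)

    print(f"vectorized: {t_vec:.4f} s   per-slice loop: {t_loop:.4f} s")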

Summarizing the timing:

    Direct : 6.48 s
    1 Loop : 39.9 s
    2 Loops: 124 s = 2 min 4 s
    3 Loops: 473 s = 7 min 53 s

Details below:

We can see the progressive effect of the loops. Let's add one loop back in:

    %%time
    res_array_1 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        res_array_1[:, i, :, :] = np.swapaxes(np.mean(row, axis=2), 0, 1)

    print(np.allclose(res_array_direct, res_array_1))

This single, manual loop versus the full vectorization takes us from ~7 s to ~40 s:

    100%|██████████| 3000000/3000000 [00:38<00:00, 77730.88it/s]
    True
    CPU times: total: 39.9 s
    Wall time: 39.6 s

With the second manual loop added back, the code is:

    %%time
    res_array_2 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row, [xyz], axis=0), axis=0)
            res_array_2[:, i, xyz, :] = np.mean(xyz_array, axis=1)

    print(np.allclose(res_array_direct, res_array_2))

and output:

    100%|██████████| 3000000/3000000 [02:03<00:00, 24387.97it/s]
    True
    CPU times: total: 2min 4s
    Wall time: 2min 4s

That is up to 2 minutes. Finally, with all 3 of the loops you have, we get:

    %%time
    res_array_3 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row, [xyz], axis=0), axis=0)
            for iter in range(num_iter):
                iter_row = np.squeeze(np.take(xyz_array, [iter], axis=0), axis=0)
                mean_list = np.mean(iter_row, axis=0)
                res_array_3[iter, i, xyz, :] = mean_list

    print(np.allclose(res_array_direct, res_array_3))

and output:

    100%|██████████| 3000000/3000000 [07:52<00:00, 6348.42it/s]
    True
    CPU times: total: 7min 57s
    Wall time: 7min 53s
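
One side note, under the assumption that memory matters for a (4, 3000000, 3, 2) result: your res_array is float32, while np.mean returns float64 by default. np.mean accepts a dtype argument if you want to keep the smaller type:

    # compute and return the mean in float32, halving the memory of the result
    res_array_direct_f32 = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3, dtype=np.float32), 0, 1), 0, 2)
    print(res_array_direct_f32.dtype, res_array_direct_f32.nbytes / 1e6, "MB")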
