
numpy.where on a 2D array is slower than a list comprehension of numpy.where on 1D arrays

  1. Why is Numpy slower than list comprehensions in this case?

  2. What is the best way to vectorize this grid-construction?

In [1]: import numpy as np

In [2]: mesh = np.linspace(-1, 1, 3000)

In [3]: rowwise, colwise = np.meshgrid(mesh, mesh)

In [4]: f = lambda x, y: np.where(x > y, x**2, x**3)

# Using 2D arrays:
In [5]: %timeit f(colwise, rowwise)
285 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Using 1D array and list-comprehension:
In [6]: %timeit np.array([f(x, mesh) for x in mesh])
58 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Equivalent result
In [7]: np.allclose(f(colwise, rowwise), np.array([f(x, mesh) for x in mesh]))
True

Why is Numpy slower than list comprehensions in this case?

You are essentially suffering from two problems.

The first is cache utilization. The second version works on a much smaller subset of the space, (3000, 1) against (1, 3000), for each calculation, which fits nicely in your cache, so x > y, x**2 and x**3 can all sit inside the cache, which speeds things up somewhat. The first version calculates each of those three for a 3000x3000 array (9 million entries), which can never fit inside your cache (usually ~2-5 MB). np.where is then called and has to fetch parts of the data from RAM (not cache) to do its copying, and the result is written back piece by piece to RAM, which is very expensive.
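To put rough numbers on that (my own back-of-the-envelope check, not part of the original question):

# footprint check, assuming float64 (8 bytes per element)
import numpy as np
mesh = np.linspace(-1, 1, 3000)
grid = np.meshgrid(mesh, mesh)[0]
grid.nbytes   # 72_000_000 bytes: ~72 MB for each full-size temporary
mesh.nbytes   # 24_000 bytes: ~24 KB for the row handled per comprehension step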

Also, NumPy's implementation of np.where is somewhat alignment-unaware here and is accessing your arrays column-wise, not row-wise, so it is essentially grabbing each and every entry from RAM and not utilizing the cache at all.

Your list comprehension actually solves this issue, as it only deals with a small subset of the data at any given time, so all of it can sit in your cache. It still uses np.where; it just forces a row-wise access pattern and therefore utilizes your cache.

The second problem is the calculation of x**2 and x**3: floating-point exponentiation is very expensive. Consider replacing them with x*x and x*x*x, as in the sketch below.
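For example, a drop-in variant of f using repeated multiplies (a sketch; the speedup will vary by machine):

# same values as f, but cheap multiplies instead of floating-point pow
f_mul = lambda x, y: np.where(x > y, x*x, x*x*x)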

What is the best way to vectorize this grid-construction?

You have apparently already written it: it is your second method.

An even faster but probably unnecessary optimization, again exploiting the cache, is to write your own code in C and call it from Python: then x*x or x*x*x is only evaluated where actually needed, and you never have to store x > y, x*x and x*x*x as full arrays. The speedup likely won't be worth the trouble, though.
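Short of writing C, a pure-NumPy sketch of the same "only evaluate where needed" idea is possible with the where= argument of ufuncs. This is my own illustration, not code from the question, and it assumes x and y are full same-shape 2D arrays (not sparse grids):

def f_masked(x, y):
    # x**2 everywhere first: a single cheap multiply
    out = x * x
    # overwrite with x**3 only where the cubic branch is selected;
    # elements where the mask is False keep their x*x value
    np.multiply(out, x, out=out, where=(x <= y))
    return out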

In [1]: mesh = np.linspace(-1, 1, 3000)
   ...: rowwise, colwise = np.meshgrid(mesh, mesh)
   ...: f = lambda x, y: np.where(x > y, x**2, x**3)

In addition, let's make the sparse grid:

In [2]: r1,c1 = np.meshgrid(mesh,mesh,sparse=True)
In [3]: rowwise.shape
Out[3]: (3000, 3000)
In [4]: r1.shape
Out[4]: (1, 3000)
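These sparse grids still broadcast to the full result inside f; a quick shape check using the names above:

# the (3000, 1) and (1, 3000) operands broadcast to the full grid,
# but x**2 and x**3 are each computed on only 3000 elements
f(c1, r1).shape   # (3000, 3000), same as f(colwise, rowwise)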

With the sparse grid, times are even better than your iteration:

In [5]: timeit f(colwise, rowwise)
645 ms ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit f(c1,r1)
108 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: timeit np.array([f(x, mesh) for x in mesh])
166 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The other answer stresses caching. Other posts have shown that a modest amount of iteration can be faster than working with very large arrays, such as when using matmul. I don't know whether it's caching or some other memory-management complication that slows this down.

But at 3000*3000*8 bytes (~72 MB per array) I'm not sure that's the issue here. Instead I think it's the time the x**2 and x**3 expressions require.

The arguments of np.where are fully evaluated before being passed in.

The condition expression takes a modest amount of time:

In [8]: timeit colwise>rowwise
24.2 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But the power expression for the (3000,3000) array takes a majority of the total time:

In [9]: timeit rowwise**3
467 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Contrast that with the time required for the sparse equivalent:

In [10]: timeit r1**3
142 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That is about 3288x faster for 3000x fewer elements, so the large-array case scales a bit worse than O(n).

Repeated multiplication is better:

In [11]: timeit rowwise*rowwise*rowwise
116 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [f(x, mesh) for x in mesh], x**2 and x**3 operate on a scalar, so they are fast, even though the call is repeated 3000 times.
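To see what a single comprehension step does (an illustrative breakdown using the names above):

x = mesh[0]                           # one np.float64 scalar
row = np.where(x > mesh, x**2, x**3)  # x**2 and x**3 are scalar powers
row.shape                             # (3000,): one row of the final grid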

In fact, if we take the power calculations out of the timing, the whole-array where is relatively fast:

In [15]: %%timeit x2,x3 = rowwise**2, rowwise**3
    ...: np.where(rowwise>colwise, x2,x3)
89.8 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
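Putting the two observations together, sparse grids plus repeated multiplies, gives the natural version to try; a sketch only, as actual timings depend on your machine:

g = lambda x, y: np.where(x > y, x*x, x*x*x)   # no pow calls
r1, c1 = np.meshgrid(mesh, mesh, sparse=True)  # powers touch 3000 elements, not 9 million
result = g(c1, r1)                             # broadcasts to the full (3000, 3000) grid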
