
Why is subtraction faster when doing arithmetic with a NumPy array and an int compared to using vectorization with two NumPy arrays?

I am confused as to why this code:

import time
import numpy as np

# X is a NumPy array of positive values (its definition isn't shown here)
start = time.time()
for i in range(1000000):
    _ = 1 - np.log(X)
print(time.time() - start)

Executes faster than this implementation:

start = time.time()
for i in range(1000000):
    _ = np.subtract(np.ones_like(X), np.log(X))
print(time.time()-start)

My understanding was that it should be the opposite: in the second implementation I'm utilizing the speed-up provided by vectorization, since it operates on the elements of X simultaneously rather than sequentially, which is how I assumed the first implementation works.

Can someone shed some light on this for me, as I am genuinely confused? Thank you!

Both versions of your code are equally vectorized. The array you created to try to vectorize the second version is just overhead.


NumPy vectorization doesn't refer to hardware vectorization. If the compiler is smart enough, it might end up using hardware vectorization, but NumPy doesn't explicitly use AVX or anything.

NumPy vectorization refers to writing Python-level code that operates on entire arrays at once, not using hardware instructions that operate on multiple operands at once. It's vectorization at the Python level, not at the machine language level. The benefit of this over writing explicit loops is that NumPy can perform the work in C-level loops instead of Python, avoiding a massive amount of dynamic dispatch, boxing, unboxing, trips through the bytecode evaluation loop, etc.
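To make that concrete, here is a minimal sketch (the array X below is just a placeholder, not the one from the question) comparing an explicit element-by-element loop against the whole-array expression:

import time
import numpy as np

# Placeholder input: any positive-valued array works; the size is arbitrary
X = np.random.rand(500, 500) + 1.0

start = time.time()
out = np.empty_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        out[i, j] = 1 - np.log(X[i, j])   # one dispatch per element, all in Python
print("explicit loops:", time.time() - start)

start = time.time()
out = 1 - np.log(X)   # one dispatch for the whole array; the loop runs in C
print("whole-array expression:", time.time() - start)

On a typical machine the looped version is slower by a couple of orders of magnitude, even though neither version explicitly uses SIMD instructions.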

Both versions of your code are vectorized in that sense, but the second one wastes a bunch of memory and memory bandwidth on writing and reading a giant array of ones.

Also, even if we were talking about hardware-level vectorization, the 1 - version would be just as amenable to it as the other version. You would just load the scalar 1 into every lane of a vector register and proceed as normal. It would involve far fewer transfers to and from memory than the second version, so it would still probably run faster.

Times are substantially the same. As others point out, there isn't any sort of hardware or multicore parallelization here, just a mix of interpreted Python and compiled NumPy functions.

In [289]: x = np.ones((1000,1000))

In [290]: timeit 1-np.log(x)                                                    
15 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [291]: timeit np.subtract(np.ones_like(x), np.log(x))                        
18.6 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Take the np.ones_like out of the timing loop:

In [292]: %%timeit y = np.ones_like(x)
     ...: np.subtract(y, np.log(x))
15.7 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2/3 of the time is spent in the log function:

In [303]: timeit np.log(x)
10.7 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [306]: %%timeit y = np.log(x)
     ...: np.subtract(1, y)
3.77 ms ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The variations in how the 1 is generated are a minor part of the timing.
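Incidentally, if you wanted to shave off the remaining temporaries too, ufuncs accept an out= argument so results can be written into a preallocated buffer; a small sketch:

import numpy as np

x = np.ones((1000, 1000))
buf = np.empty_like(x)

np.log(x, out=buf)            # write log(x) into the preallocated buffer
np.subtract(1, buf, out=buf)  # compute 1 - log(x) in place, no new array

This doesn't change the amount of arithmetic, only the memory traffic.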

With broadcasting, it's just as easy to do math with a scalar and an array as with two arrays.

The 1, a scalar (effectively an array with shape ()), is broadcast to (1,1) and then to (1000,1000), all without copying.
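One way to see that no copy is made: np.broadcast_to builds that broadcast view explicitly, and its strides show every element aliasing the same memory location:

import numpy as np

b = np.broadcast_to(1.0, (1000, 1000))  # a read-only (1000, 1000) view of the scalar 1
print(b.shape)    # (1000, 1000)
print(b.strides)  # (0, 0): all elements point at one value; nothing is copied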

I'm certainly no NumPy expert, but my guess would be that the first example uses only one array, while the second actually creates an array of ones first and then subtracts. The latter requires double the memory and one extra step to create the array of ones.
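That guess is easy to check. Here is a rough sketch using tracemalloc (modern NumPy reports its allocations to it) to compare the peak memory of the two expressions:

import tracemalloc
import numpy as np

x = np.ones((1000, 1000))  # an 8 MB float64 array

tracemalloc.start()
_ = 1 - np.log(x)
scalar_peak = tracemalloc.get_traced_memory()[1]

tracemalloc.reset_peak()   # requires Python 3.9+
_ = np.subtract(np.ones_like(x), np.log(x))
array_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# Expect array_peak to exceed scalar_peak by roughly one extra 8 MB array
print(scalar_peak, array_peak)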

On an x86 CPU, both probably end up as some kind of AVX instructions that work on 4 numbers at a time. Unless, of course, you are using a fancy CPU with a SIMD width larger than the length of your vector, and that CPU is supported by NumPy.
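Whether a particular NumPy build actually dispatches to SIMD kernels depends on how it was compiled and on the host CPU; np.show_config() prints what the build detected (recent versions list the supported SIMD extensions):

import numpy as np

np.show_config()  # build info; newer versions include detected SIMD extensions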

Case A runs just one iterator on the CPU, while case B has two iterators over two arrays as large as X, which demands a lot of extra switching within the thread if not optimized. Case B is a more general version of case A...
