
Why is subtraction faster when doing arithmetic with a NumPy array and an int compared to using vectorization with two NumPy arrays?

I am confused as to why this code:

import time
import numpy as np

X = np.random.rand(1000) + 1.0  # example data; the original X is not shown in the question

start = time.time()
for i in range(1000000):
    _ = 1 - np.log(X)
print(time.time() - start)

Executes faster than this implementation:

start = time.time()
for i in range(1000000):
    _ = np.subtract(np.ones_like(X), np.log(X))
print(time.time()-start)

My understanding was that it should be the opposite: in the second implementation I'm utilizing the speed-up provided by vectorization, since it can operate on the elements of X simultaneously rather than sequentially, which is how I assumed the first implementation works.

Can someone shed some light on this for me, as I am genuinely confused? Thank you!

Both versions of your code are equally vectorized. The array you created to try to vectorize the second version is just overhead.


NumPy vectorization doesn't refer to hardware vectorization. If the compiler is smart enough, it might end up using hardware vectorization, but NumPy doesn't explicitly use AVX or anything.

NumPy vectorization refers to writing Python-level code that operates on entire arrays at once, not to using hardware instructions that operate on multiple operands at once. It's vectorization at the Python level, not at the machine-language level. The benefit of this over writing explicit loops is that NumPy can perform the work in C-level loops instead of Python, avoiding a massive amount of dynamic dispatch, boxing, unboxing, trips through the bytecode evaluation loop, etc.
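To illustrate the point, here's a minimal sketch contrasting a Python-level loop with the whole-array form; the array contents and size are arbitrary assumptions:

```python
import math
import numpy as np

X = np.random.rand(100_000) + 1.0  # hypothetical positive sample data

# Python-level loop: one dynamic dispatch, boxing, etc. per element
loop_result = np.array([1 - math.log(v) for v in X])

# NumPy "vectorized" form: the per-element loop runs in compiled C code
vec_result = 1 - np.log(X)

# Same values either way; only where the loop runs differs
assert np.allclose(loop_result, vec_result)
```

Timing the two forms (e.g. with `%timeit`) shows the C-level version winning by a large factor, even though neither necessarily uses SIMD hardware instructions.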

Both versions of your code are vectorized in that sense, but the second one wastes a bunch of memory and memory bandwidth on writing and reading a giant array of ones.

Also, even if we were talking about hardware-level vectorization, the 1 - version would be just as amenable to it as the other version. You would just load the scalar 1 into all positions of a vector register and proceed as normal. It would involve far fewer transfers to and from memory than the second version, so it would still probably run faster.
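A quick sanity check, sketched under the assumption that X is any positive array, confirms the two forms compute identical values and differ only in the temporary ones array:

```python
import numpy as np

X = np.random.rand(1000) + 1.0  # hypothetical sample data

a = 1 - np.log(X)                            # scalar broadcast; no ones array allocated
b = np.subtract(np.ones_like(X), np.log(X))  # materializes a full array of ones first

# Identical results; only the memory traffic differs
assert np.allclose(a, b)
```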

The times are substantially the same. As others point out, there isn't any sort of hardware or multicore parallelization, just a mix of interpreted Python and compiled numpy functions.

In [289]: x = np.ones((1000,1000))

In [290]: timeit 1-np.log(x)                                                    
15 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [291]: timeit np.subtract(np.ones_like(x), np.log(x))                        
18.6 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Take the np.ones_like out of the timing loop:

In [292]: %%timeit y = np.ones_like(x) 
     ...: np.subtract(y,np.log(x)) 
     ...:  
     ...:                                                                       
15.7 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2/3 of the time is spent in the log function:

In [303]: timeit np.log(x)                                                      
10.7 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [306]: %%timeit y=np.log(x) 
     ...: np.subtract(1, y)                                                                  
3.77 ms ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The variations in how the 1 is generated are a minor part of the timing.
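The breakdown above can be reproduced outside IPython with the standard timeit module; the exact numbers are machine-dependent:

```python
import timeit
import numpy as np

X = np.ones((1000, 1000))

# np.log alone vs. the full expression
t_log = timeit.timeit(lambda: np.log(X), number=100)
t_full = timeit.timeit(lambda: 1 - np.log(X), number=100)

# The log dominates; the subtraction adds comparatively little
print(f"log only: {t_log:.3f}s, 1 - log: {t_full:.3f}s")
```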

With 'broadcasting', it's just as easy to do math with a scalar and an array as with two arrays.

The 1, a scalar (effectively an array with shape ()), is broadcast to (1,1) and then to (1000,1000), all without copying.
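That "without copying" can be seen directly with np.broadcast_to, which makes the broadcast explicit: the result is a view with zero strides, not a materialized array:

```python
import numpy as np

one = np.array(1.0)  # scalar: shape ()
view = np.broadcast_to(one, (1000, 1000))

# Zero strides mean every element points at the same single value in memory
assert view.shape == (1000, 1000)
assert view.strides == (0, 0)
```

This is why generating the 1 costs essentially nothing compared to np.ones_like, which writes out a full million-element array.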

I'm certainly no numpy expert, but my guess would be that the first example uses only one array, while the second actually creates an array of ones first and then subtracts. The latter requires twice the memory and one extra step to create the array of ones.

On an x86 CPU, both probably end up as some kind of AVX instructions that work on 4 numbers at a time. Unless of course you are using a fancy CPU with a SIMD width larger than the length of your vector, and that CPU is supported by numpy.

Case A runs just one iterator on the CPU, while case B has two iterators over two vectors as large as X, which demands a lot of context switching in the thread if not optimized. Case B is a more general version of case A...
