NumPy performance: uint8 vs. float and multiplication vs. division?

I have just noticed that the execution time of a script of mine nearly halves when I change just one multiplication to a division.

To investigate this, I have written a small example:

import numpy as np                                                                                                                                                                                
import timeit

# uint8 array
arr1 = np.random.randint(0, high=256, size=(100, 100), dtype=np.uint8)

# float32 array
arr2 = np.random.rand(100, 100).astype(np.float32)
arr2 *= 255.0


def arrmult(a):
    """ 
    mult, read-write iterator
    """
    b = a.copy()
    for item in np.nditer(b, op_flags=["readwrite"]):
        item[...] = (item + 5) * 0.5

def arrmult2(a):
    """ 
    mult, index iterator
    """
    b = a.copy()
    for i, j in np.ndindex(b.shape):
        b[i, j] = (b[i, j] + 5) * 0.5

def arrmult3(a):
    """
    mult, vectorized
    """
    b = a.copy()
    b = (b + 5) * 0.5

def arrdiv(a):
    """ 
    div, read-write iterator 
    """
    b = a.copy()
    for item in np.nditer(b, op_flags=["readwrite"]):
        item[...] = (item + 5) / 2

def arrdiv2(a):
    """ 
    div, index iterator
    """
    b = a.copy()
    for i, j in np.ndindex(b.shape):
        b[i, j] = (b[i, j] + 5) / 2

def arrdiv3(a):                                                                                                     
    """                                                                                                             
    div, vectorized                                                                                                 
    """                                                                                                             
    b = a.copy()                                                                                                    
    b = (b + 5) / 2                                                                                               




def print_time(name, t):                                                                                            
    print("{: <10}: {: >6.4f}s".format(name, t))                                                                    

timeit_iterations = 100                                                                                             

print("uint8 arrays")                                                                                               
print_time("arrmult", timeit.timeit("arrmult(arr1)", "from __main__ import arrmult, arr1", number=timeit_iterations))
print_time("arrmult2", timeit.timeit("arrmult2(arr1)", "from __main__ import arrmult2, arr1", number=timeit_iterations))
print_time("arrmult3", timeit.timeit("arrmult3(arr1)", "from __main__ import arrmult3, arr1", number=timeit_iterations))
print_time("arrdiv", timeit.timeit("arrdiv(arr1)", "from __main__ import arrdiv, arr1", number=timeit_iterations))  
print_time("arrdiv2", timeit.timeit("arrdiv2(arr1)", "from __main__ import arrdiv2, arr1", number=timeit_iterations))
print_time("arrdiv3", timeit.timeit("arrdiv3(arr1)", "from __main__ import arrdiv3, arr1", number=timeit_iterations))

print("\nfloat32 arrays")                                                                                           
print_time("arrmult", timeit.timeit("arrmult(arr2)", "from __main__ import arrmult, arr2", number=timeit_iterations))
print_time("arrmult2", timeit.timeit("arrmult2(arr2)", "from __main__ import arrmult2, arr2", number=timeit_iterations))
print_time("arrmult3", timeit.timeit("arrmult3(arr2)", "from __main__ import arrmult3, arr2", number=timeit_iterations))
print_time("arrdiv", timeit.timeit("arrdiv(arr2)", "from __main__ import arrdiv, arr2", number=timeit_iterations))  
print_time("arrdiv2", timeit.timeit("arrdiv2(arr2)", "from __main__ import arrdiv2, arr2", number=timeit_iterations))
print_time("arrdiv3", timeit.timeit("arrdiv3(arr2)", "from __main__ import arrdiv3, arr2", number=timeit_iterations))

This prints the following timings:

uint8 arrays
arrmult   : 2.2004s
arrmult2  : 3.0589s
arrmult3  : 0.0014s
arrdiv    : 1.1540s
arrdiv2   : 2.0780s
arrdiv3   : 0.0027s

float32 arrays
arrmult   : 1.2708s
arrmult2  : 2.4120s
arrmult3  : 0.0009s
arrdiv    : 1.5771s
arrdiv2   : 2.3843s
arrdiv3   : 0.0009s

I always thought a multiplication is computationally cheaper than a division. However, for uint8 a division seems to be nearly twice as fast. Does this somehow relate to the fact that * 0.5 has to calculate the multiplication in a float and then cast the result back to an integer?

At least for floats, multiplications seem to be faster than divisions. Is this generally true?

Why is a multiplication in uint8 more expensive than in float32? I thought an 8-bit unsigned integer should be much faster to calculate than 32-bit floats?!

Can someone "demystify" this?

EDIT: To have more data, I've included vectorized functions (as suggested) and added index iterators as well. The vectorized functions are much faster, thus not really comparable. However, if timeit_iterations is set much higher for the vectorized functions, it turns out that multiplication is faster for both uint8 and float32. I guess this confuses even more?!

Maybe multiplication is in fact always faster than division, but the main performance sink in the for-loops is not the arithmetic operation but the loop itself. However, this does not explain why the loops behave differently for different operations.

EDIT2: As @jotasi already stated, we are looking for a full explanation of division vs. multiplication and int (or uint8) vs. float (or float32). Additionally, explaining the different trends of the vectorized approaches and the iterators would be interesting: in the vectorized case the division seems to be slower, whereas it is faster in the iterator case.

The problem is your assumption that you are measuring the time needed for division or multiplication, which is not true. You are measuring the overhead surrounding a division or multiplication.

One really has to look at the exact code to explain every effect, and that can vary from version to version. This answer can only give an idea of what one has to consider.

The problem is that a simple int is not simple at all in Python: it is a real object which must be registered in the garbage collector, and it grows in size with its value. You have to pay for all of that: for example, an 8-bit integer needs 24 bytes of memory! Something similar goes for Python floats.

On the other hand, a NumPy array consists of simple C-style integers/floats without overhead. You save a lot of memory, but pay for it when accessing an element of the array. a[i] means: a Python integer must be constructed and registered in the garbage collector, and only then can it be used. That is a lot of overhead.

Consider this code:
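The boxing cost described above can be made visible with a small sketch. It assumes CPython 3 and a recent NumPy, where indexing returns a NumPy scalar object (np.uint8) rather than a built-in int; the variable names are made up for illustration, and the exact byte counts depend on the interpreter, so none are asserted here.

```python
import sys
import numpy as np

arr = np.arange(10, dtype=np.uint8)

# Inside the array buffer, each uint8 occupies exactly one byte.
bytes_per_item = arr.itemsize

# Indexing does not hand back that raw byte; it builds a new scalar object.
elem = arr[3]
elem_type = type(elem)
elem_size = sys.getsizeof(elem)  # size of the boxed object, in bytes

# A plain Python int is a full object as well.
int_size = sys.getsizeof(3)

print(bytes_per_item, elem_type.__name__, elem_size, int_size)
```

The object sizes printed are much larger than the single byte stored in the array, which is the overhead paid on every element access.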

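import numpy as np

# Python 3: range replaces the Python 2 xrange used originally
li1 = [x % 256 for x in range(10**4)]
arr1 = np.array(li1, np.uint8)

def arrmult(a):
    # multiply every element in place, one Python-level access at a time
    for i in range(len(a)):
        a[i] *= 5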

arrmult(li1) is about 25 times faster than arrmult(arr1), because the integers in the list are already Python ints and don't have to be created! The lion's share of the calculation time is spent creating the objects; everything else is almost negligible.


Let's take a look at your code, first the multiplication:
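A way to check the list-versus-array gap on your own machine might look like the following. This is only a sketch: read_all is a hypothetical helper, it uses a read-only loop (so the data is not changed between runs) instead of the in-place multiplication, and the measured ratio will differ between Python and NumPy versions.

```python
import timeit
import numpy as np

li1 = [x % 256 for x in range(10**4)]
arr1 = np.array(li1, dtype=np.uint8)

def read_all(a):
    # Read-only loop: every a[i] on the array has to build a scalar object.
    # int() keeps the running total a Python int, so uint8 cannot overflow.
    total = 0
    for i in range(len(a)):
        total += int(a[i])
    return total

t_list = timeit.timeit(lambda: read_all(li1), number=20)
t_arr = timeit.timeit(lambda: read_all(arr1), number=20)
print("list : {:.4f}s".format(t_list))
print("array: {:.4f}s".format(t_arr))
```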

def arrmult2(a):
    ...
    b[i, j] = (b[i, j] + 5) * 0.5

In the case of uint8 the following must happen (I neglect the +5 for simplicity):

  1. a Python int must be created
  2. it must be cast to a float (Python float creation) in order to do the float multiplication
  3. and cast back to a Python int and/or uint8

For float32 there is less work to do (the multiplication itself does not cost much): a Python float is created, and the result is cast back to float32.

So the float version should be faster, and it is.


Now let's take a look at the division:
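The conversion steps listed above can be observed by inspecting the intermediate types. The sketch below assumes Python 3 with a recent NumPy; the exact scalar types (for example float32 versus float64) depend on the NumPy version, so only the broad categories are commented.

```python
import numpy as np

b_u8 = np.array([[10, 20]], dtype=np.uint8)
b_f32 = np.array([[10.0, 20.0]], dtype=np.float32)

# uint8 element times a Python float: the intermediate result is a float.
tmp_u8 = (b_u8[0, 0] + 5) * 0.5
# Storing it back converts (and truncates) to the array's uint8 dtype.
b_u8[0, 0] = tmp_u8

# float32 element: the value stays floating point all the way through.
tmp_f32 = (b_f32[0, 0] + 5) * 0.5
b_f32[0, 0] = tmp_f32

print(type(tmp_u8).__name__, b_u8[0, 0], b_u8.dtype)
print(type(tmp_f32).__name__, b_f32[0, 0], b_f32.dtype)
```

In the uint8 case the value 7.5 is truncated to 7 when it is written back, which is the extra float-to-integer conversion the float32 path does not need.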

def arrdiv2(a):
    ...
    b[i, j] = (b[i, j] + 5) / 2

The pitfall here: all operations are integer operations. So, compared to the multiplication, there is no need to cast to a Python float, and we have less overhead than in the multiplication case. In your case, division is "faster" than multiplication for uint8.

However, division and multiplication are equally fast/slow for float32, because almost nothing changes in this case: we still need to create a Python float.


Now the vectorized versions: under the hood they work with C-style "raw" float32s/uint8s, without conversion (and its cost!) to the corresponding Python objects. To get meaningful results you should increase the number of iterations (right now the running time is too small to say anything with certainty).
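Whether the division really stays an integer operation depends on the Python version. The reasoning above matches Python 2 (the answer's own code uses xrange), where / between integers is integer division. Under Python 3, / is true division and returns a float, and // is the operator that keeps the result an integer. A minimal sketch, assuming Python 3:

```python
import numpy as np

b = np.array([[11, 20]], dtype=np.uint8)

true_div = (b[0, 0] + 5) / 2    # Python 3 true division: floating-point result
floor_div = (b[0, 0] + 5) // 2  # floor division: integer result

print(type(true_div).__name__, true_div)
print(type(floor_div).__name__, floor_div)
```

So on Python 3 the integer-only path described here corresponds to // 2 rather than / 2.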

  1. Division and multiplication for float32 could have the same running time, because I would expect NumPy to replace the division by 2 with a multiplication by 0.5 (but to be sure, one has to look into the code).

  2. Multiplication for uint8 should be slower, because every uint8 integer must be cast to a float before the multiplication by 0.5 and then cast back to uint8 afterwards.

  3. For the uint8 case, NumPy cannot replace the division by 2 with a multiplication by 0.5, because it is an integer division. Integer division is slower than float multiplication on a lot of architectures; this is the slowest vectorized operation.
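The dtype conversions in the vectorized uint8 case can be checked directly. The sketch below assumes Python 3 and uses // to get the integer division the list refers to. Note that an expression like (b + 5) * 0.5 produces a new float array, so the cast back to uint8 only happens if it is requested, as with astype here.

```python
import numpy as np

arr_u8 = np.array([[10, 21], [100, 200]], dtype=np.uint8)

mult = (arr_u8 + 5) * 0.5  # float multiplication: result is a float array
floor = (arr_u8 + 5) // 2  # integer division: result stays uint8

# Writing the float product back into uint8 needs an explicit cast.
mult_back = mult.astype(np.uint8)

print(mult.dtype, floor.dtype, mult_back.dtype)
```

For these small values both routes end up with the same integers; they differ in how many conversions are performed along the way.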


PS: I would not dwell too much on the cost of multiplication vs. division; there are too many other things that can have a bigger impact on performance. For example, creating unnecessary temporary objects, or, if the NumPy array is large and does not fit into the cache, memory access will be the bottleneck and you will see no difference between multiplication and division at all.

This answer only looks at vectorised operations, as the reason for the other operations being slow has been answered by ead.

A lot of "optimisations" are based on old hardware. The assumptions that made those optimisations hold true on older hardware do not hold true on newer hardware.

Pipelines and division

Division is slow. A division passes through several units, each of which has to perform its calculation one after another. This is what makes division slow.

However, in a floating-point processing unit (FPU), common on most modern CPUs, there are dedicated units arranged in a "pipeline" for the division instruction. Once a unit is done, it isn't needed for the rest of the operation. If you have several division operations, the units with nothing to do can get started on the next division. So although each operation is slow, the FPU can actually achieve a high throughput of division operations. Pipelining isn't the same as vectorisation, but the results are mostly the same: higher throughput when you have lots of the same operation to do.

Think of pipelining like traffic. Compare three lanes of traffic moving at 30 mph with one lane of traffic moving at 90 mph. Each car in the slower traffic is definitely slower individually, but the three-lane road still has the same throughput.
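The latency-versus-throughput argument can be sketched with a toy cycle count. This is an idealised model with made-up numbers, not a description of any real CPU (real division units may be only partially pipelined); the function names are hypothetical.

```python
def unpipelined_cycles(n_ops, latency):
    # Each operation must finish before the next one starts.
    return n_ops * latency

def pipelined_cycles(n_ops, latency, issue_interval=1):
    # First result after `latency` cycles, then one more every `issue_interval`.
    return latency + (n_ops - 1) * issue_interval

n_ops, latency = 1000, 10  # made-up numbers, not real CPU figures
print(unpipelined_cycles(n_ops, latency))  # 10000
print(pipelined_cycles(n_ops, latency))    # 1009
```

Each individual operation still takes 10 cycles in both cases; only the total for many independent operations shrinks.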

It's because you multiply an int by a float and store the result as an int. Try your arrmult and arrdiv tests with different integer or float values for the multiplication/division. In particular, compare multiplying by '2' with multiplying by '2.'

It's the very first operation that will typically take longer, before "warming up" (e.g. memory allocation, caching).

You can see the same effect by reversing the order of dividing and multiplying:
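The suggested comparison between 2 and 2. can be sketched as follows, here on a whole array rather than inside the loop. The values are small enough that no uint8 overflow occurs.

```python
import numpy as np

arr_u8 = np.array([1, 2, 3], dtype=np.uint8)

by_int = arr_u8 * 2     # integer factor: stays in uint8
by_float = arr_u8 * 2.  # float factor: promoted to a floating-point dtype

print(by_int.dtype, by_float.dtype)
```

The numeric values are the same in both cases; only the float factor forces a conversion away from uint8.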

>>> print_time("arrdiv", timeit.timeit("arrdiv(arr2)", "from __main__ import arrdiv, arr2", number=timeit_iterations))
>>> print_time("arrmult", timeit.timeit("arrmult(arr2)", "from __main__ import arrmult, arr2", number=timeit_iterations))

arrdiv:  3.2630s
arrmult:  2.5873s
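One common way to reduce the influence of warm-up and ordering is to repeat each measurement and keep the fastest run. The sketch below uses timeit.repeat for that; best_time, mult and div are hypothetical helpers written for this illustration, not the functions from the question.

```python
import timeit
import numpy as np

arr = (np.random.rand(100, 100) * 255).astype(np.float32)

def mult(a):
    return (a + 5) * 0.5

def div(a):
    return (a + 5) / 2

def best_time(func, data, repeat=5, number=1000):
    # Run the measurement several times and keep the fastest run,
    # which is least affected by warm-up and other one-off disturbances.
    return min(timeit.repeat(lambda: func(data), repeat=repeat, number=number))

for name, func in (("div", div), ("mult", mult)):
    print("{: <5}: {:.4f}s".format(name, best_time(func, arr)))
```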
