numpy around/rint slow compared to astype(int)

So if I have something like x=np.random.rand(60000)*400-200, IPython's %timeit says:

  • x.astype(int) takes 0.14ms
  • np.rint(x) and np.around(x) take 1.01ms

Note that in the rint and around cases you still need to spend the extra 0.14ms on a final astype(int) (assuming that's what you ultimately want).

Question: am I right in thinking that most modern hardware is capable of doing both operations in equal time? If so, why is numpy taking 8 times longer for the rounding?

As it happens I'm not super fussy about the exactness of the arithmetic, but I can't see how to take advantage of that with numpy (I'm doing messy biology, not particle physics).

np.around(x).astype(int) and x.astype(int) don't produce the same values. The former rounds to the nearest integer, with ties going to even (apart from exact .5 ties it is essentially ((x>=0)*(x+0.5) + (x<0)*(x-0.5)).astype(int)), whereas the latter rounds towards zero. However,

y = np.trunc(x).astype(int)
z = x.astype(int)

shows y==z, but calculating y is much slower. So it's the np.trunc and np.around functions that are slow.
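The difference between the three conversions is easiest to see on a handful of values that include exact .5 ties and negative numbers. This is only a small illustrative sketch; the sample values are made up for the demonstration.

```python
import numpy as np

# Sample values chosen to include exact .5 ties and negative numbers.
x = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 2.7, -2.7])

nearest = np.around(x).astype(int)    # round to nearest, ties go to even
toward_zero = x.astype(int)           # drop the fractional part
via_trunc = np.trunc(x).astype(int)   # same values, via a float64 intermediate

print(nearest)      # [-2 -2  0  0  2  2  3 -3]
print(toward_zero)  # [-2 -1  0  0  1  2  2 -2]
print(np.array_equal(via_trunc, toward_zero))  # True
```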

In [165]: x.dtype
Out[165]: dtype('float64')
In [168]: y.dtype
Out[168]: dtype('int64')

So np.trunc(x) rounds towards zero from double to double. Then astype(int) has to convert double to int64.

Internally I don't know what python or numpy are doing, but I know how I would do this in C. Let's discuss some hardware. With SSE4.1 it's possible to do round, floor, ceil, and trunc from double to double using:

_mm_round_pd(a, 0); //round: round even
_mm_round_pd(a, 1); //floor: round towards minus infinity
_mm_round_pd(a, 2); //ceil:  round towards positive infinity
_mm_round_pd(a, 3); //trunc: round towards zero

but numpy needs to support systems without SSE4.1 as well, so it would have to build both without and with SSE4.1 and then use a dispatcher.

But doing this from double directly to int64 using SSE/AVX is not efficient until AVX512. However, it is possible to round double to int32 efficiently using only SSE2:

_mm_cvtpd_epi32(a);  //round double to int32 then expand to int64
_mm_cvttpd_epi32(a); //trunc double to int32 then expand to int64

These convert two doubles to two int32 values, which can then be widened to int64.

In your case this would work fine since the range is certainly within int32. But unless python knows the range fits in int32 it can't assume this, so it would have to round or trunc to int64, which is slow. Also, once again numpy would have to be built with SSE2 support to do this anyway.
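On the numpy side, the caller can supply the knowledge that the data fits in int32 by checking the range and requesting the narrower type explicitly. The sketch below uses a hypothetical helper, truncate_to_int32, that is not part of numpy; it assumes a non-empty float array, and whether the cast ends up using the SSE2 instruction above depends on how numpy was built.

```python
import numpy as np

def truncate_to_int32(x):
    """Hypothetical helper: truncate towards zero into int32 after a range check."""
    info = np.iinfo(np.int32)
    # Reject NaN/inf and anything outside the int32 range before casting.
    if not np.isfinite(x).all() or x.min() < info.min or x.max() > info.max:
        raise ValueError("values do not fit in int32")
    return x.astype(np.int32)

x = np.random.rand(60000) * 400 - 200
y = truncate_to_int32(x)
print(y.dtype)  # int32
```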

But maybe you could have used a single-precision floating-point array to begin with. In that case you could have done:

_mm_cvtps_epi32(a); //round single to int32
_mm_cvttps_epi32(a); //trunc single to int32

These convert four singles to four int32.

So to answer your question: SSE2 can round or truncate from double to int32 efficiently. AVX512 will be able to round or truncate from double to int64 efficiently as well, using _mm512_cvtpd_epi64(a) or _mm512_cvttpd_epi64(a). SSE4.1 can round/trunc/floor/ceil from float to float or double to double efficiently.
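In numpy terms, the single-precision route would look roughly like the following. This is a sketch only: it assumes float32 precision is acceptable for the data, and it makes no claim about which instructions numpy actually emits for these casts.

```python
import numpy as np

# Same value range as in the question, but stored as single precision.
x = (np.random.rand(60000) * 400 - 200).astype(np.float32)

rounded = np.rint(x).astype(np.int32)   # round to nearest, then convert
truncated = x.astype(np.int32)          # truncate towards zero

print(rounded.dtype, truncated.dtype)
```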

As pointed out by @jme in the comments, the rint and around functions must work out whether to round the fractions up or down to the nearest integer. In contrast, the astype function always truncates towards zero, so it can immediately discard the fractional part. There are a number of other functions that do the same thing. Also, you could improve the speed by using fewer bits for the integer. However, you must be careful that the chosen type can accommodate the full range of your input data.

%%timeit
np.int8(x)
10000 loops, best of 3: 165 µs per loop

Note, this does not store values outside the range -128 to 127 as it's 8-bit. Some values in your example fall outside this range.

Of all the others I tried, np.intc seems to be the fastest:
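One way to avoid silently overflowing a narrow type is to pick the integer width from the data. The helper below, smallest_int_dtype, is hypothetical (not a numpy function) and only a sketch: it assumes a non-empty array and falls back to int64 without checking that the values fit.

```python
import numpy as np

def smallest_int_dtype(x):
    """Hypothetical helper: narrowest signed integer dtype that holds trunc(x)."""
    lo, hi = np.trunc(x.min()), np.trunc(x.max())
    for dtype in (np.int8, np.int16, np.int32):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    # Fallback; the int64 range itself is not checked here.
    return np.int64

x = np.random.rand(60000) * 400 - 200
dtype = smallest_int_dtype(x)
y = x.astype(dtype)
print(dtype, y.min(), y.max())
```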

%%timeit
np.int16(x)
10000 loops, best of 3: 186 µs per loop

%%timeit
np.intc(x)
10000 loops, best of 3: 169 µs per loop

%%timeit
np.int0(x)
10000 loops, best of 3: 170 µs per loop

%%timeit
np.int_(x)
10000 loops, best of 3: 188 µs per loop

%%timeit
np.int32(x)
10000 loops, best of 3: 187 µs per loop

%%timeit
np.trunc(x)
1000 loops, best of 3: 940 µs per loop

Your examples, on my machine:

%%timeit
np.around(x)
1000 loops, best of 3: 1.48 ms per loop

%%timeit
np.rint(x)
1000 loops, best of 3: 1.49 ms per loop

%%timeit
x.astype(int)
10000 loops, best of 3: 188 µs per loop
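To repeat these measurements outside IPython, something like the following script should work. It is a sketch using the standard timeit module; the repeat counts are arbitrary, and the numbers depend on the NumPy version and hardware, so they may not show the same gap as above.

```python
import timeit
import numpy as np

x = np.random.rand(60000) * 400 - 200

# Operations to compare; each lambda runs one conversion on the same array.
candidates = {
    "x.astype(int)": lambda: x.astype(int),
    "x.astype(np.int32)": lambda: x.astype(np.int32),
    "np.trunc(x)": lambda: np.trunc(x),
    "np.rint(x)": lambda: np.rint(x),
    "np.around(x)": lambda: np.around(x),
}

calls = 100  # calls per timing run (arbitrary choice)
results = {}
for label, func in candidates.items():
    # Best of 3 runs, converted to microseconds per call.
    best = min(timeit.repeat(func, number=calls, repeat=3)) / calls
    results[label] = best * 1e6
    print(f"{label:20s} {results[label]:8.1f} µs per call")
```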
