
How to check overflow for multiplication of 16 bit integers in SSE?

I want to implement a simple function in SSE (a program like the Izhikevich spiking neuron model). It should work with 16-bit signed integers (8.8 fixed point), and it needs to check for overflow during an integration step and set an SSE mask if overflow occurred:

// initialized like following:
short I = 0x1BAD; // current injected to neuron
short vR = 0xF00D; // some reset threshold when spiked (negative)

// step to be vectorized:
short v0 = vR;
for(;;) {

    // v0*v0/16 likely overflows => use 32 bit (16.16)
    short v0_sqr = ((int)v0)*((int)v0) / (1<<(8+4)); // not sure how "(v0*v0)>>(8+4)" would affect sign..
     // or   ((int)v0)*((int)v0) >> (8+4); // arithmetic right shift
     // original paper used v' = (v0^2)/25 + ...

    short v1 = v0_sqr + v0 + I;
    int m; // mask is set when neuron fires
    if(v1_overflows_during_this_operation()) { // "v1 > 0x7FFF" - way to detect?
        m=0xFFFFFFFF;
    } else
        m=0;
    v0 = ( v1 & ~m ) | (vR & m );
}

But I haven't found a _mm_mul_epi16() instruction that would let me check the high word of the multiplication. Why is that, and how is something like v1_overflows_during_this_operation() supposed to be implemented in SSE?

Unlike 32x32 => 64, there is no widening 16x16 => 32 SSE multiplication instruction.

Instead, there's _mm_mulhi_epi16 and _mm_mulhi_epu16, which give you just the signed or unsigned upper half of the full result.

(And _mm_mullo_epi16, which does a packed 16x16 => 16-bit low-half truncating multiply; that is the same for signed or unsigned.)

You could use _mm_unpacklo/hi_epi16 to interleave the low and high halves (from _mm_mullo_epi16 and _mm_mulhi_epi16) into a pair of vectors with 32-bit elements, but that would be pretty slow. But yes, you could arithmetic right-shift that by 12 with _mm_srai_epi32(v, 8+4) and then re-pack, maybe with _mm_packs_epi32 (signed saturation back to 16-bit). Then I guess check for saturation?
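The widening route described above can be sketched roughly as follows, using only SSE2. The function name `square_asr12_widened` and the `saturated` output mask are hypothetical, introduced here for illustration; the mask only reports that the squared term did not fit in 16 bits, not that the full `v1` sum overflowed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per-lane (v0*v0) >> 12 for eight signed 16-bit lanes,
   computed at 32-bit width and packed back with signed saturation.
   *saturated gets 0xFFFF in every lane whose result did not fit in 16 bits. */
static __m128i square_asr12_widened(__m128i v0, __m128i *saturated)
{
    __m128i lo = _mm_mullo_epi16(v0, v0);     /* low 16 bits of each product   */
    __m128i hi = _mm_mulhi_epi16(v0, v0);     /* high 16 bits (signed product) */

    /* Interleave low/high halves to form the full 32-bit products. */
    __m128i p0 = _mm_unpacklo_epi16(lo, hi);  /* products of lanes 0..3 */
    __m128i p1 = _mm_unpackhi_epi16(lo, hi);  /* products of lanes 4..7 */

    p0 = _mm_srai_epi32(p0, 8 + 4);
    p1 = _mm_srai_epi32(p1, 8 + 4);

    /* A square is never negative, so only the upper bound needs checking. */
    const __m128i max16 = _mm_set1_epi32(32767);
    __m128i m0 = _mm_cmpgt_epi32(p0, max16);
    __m128i m1 = _mm_cmpgt_epi32(p1, max16);
    *saturated = _mm_packs_epi32(m0, m1);     /* -1 or 0 packs to 0xFFFF or 0 */

    return _mm_packs_epi32(p0, p1);           /* saturates instead of wrapping */
}
```

This costs two multiplies, two unpacks, two shifts and the packs for every vector of eight inputs, which is why it is the slow option.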


Your use case is unusual. There's _mm_mulhrs_epi16 (SSSE3), which gives you the high 17 bits of the product, rounded off and then truncated to 16 bits; in effect (a*b + 0x4000) >> 15 per element. (See the description.) That's useful for some fixed-point algorithms where inputs are scaled to put the result in the upper half, and you want to round using the low half instead of truncating.

You might actually use _mm_mulhrs_epi16 or _mm_mulhi_epi16 as your best bet for keeping the most precision, maybe by left-shifting your v0 before squaring to just the point where the high half gives you (v0*v0) >> (8+4).

So do you think it is easier not to allow the result to overflow, and just generate the mask with _mm_cmpge_epi16(v1, vThreshold), as the author does in the original paper?

Hell yes! Gaining another bit or two of precision would cost you maybe a factor of 2 in performance, because you'd have to compute another multiply result to check for overflow, or effectively widen to 32-bit (cutting the number of elements per vector in half), as described above.

With a compare result, v0 = ( v1 & ~m ) | (vR & m ); becomes an SSE4.1 blend: _mm_blendv_epi8.
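A minimal sketch of the compare-and-reset step, written with SSE2 only so it builds without extra compiler flags. The helper name `reset_if_spiked` and the `fired` output are hypothetical. Note that SSE2 provides `_mm_cmpgt_epi16`, `_mm_cmplt_epi16` and `_mm_cmpeq_epi16` but no `_mm_cmpge_epi16`, so this sketch fires on `v1 > vThreshold`; for a greater-or-equal test, compare against `vThreshold - 1` instead (assuming that does not wrap).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: lanes where v1 > vThreshold are replaced by vR.
   *fired receives the per-lane mask (0xFFFF where the neuron fired). */
static __m128i reset_if_spiked(__m128i v1, __m128i vR,
                               __m128i vThreshold, __m128i *fired)
{
    __m128i m = _mm_cmpgt_epi16(v1, vThreshold);  /* signed 16-bit compare */
    *fired = m;

    /* v0 = (v1 & ~m) | (vR & m); _mm_andnot_si128(a, b) computes (~a) & b. */
    return _mm_or_si128(_mm_andnot_si128(m, v1), _mm_and_si128(m, vR));
}
```

With SSE4.1 available, the last statement could be replaced by `_mm_blendv_epi8(v1, vR, m)`, which picks bytes from its second argument wherever the mask byte has its top bit set.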


If your vThreshold has 2 unset bits at the top, you have room to left shift without losing any of the most-significant bits. Since mulhi gives you (v0*v0) >> 16, you can do this:

// losing the high 2 bits of v0
__m128i v0_lshift2   = _mm_slli_epi16(v0, 2);    // left by 2 before squaring
__m128i v0_sqr_asr12 = _mm_mulhi_epi16(v0_lshift2, v0_lshift2);
__m128i v1 = _mm_add_epi16(v0, I);
        v1 = _mm_add_epi16(v1, v0_sqr_asr12);

    // v1 = (((v0<<2) * (int)(v0<<2)) >> 16) + v0 + I

    // v1 = ((v0*(int)v0) >> 12) + v0 + I

Left shift by 2 before squaring is the same as left shift by 4 after squaring (of the full 32-bit result). It puts the 16 bits we want into the high 16 exactly.

But this is unusable if your v0 is so close to full range that you'd potentially overflow when left-shifting.
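As a rough check of the shift-then-mulhi trick, here is a sketch with hypothetical helper names. It assumes `v0` stays within -8192..8191, which is exactly the range where `v0 << 2` still fits in a signed 16-bit lane; `lshift2_overflows` is one possible way to flag lanes outside that range.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: (v0*v0) >> 12 per lane via the high-half multiply.
   Only valid while -8192 <= v0 <= 8191, so that v0 << 2 fits in 16 bits. */
static __m128i square_asr12_mulhi(__m128i v0)
{
    __m128i v0_lshift2 = _mm_slli_epi16(v0, 2);
    return _mm_mulhi_epi16(v0_lshift2, v0_lshift2);  /* (16 * v0^2) >> 16 */
}

/* Hypothetical helper: 0xFFFF in lanes where v0 << 2 would lose high bits.
   Shifts left, shifts back arithmetically, and checks for a round trip. */
static __m128i lshift2_overflows(__m128i v0)
{
    __m128i roundtrip = _mm_srai_epi16(_mm_slli_epi16(v0, 2), 2);
    __m128i same = _mm_cmpeq_epi16(roundtrip, v0);
    return _mm_xor_si128(same, _mm_set1_epi16(-1));  /* invert the mask */
}
```

Within the valid range the result is exact, because (16 * v0^2) >> 16 equals floor(v0^2 / 4096).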

Otherwise, you can lose 6 low bits of v0 before multiplying

Rounding towards -Infinity with an arithmetic right shift loses 6 bits of precision, but the shift itself can't overflow.

// losing the low 6 bits of v0
__m128i v0_asr6 = _mm_srai_epi16(v0, 6);
__m128i v0_sqr_asr12 = _mm_mullo_epi16(v0_asr6, v0_asr6);
__m128i v1 = _mm_add_epi16(v0, I);
        v1 = _mm_add_epi16(v1, v0_sqr_asr12);

    // v1 =  ((v0>>6) * (int)(v0>>6)) + v0 + I

    // v1 ~= ((v0*(int)v0) >> 12) + v0 + I

I think you're losing more precision this way, so it's probably better to set vThreshold small enough that you have enough headroom to use high-half multiplies. This way also has maybe-worse rounding.
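To get a feel for the precision loss, here is a small sketch of the shift-first variant on its own (the helper name is hypothetical), evaluated on the two constants from the question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: approximate (v0*v0) >> 12 by dropping the low 6 bits
   of v0 first. The arithmetic shift rounds towards -Infinity. */
static __m128i square_asr12_lowbits(__m128i v0)
{
    __m128i v0_asr6 = _mm_srai_epi16(v0, 6);   /* floor(v0 / 64) */
    return _mm_mullo_epi16(v0_asr6, v0_asr6);  /* low 16 bits of the square */
}
```

For v0 = 0x1BAD (7085) this gives 110 * 110 = 12100, while the exact (v0*v0) >> 12 is 12255. For v0 = (short)0xF00D (-4083) it gives (-64) * (-64) = 4096 against an exact 4070, so the rounding direction makes the error depend on the sign of v0.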

pmulhrsw to round instead of truncate might be even better, if we can set up for it as efficiently. But I don't think we can, because it effectively right-shifts by 15 rather than 16, so the total left shift needed is 3, an odd number that can't be split evenly between the two inputs. I think we'd need to make 2 separate inputs, one v0_lshift2 and one only left shifted by 1.
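The two-input idea can be checked with a scalar model of a single pmulhrsw lane. This is a sketch, not SIMD code: `mulhrs_model` and `square_asr12_rounded` are hypothetical names, and the model assumes `>>` on a negative `int32_t` is an arithmetic shift (true for GCC and Clang, implementation-defined in ISO C). In vector form the same computation would presumably be `_mm_mulhrs_epi16(_mm_slli_epi16(v0, 2), _mm_slli_epi16(v0, 1))` from `<tmmintrin.h>` (SSSE3).

```c
#include <stdint.h>

/* Scalar model of one pmulhrsw lane: ((a*b >> 14) + 1) >> 1,
   i.e. the 32-bit product divided by 2^15 with rounding. */
static int16_t mulhrs_model(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (int16_t)(((p >> 14) + 1) >> 1);
}

/* Hypothetical helper: rounded v0*v0 / 4096 using one input shifted left
   by 2 and the other by 1 (total shift 3, since 15 - 12 = 3).
   Assumes -8192 <= v0 <= 8191 so that v0 * 4 fits in 16 bits. */
static int16_t square_asr12_rounded(int16_t v0)
{
    int16_t a = (int16_t)(v0 * 4);  /* v0 << 2 */
    int16_t b = (int16_t)(v0 * 2);  /* v0 << 1 */
    return mulhrs_model(a, b);
}
```

For example, v0 = 200 gives 10 here (40000 / 4096 is about 9.77), where the truncating mulhi version gives 9.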
