SSE2 packed 8-bit integer signed multiply (high-half): decomposing an __m128i (16 x 8-bit) into two __m128i (8 x 16-bit each) and repacking

I'm trying to multiply two __m128i values byte by byte (8-bit signed integers).

The problem here is overflow. My solution is to widen these 8-bit signed integers to 16-bit signed integers, multiply, then pack the results back into an __m128i of 16 x 8-bit integers.

Here is the __m128i mulhi_epi8(__m128i a, __m128i b) emulation I made:

inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
    auto a_decomposed = decompose_epi8(a);
    auto b_decomposed = decompose_epi8(b);

    __m128i r1 = _mm_mullo_epi16(a_decomposed.first, b_decomposed.first);
    __m128i r2 = _mm_mullo_epi16(a_decomposed.second, b_decomposed.second);

    return _mm_packs_epi16(_mm_srai_epi16(r1, 8), _mm_srai_epi16(r2, 8));
}

decompose_epi8 is implemented in a non-SIMD way:

inline std::pair<__m128i, __m128i> decompose_epi8(__m128i input)
{
    std::pair<__m128i, __m128i> result;

    // result.first     =>  should contain 8 shorts in [-128, 127] (8 first bytes of the input)
    // result.second    =>  should contain 8 shorts in [-128, 127] (8 last bytes of the input)

    for (int i = 0; i < 8; ++i)
    {
        result.first.m128i_i16[i]   = input.m128i_i8[i];
        result.second.m128i_i16[i]  = input.m128i_i8[i + 8];
    }

    return result;
}

This code works well. My goal now is to implement a SIMD version of this for loop. I looked at the Intel Intrinsics Guide but couldn't find a way to do it. I guess a shuffle could do the trick, but I have trouble conceptualising it.

As you want to do signed multiplication, you need to either sign-extend each byte to a 16-bit word or move it into the upper half of a 16-bit word. Since you pack the results back together afterwards, you can split the input into odd and even bytes instead of the upper and lower halves. The odd-indexed bytes already sit in the upper half of each 16-bit element, so you can extract them by masking out the even bytes. To bring the even bytes into the upper half, shift every 16-bit element left by 8. Both sets are then multiplied with _mm_mulhi_epi16.
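The arithmetic behind the upper-half trick can be spelled out as follows. The symbols a, b (the signed byte values) and A, B (the same values placed in the upper half of a 16-bit element) are introduced here only for illustration, and mulhi stands for the per-element behaviour of _mm_mulhi_epi16:

```latex
\begin{aligned}
A &= 2^{8} a, \qquad B = 2^{8} b, \qquad a, b \in [-128, 127] \\
\operatorname{mulhi}(A, B) &= \left\lfloor \frac{A \cdot B}{2^{16}} \right\rfloor
  = \left\lfloor \frac{2^{16}\, a b}{2^{16}} \right\rfloor = a b \\
|a b| &\le 2^{14} \quad\Rightarrow\quad a b \text{ fits in a signed 16-bit element} \\
\text{wanted byte} &= \left\lfloor \frac{a b}{2^{8}} \right\rfloor
  \quad \text{(the upper byte of that 16-bit product)}
\end{aligned}
```

So each 16-bit element holds the full product after the multiplication, and only its upper byte needs to be kept: left in place for the odd bytes, shifted down by 8 for the even bytes.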

The following should work with SSE2:

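#include <emmintrin.h> // SSE2 intrinsics

__m128i mulhi_epi8(__m128i a, __m128i b)
{
    // 0xff00 in every 16-bit element; the cast avoids an int-to-short narrowing warning
    __m128i mask = _mm_set1_epi16((short)0xff00);
    // keep the odd bytes, which already sit in the upper half of each 16-bit element:
    __m128i a_hi = _mm_and_si128(a, mask);
    __m128i b_hi = _mm_and_si128(b, mask);

    // with both operands scaled by 256, the high 16 bits are the full product a*b
    __m128i r_hi = _mm_mulhi_epi16(a_hi, b_hi);
    // mask out the low byte of the product, keeping its high byte in place:
    r_hi = _mm_and_si128(r_hi, mask);

    // shift the even bytes to the upper half:
    __m128i a_lo = _mm_slli_epi16(a, 8);
    __m128i b_lo = _mm_slli_epi16(b, 8);
    __m128i r_lo = _mm_mulhi_epi16(a_lo, b_lo);
    // shift the high byte of the product down to the lower half:
    r_lo = _mm_srli_epi16(r_lo, 8);

    // join the odd and even results and return:
    return _mm_or_si128(r_hi, r_lo);
}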
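One way to gain confidence in the routine above is to compare it against a scalar reference for every pair of signed bytes. The following is a minimal sketch of such a check, not part of the original answer: the helper names mulhi_i8_scalar and count_mismatches are hypothetical, the SSE2 routine is repeated in compact form so the block compiles on its own, and the scalar reference assumes that right-shifting a negative int is an arithmetic shift (true on mainstream compilers, guaranteed from C++20).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Compact copy of the SSE2 routine from the answer.
static inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
    const __m128i mask = _mm_set1_epi16(static_cast<short>(0xff00));
    __m128i r_hi = _mm_and_si128(
        _mm_mulhi_epi16(_mm_and_si128(a, mask), _mm_and_si128(b, mask)), mask);
    __m128i r_lo = _mm_srli_epi16(
        _mm_mulhi_epi16(_mm_slli_epi16(a, 8), _mm_slli_epi16(b, 8)), 8);
    return _mm_or_si128(r_hi, r_lo);
}

// Hypothetical scalar reference: high byte of the 16-bit signed product.
// Assumes >> on a negative int is an arithmetic shift.
static inline std::int8_t mulhi_i8_scalar(std::int8_t a, std::int8_t b)
{
    int product = int(a) * int(b);
    return static_cast<std::int8_t>(product >> 8);
}

// Hypothetical checker: runs every (a, b) byte pair through both versions
// and returns the number of lanes where they disagree.
int count_mismatches()
{
    int mismatches = 0;
    for (int a = -128; a <= 127; ++a) {
        for (int b0 = -128; b0 <= 127; b0 += 16) {
            // rot = 1 rotates the b values by one lane, so that every pair
            // is exercised in both an even and an odd byte position.
            for (int rot = 0; rot < 2; ++rot) {
                std::int8_t va[16], vb[16], out[16];
                for (int i = 0; i < 16; ++i) {
                    va[i] = static_cast<std::int8_t>(a);
                    vb[i] = static_cast<std::int8_t>(b0 + ((i + rot) & 15));
                }
                __m128i r = mulhi_epi8(
                    _mm_loadu_si128(reinterpret_cast<const __m128i*>(va)),
                    _mm_loadu_si128(reinterpret_cast<const __m128i*>(vb)));
                _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
                for (int i = 0; i < 16; ++i)
                    if (out[i] != mulhi_i8_scalar(va[i], vb[i]))
                        ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a test or from main and checking that it returns 0 covers all 65536 input pairs in both lane parities.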

Note: a previous version used shifts to sign-extend the odd bytes. On most Intel CPUs this would increase pressure on port 0 (P0), which is also needed for the multiplication. Bitwise logic can execute on more ports, so this version should have better throughput.
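If the original structure with decompose_epi8 is preferred, the scalar loop from the question can also be vectorized directly. A minimal sketch, assuming only SSE2: interleaving the input with itself and then shifting arithmetically right by 8 sign-extends each byte. The name decompose_epi8_sse2 is hypothetical and chosen to avoid clashing with the question's function.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <utility>      // std::pair

// Sketch: sign-extend 16 signed bytes into two vectors of 8 signed 16-bit values.
// result.first  holds bytes 0..7 of the input, result.second holds bytes 8..15.
inline std::pair<__m128i, __m128i> decompose_epi8_sse2(__m128i input)
{
    // Interleaving the input with itself copies each byte into both halves
    // of a 16-bit element, so the byte of interest ends up in the upper half.
    __m128i lo = _mm_unpacklo_epi8(input, input);
    __m128i hi = _mm_unpackhi_epi8(input, input);

    // An arithmetic right shift by 8 moves that byte down and fills the
    // upper half with copies of its sign bit.
    return { _mm_srai_epi16(lo, 8), _mm_srai_epi16(hi, 8) };
}
```

This keeps the question's split into lower and upper halves, so it can be dropped in front of _mm_mullo_epi16 and _mm_packs_epi16 unchanged. Where SSE4.1 is available, _mm_cvtepi8_epi16 performs the same sign extension for the low eight bytes in a single instruction.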
