
SSE2 packed 8-bit integer signed multiply (high-half): decomposing an __m128i (16 × 8-bit) into two __m128i (8 × 16-bit each) and repacking

I'm trying to multiply two m128i byte per byte (8 bit signed integers).

The problem here is overflow. My solution is to widen the 8-bit signed integers to 16-bit signed integers, multiply, then pack the whole thing back into an __m128i of 16 × 8-bit integers.

Here is the __m128i mulhi_epi8(__m128i a, __m128i b) emulation I made:

inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
    auto a_decomposed = decompose_epi8(a);
    auto b_decomposed = decompose_epi8(b);

    __m128i r1 = _mm_mullo_epi16(a_decomposed.first, b_decomposed.first);
    __m128i r2 = _mm_mullo_epi16(a_decomposed.second, b_decomposed.second);

    return _mm_packs_epi16(_mm_srai_epi16(r1, 8), _mm_srai_epi16(r2, 8));
}
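Per byte, this emulation computes the following scalar operation (a reference sketch of my own, not part of the original code; the function name is made up for illustration):

```cpp
#include <cstdint>

// Scalar equivalent of one lane of mulhi_epi8: the full 16-bit product,
// arithmetically shifted right by 8. Because the product of two int8_t
// values lies in [-16256, 16384], the shifted result fits in [-64, 64],
// so the saturating pack at the end never actually clips.
inline int8_t mulhi_epi8_scalar(int8_t a, int8_t b)
{
    return static_cast<int8_t>((int16_t(a) * int16_t(b)) >> 8);
}
```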

decompose_epi8 is implemented in a non-SIMD way:

inline std::pair<__m128i, __m128i> decompose_epi8(__m128i input)
{
    std::pair<__m128i, __m128i> result;

    // result.first     =>  should contain 8 shorts in [-128, 127] (8 first bytes of the input)
    // result.second    =>  should contain 8 shorts in [-128, 127] (8 last bytes of the input)

    // note: the .m128i_i8 / .m128i_i16 members are MSVC-specific;
    // on GCC/Clang you would need a union or memcpy instead
    for (int i = 0; i < 8; ++i)
    {
        result.first.m128i_i16[i]   = input.m128i_i8[i];
        result.second.m128i_i16[i]  = input.m128i_i8[i + 8];
    }

    return result;
}

This code works well. My goal now is to implement a SIMD version of this for loop. I looked at the Intel Intrinsics Guide but can't find a way to do this. I guess a shuffle could do the trick, but I have trouble conceptualising it.

Since you want signed multiplication, you need to sign-extend each byte into a 16-bit word, or move it into the upper half of a 16-bit word. Because you pack the results back together afterwards, you can split the input into odd and even bytes instead of the lower and upper half. The odd bytes already sit in the upper half of each 16-bit word, so you can extract them by masking out the even bytes; the even bytes can be moved into the upper half by shifting each 16-bit word left by 8. Both pairs are then multiplied with _mm_mulhi_epi16, which yields exactly the product of the original bytes, since ((a << 8) * (b << 8)) >> 16 == a * b.

The following should work with SSE2:

__m128i mulhi_epi8(__m128i a, __m128i b)
{
    __m128i mask = _mm_set1_epi16((short)0xff00); // cast avoids a narrowing warning
    // keep only the higher (odd) bytes:
    __m128i a_hi = _mm_and_si128(a, mask);
    __m128i b_hi = _mm_and_si128(b, mask);

    __m128i r_hi = _mm_mulhi_epi16(a_hi, b_hi);
    // mask out garbage in lower half:
    r_hi = _mm_and_si128(r_hi, mask);

    // shift lower bytes to upper half
    __m128i a_lo = _mm_slli_epi16(a,8);
    __m128i b_lo = _mm_slli_epi16(b,8);
    __m128i r_lo = _mm_mulhi_epi16(a_lo, b_lo);
    // shift result to the lower half:
    r_lo = _mm_srli_epi16(r_lo,8);

    // join result and return:
    return _mm_or_si128(r_hi, r_lo);
}
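To sanity-check this, one can compare every lane against the scalar definition (a * b) >> 8. This standalone sketch repeats the answer's function so it compiles on its own:

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Copy of the answer's mulhi_epi8, repeated here so the check is standalone.
static __m128i mulhi_epi8(__m128i a, __m128i b)
{
    __m128i mask = _mm_set1_epi16((short)0xff00);
    __m128i a_hi = _mm_and_si128(a, mask);
    __m128i b_hi = _mm_and_si128(b, mask);
    __m128i r_hi = _mm_and_si128(_mm_mulhi_epi16(a_hi, b_hi), mask);
    __m128i r_lo = _mm_srli_epi16(
        _mm_mulhi_epi16(_mm_slli_epi16(a, 8), _mm_slli_epi16(b, 8)), 8);
    return _mm_or_si128(r_hi, r_lo);
}

// Compare all 16 lanes against the scalar reference (a * b) >> 8.
static bool check_mulhi_epi8(const int8_t* a, const int8_t* b)
{
    int8_t r[16];
    _mm_storeu_si128((__m128i*)r,
        mulhi_epi8(_mm_loadu_si128((const __m128i*)a),
                   _mm_loadu_si128((const __m128i*)b)));
    for (int i = 0; i < 16; ++i)
        if (r[i] != (int8_t)((int16_t(a[i]) * int16_t(b[i])) >> 8))
            return false;
    return true;
}
```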

Note: a previous version of this answer used arithmetic shifts to sign-extend the odd bytes. On most Intel CPUs this would increase pressure on port 0, which the multiplications need as well. Bit-logic can execute on more ports, so this version should have better throughput.
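If you would rather keep the decompose-then-pack structure from the question, the decompose_epi8 loop itself can also be vectorized with plain SSE2 by interleaving each byte with a mask of its sign bits. A sketch (the function name is my own, not from the original):

```cpp
#include <emmintrin.h>  // SSE2
#include <utility>

// SSE2 sign-extension of 16 bytes into two vectors of 8 shorts each:
// _mm_cmpgt_epi8(0, x) yields 0xFF for negative bytes and 0x00 otherwise,
// and interleaving each byte with that mask reproduces two's-complement
// sign extension (the same result _mm_cvtepi8_epi16 gives on SSE4.1).
inline std::pair<__m128i, __m128i> decompose_epi8_sse2(__m128i input)
{
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), input);
    return { _mm_unpacklo_epi8(input, sign),    // bytes 0..7  -> 8 shorts
             _mm_unpackhi_epi8(input, sign) };  // bytes 8..15 -> 8 shorts
}
```

This drops straight into the question's mulhi_epi8 as a replacement for the scalar loop, though the odd/even-split version above needs fewer instructions overall.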
