
Filling xmm register bytewise

I need to calculate the average of 32 uint8_t values stored in one array. For performance reasons I wanted to change the code below to use the pavgb instruction and the xmm registers. The problem is that I cannot copy 16 bytes at once using movdqu, because I do some calculations within a loop to get the values to average. The code below is a simplified version of the actual code I'm using.

;
; void average(uint8_t *res, uint8_t *input)
;    rdi = res   | res holds 16 values
;    rsi = input | input holds 32 values
;
segment .text
    global average

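average:
    push rbx                    ; rbx is callee-saved in the System V ABI
    mov rcx, 0
    xor rax, rax
    xor rbx, rbx
.loop:
    mov al, [rsi + rcx]
    cmp al, 16
    jge .endif                  ; signed compare: bytes >= 0x80 count as negative
    add al, 16

.endif:
    mov bl, [rsi + rcx + 16]
    cmp bl, 16
    jge .endif2
    add bl, 16

.endif2:
    add ax, bx                  ; operands must be the same size (bh is always 0)
    shr ax, 1

    mov [rdi], al

    inc rdi                     ; rsi stays fixed; rcx already indexes the input
    inc rcx

    cmp rcx, 16
    jl .loop

    pop rbx
    ret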

I want to change the code to work with the xmm registers, so that in the end I can do something like this:

pavgb   xmm0, xmm1
movdqu  [rdi], xmm0

I need to fill the xmm0 and xmm1 registers bytewise. Is there a way to make this work?

There isn't much point in switching to the pavgb instruction, since the extra work needed to set it up far exceeds the performance benefit of using pavgb in the first place. Your existing code is fine.

Even with an optimized SSE version, the function is so short that the performance will probably be swamped by the function call overhead.

To get a performance win, you probably need to use intrinsics so that the compiler can understand the code and incorporate it into its own optimizations (e.g., inlining).

#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

void average(uint8_t *res, uint8_t *input)
{
    auto boundary = _mm_set1_epi8(0x10);

    // Process the first half: add 16 to every byte that compares below 16
    // (signed compare, matching the jge in the assembly)
    auto part1 = _mm_loadu_si128((__m128i *)input);
    auto adjust1 = _mm_and_si128(_mm_cmpgt_epi8(boundary, part1), boundary);
    auto adjusted1 = _mm_add_epi8(part1, adjust1);

    // Process the second half
    auto part2 = _mm_loadu_si128((__m128i *)(input + 16));
    auto adjust2 = _mm_and_si128(_mm_cmpgt_epi8(boundary, part2), boundary);
    auto adjusted2 = _mm_add_epi8(part2, adjust2);

    // Average them together (pavgb rounds up: (a + b + 1) >> 1)
    auto result = _mm_avg_epu8(adjusted1, adjusted2);

    // Save the answer
    _mm_storeu_si128((__m128i *)res, result);
}
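
To make the per-byte behaviour of the vector code easier to check, here is a rough scalar model of a single byte lane. The helper names (adjust_lane, avg_lane, average_scalar) are made up for illustration and are not part of any library. It assumes the signed byte compare used above, so bytes of 0x80 and up also get the +16, and it uses the round-up average that pavgb performs, which differs from the truncating shr in the assembly loop.

```cpp
#include <cstdint>

// Scalar model of one byte lane of the SSE2 version (hypothetical helper names).
// The vector compare treats bytes as SIGNED, like jge in the assembly, so a byte
// is "below 16" when it is 0..15 or when it is 0x80..0xFF (negative as int8).
inline std::uint8_t adjust_lane(std::uint8_t v) {
    const bool below = (v < 16) || (v >= 0x80);
    const std::uint8_t mask = below ? 0xFF : 0x00;          // all-ones or all-zeros, like pcmpgtb
    return static_cast<std::uint8_t>(v + (mask & 0x10));    // wraps modulo 256, like paddb
}

// pavgb / _mm_avg_epu8 rounds up: (a + b + 1) >> 1, computed without overflow.
inline std::uint8_t avg_lane(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a + b + 1) >> 1);
}

// Whole-function reference: res[i] = avg(adjust(input[i]), adjust(input[i + 16])).
inline void average_scalar(std::uint8_t *res, const std::uint8_t *input) {
    for (int i = 0; i < 16; ++i) {
        res[i] = avg_lane(adjust_lane(input[i]), adjust_lane(input[i + 16]));
    }
}
```

Running average_scalar and the intrinsics version over the same 32 input bytes and comparing the 16 outputs is a cheap way to catch a mistake in the vector code.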

For better performance, you probably want the function to return a __m128i directly so that the caller can compute with it immediately rather than have to read the result out of memory.
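
A minimal sketch of that idea, assuming an x86 target with SSE2. The name average_vec is hypothetical, and the store-to-memory version is kept only as a thin wrapper for callers that still want the bytes in an array:

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical variant: same computation, but the result stays in a register.
static inline __m128i average_vec(const std::uint8_t *input) {
    const __m128i boundary = _mm_set1_epi8(0x10);
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input + 16));
    // Add 16 to every byte that compares (signed) below 16.
    a = _mm_add_epi8(a, _mm_and_si128(_mm_cmpgt_epi8(boundary, a), boundary));
    b = _mm_add_epi8(b, _mm_and_si128(_mm_cmpgt_epi8(boundary, b), boundary));
    return _mm_avg_epu8(a, b);  // rounds up: (a + b + 1) >> 1
}

// Thin wrapper for callers that want the 16 result bytes in memory.
static inline void average(std::uint8_t *res, const std::uint8_t *input) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(res), average_vec(input));
}
```

Because both functions are static inline, the compiler is free to inline them into the caller, which is where most of the benefit is likely to come from.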
