Filling xmm register bytewise

I need to calculate the average of 32 uint8_t values stored in one array. For performance reasons I wanted to change the code below to use the pavgb instruction and the xmm registers. The problem is that I cannot copy 16 bytes at once using movdqu, because I do some calculations within a loop to get the values to average. The code below is a simplified version of the actual code I'm using.

;
; void average(uint8_t *res, uint8_t *input)
;    rdi = res   | res holds 16 values
;    rsi = input | input holds 32 values
;
segment .text
    global average

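average:
    xor ecx, ecx                ; rcx = loop index
    xor eax, eax
    xor edx, edx                ; rdx is caller-saved, unlike rbx
.loop:
    mov al, [rsi + rcx]
    cmp al, 16
    jge .endif                  ; signed compare, as in the original
    add al, 16

.endif:
    mov dl, [rsi + rcx + 16]
    cmp dl, 16
    jge .endif2
    add dl, 16

.endif2:
    add ax, dx                  ; operands must be the same size
    shr ax, 1

    mov [rdi + rcx], al         ; index with rcx only, pointers stay fixed

    inc rcx

    cmp rcx, 16
    jl .loop
    ret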

I want to change the code to work with the xmm registers, so that in the end I can do something like this:

pavgb   xmm0, xmm1
movdqu  [rdi], xmm0

For that I need to fill the xmm0 and xmm1 registers bytewise. Is there a way to make this work?

There's not really any point in using the pavgb instruction here, since the extra work needed to set up its operands bytewise far exceeds the performance benefit of pavgb itself. Your existing code is fine.
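
To make that setup cost concrete, here is a minimal sketch of one way to fill the registers bytewise: stage the computed bytes in an aligned stack buffer, then load each buffer with a single instruction. The names `adjust` and `average_staged` are hypothetical, and `adjust` is only an assumption about the per-element calculation, mirroring the signed compare in the question's loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical stand-in for the per-element work done in the loop.
// Mirrors the question's signed compare: add 16 when the byte,
// read as a signed value, is below 16.
static inline uint8_t adjust(uint8_t v)
{
    return static_cast<int8_t>(v) < 16 ? static_cast<uint8_t>(v + 16) : v;
}

void average_staged(uint8_t *res, const uint8_t *input)
{
    // Staging buffers: written one byte at a time, loaded in one go.
    alignas(16) uint8_t lo[16];
    alignas(16) uint8_t hi[16];

    for (int i = 0; i < 16; ++i) {
        lo[i] = adjust(input[i]);
        hi[i] = adjust(input[i + 16]);
    }

    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(lo));
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(hi));

    // pavgb rounds up: (a + b + 1) >> 1 per byte.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(res), _mm_avg_epu8(a, b));
}
```

Note that pavgb rounds up, whereas the `shr ax, 1` in the scalar loop truncates, so the two can differ by one when the sum is odd. The byte stores followed by a wide load are also part of the extra work mentioned above.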

Even with an optimized SSE version, the function is so short that any gain will probably be swamped by the function call overhead.

To get a performance win, you probably need to use intrinsics so that the compiler can understand the code and incorporate it into its own optimizations (e.g., inlining).

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

void average(uint8_t *res, uint8_t *input)
{
    auto boundary = _mm_set1_epi8(0x10);

    // process the first half: add 16 to every byte below 16 (signed compare)
    auto part1 = _mm_loadu_si128((__m128i *)input);
    auto adjust1 = _mm_and_si128(_mm_cmpgt_epi8(boundary, part1), boundary);
    auto adjusted1 = _mm_add_epi8(part1, adjust1);

    // process the second half
    auto part2 = _mm_loadu_si128((__m128i *)(input + 16));
    auto adjust2 = _mm_and_si128(_mm_cmpgt_epi8(boundary, part2), boundary);
    auto adjusted2 = _mm_add_epi8(part2, adjust2);

    // average them together (pavgb rounds up)
    auto result = _mm_avg_epu8(adjusted1, adjusted2);

    // save the answer
    _mm_storeu_si128((__m128i *)res, result);
}

For better performance, you probably want the function to return a __m128i directly so that the caller can compute with it immediately rather than having to read the result back from memory.
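
A minimal sketch of that variant follows. The names `average_vec` and `sum_of_averages` are hypothetical; the second function is only an assumed example of a caller that consumes the vector directly, here by summing the 16 averaged bytes with `_mm_sad_epu8`.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical variant of average() that hands back the vector itself.
static inline __m128i average_vec(const uint8_t *input)
{
    const __m128i boundary = _mm_set1_epi8(0x10);

    __m128i part1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input));
    __m128i part2 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input + 16));

    // Signed compare: lanes below 16 get 16 added.
    part1 = _mm_add_epi8(part1, _mm_and_si128(_mm_cmpgt_epi8(boundary, part1), boundary));
    part2 = _mm_add_epi8(part2, _mm_and_si128(_mm_cmpgt_epi8(boundary, part2), boundary));

    return _mm_avg_epu8(part1, part2);
}

// Assumed caller: sums the 16 averaged bytes without storing them first.
static inline unsigned sum_of_averages(const uint8_t *input)
{
    __m128i avg = average_vec(input);
    // psadbw against zero yields two partial sums, one per 8-byte half.
    __m128i sums = _mm_sad_epu8(avg, _mm_setzero_si128());
    return static_cast<unsigned>(_mm_cvtsi128_si32(sums)) +
           static_cast<unsigned>(_mm_extract_epi16(sums, 4));
}
```

Because both functions are `static inline` and visible to the compiler, the result can stay in a register between the average and whatever the caller does next.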
