Filling xmm register bytewise
I need to average 32 uint8_t values stored in one array, pairing element i with element i + 16 to get 16 results. For performance reasons I wanted to change the code below to use the pavgb instruction and the xmm registers. The problem is that I cannot copy 16 bytes at once using movdqu, because I do some calculations within a loop to get the values to average. The code below is a simplified version of the actual code I'm using.
;
; void average(uint8_t *res, uint8_t *input)
; rdi = res | res holds 16 values
; rsi = input | input holds 32 values
;
segment .text
global average
average:
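xor ecx, ecx ; loop index i = 0
xor eax, eax
xor edx, edx ; rdx instead of rbx: rbx is callee-saved in the System V ABI
.loop:
mov al, [rsi + rcx] ; al = input[i]
cmp al, 16
jge .endif
add al, 16
.endif:
mov dl, [rsi + rcx + 16] ; dl = input[i + 16]
cmp dl, 16
jge .endif2
add dl, 16
.endif2:
add ax, dx ; 16-bit add, so the sum of two bytes cannot overflow
shr ax, 1
mov [rdi], al ; store the average
inc rdi ; rsi is indexed by rcx, so only rdi is advanced
inc rcx
cmp rcx, 16
jl .loop
ret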
I would like to move this to the xmm registers, so that I can end with something like this:
pavgb xmm0, xmm1
movdqu [rdi], xmm0
For that I need to fill the xmm0 and xmm1 registers bytewise. Is there a way to make this work?
There is not much point in switching to the pavgb instruction here: the extra work needed to set up its operands byte by byte far exceeds what pavgb saves. Your existing code is fine.
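To make that setup cost concrete, here is a minimal sketch of what filling the registers bytewise could look like: each byte is computed in a scalar loop, written to a small staging buffer, and the buffer is then loaded into an xmm register. The names average_staged and adjust are hypothetical, and adjust only stands in for the real per-byte calculation (it uses an unsigned comparison, which is an assumption).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical per-byte step standing in for the real calculation:
// add 16 to any value below 16 (unsigned comparison assumed).
static inline uint8_t adjust(uint8_t v) {
    return v < 16 ? static_cast<uint8_t>(v + 16) : v;
}

// Fill two 16-byte staging buffers one byte at a time, then average
// them with a single pavgb (_mm_avg_epu8).
void average_staged(uint8_t *res, const uint8_t *input) {
    alignas(16) uint8_t lo[16];
    alignas(16) uint8_t hi[16];
    for (int i = 0; i < 16; ++i) {
        lo[i] = adjust(input[i]);
        hi[i] = adjust(input[i + 16]);
    }
    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(lo));
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(hi));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(res), _mm_avg_epu8(a, b));
}
```

The scalar loop is still there in full, and the 32 byte stores plus two vector loads are added on top of it, so the single pavgb has little left to save. Note also that pavgb rounds up, computing (a + b + 1) >> 1, whereas a plain shift by one truncates.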
Even with an optimized SSE version, the function is so short that any gain will probably be swamped by the function call overhead.
To get a performance win, you probably need to use intrinsics so that the compiler can understand the code and incorporate it into its own optimizations (e.g., inlining).
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

void average(uint8_t *res, uint8_t *input)
{
    auto boundary = _mm_set1_epi8(0x10);
    // Process the first half: add 16 to every byte that compares below 16 (signed)
    auto part1 = _mm_loadu_si128((__m128i *)input);
    auto adjust1 = _mm_and_si128(_mm_cmpgt_epi8(boundary, part1), boundary);
    auto adjusted1 = _mm_add_epi8(part1, adjust1);
    // Process the second half the same way
    auto part2 = _mm_loadu_si128((__m128i *)(input + 16));
    auto adjust2 = _mm_and_si128(_mm_cmpgt_epi8(boundary, part2), boundary);
    auto adjusted2 = _mm_add_epi8(part2, adjust2);
    // Average them together (pavgb, rounds up)
    auto result = _mm_avg_epu8(adjusted1, adjusted2);
    // Save the answer
    _mm_storeu_si128((__m128i *)res, result);
}
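The conditional "add 16 if below 16" is done without a branch: the compare produces 0xFF or 0x00 in each byte, and ANDing that mask with 0x10 turns it into +16 or +0. The sketch below shows the same step for one byte and for 16 bytes side by side. The helper names adjust_lane and adjust_vec are made up for illustration, and the scalar model assumes the usual two's-complement conversion to int8_t.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Scalar model of one byte lane. pcmpgtb is a signed compare, so bytes
// of 128 and above count as negative and also get 16 added.
static inline uint8_t adjust_lane(uint8_t v) {
    uint8_t mask = (static_cast<int8_t>(v) < 16) ? 0xFF : 0x00;
    return static_cast<uint8_t>(v + (mask & 0x10));  // wraps like paddb
}

// The same step on 16 lanes at once.
static inline __m128i adjust_vec(__m128i v) {
    const __m128i boundary = _mm_set1_epi8(0x10);
    __m128i mask = _mm_cmpgt_epi8(boundary, v);  // 0xFF where 16 > v (signed)
    return _mm_add_epi8(v, _mm_and_si128(mask, boundary));
}
```

If the input can contain values of 128 or more and those should be left alone, the signed compare is not what you want; that case would need a different mask.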
For better performance, you probably want the function to return a __m128i directly, so that the caller can compute with it immediately rather than having to read the result back from memory.
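A minimal sketch of that variant, assuming the same adjust-then-average logic. The names average_vec and average_then_max are hypothetical, and the caller is only an illustration of consuming the result without a round trip through memory.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical variant that returns the 16 averages in a register.
static inline __m128i average_vec(const uint8_t *input) {
    const __m128i boundary = _mm_set1_epi8(0x10);
    __m128i part1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input));
    __m128i part2 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input + 16));
    // Add 16 to every byte that compares below 16 (signed compare).
    part1 = _mm_add_epi8(part1, _mm_and_si128(_mm_cmpgt_epi8(boundary, part1), boundary));
    part2 = _mm_add_epi8(part2, _mm_and_si128(_mm_cmpgt_epi8(boundary, part2), boundary));
    return _mm_avg_epu8(part1, part2);
}

// Illustrative caller: clamps the averages from below, still in registers.
static inline __m128i average_then_max(const uint8_t *input, __m128i floor) {
    return _mm_max_epu8(average_vec(input), floor);
}
```

Because both functions are static inline and visible to the compiler, it is free to inline them into the caller and keep the intermediate vector in an xmm register.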