ARM NEON intrinsics: convert a D (64-bit) register to the low half of a Q (128-bit) register, leaving the upper half undefined
I'd like to be able to essentially typecast a uint8x8_t into a uint8x16_t with no overhead, leaving the upper 64 bits undefined. This is useful if you only care about the bottom 64 bits but wish to use 128-bit instructions, for example:
uint8x16_t data = (uint8x16_t)vld1_u8(src); // if you can somehow do this
uint8x16_t shifted = vextq_u8(oldData, data, 2);
From my understanding of ARM assembly, this should be possible, as the load can be issued to a D register and the result then interpreted as a Q register.
Some ways I can think of getting this working would be:
data = vcombine_u8(vld1_u8(src), vdup_n_u8(0));
- the compiler seems to go to the effort of setting the upper half to 0, even though this is never necessary
data = vld1q_u8(src);
- doing a 128-bit load works (and is fine in my case), but is it likely to be slower on processors with 64-bit NEON units?

I suppose there may be an icky case of partial dependencies in the CPU when only half a register is set like this, but I'd rather the compiler figure out the best approach here than be forced to use a 0 value.
Is there any way to do this?
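The first workaround above, combined with the vextq_u8 call from the example, can be sketched as one complete function. This is only an illustration: the name shift_in_low and the scalar fallback branch are assumptions that do not appear in the question, and the fallback exists only so the byte layout can be checked on a machine without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out = bytes 2..15 of old followed by src[0..1],
 * which is what vextq_u8(oldData, data, 2) produces. Only the low two
 * bytes of data are ever observed, so its upper half does not matter. */
void shift_in_low(const uint8_t old[16], const uint8_t src[8], uint8_t out[16]) {
    assert(old != NULL && src != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t oldData = vld1q_u8(old);
    /* 64-bit load; the upper half is zeroed explicitly (workaround 1). */
    uint8x16_t data = vcombine_u8(vld1_u8(src), vdup_n_u8(0));
    vst1q_u8(out, vextq_u8(oldData, data, 2));
#else
    /* Scalar stand-in (assumption) so the layout can be checked anywhere. */
    memcpy(out, old + 2, 14);
    memcpy(out + 14, src, 2);
#endif
}
```

Because the extract offset is 2, the result never depends on the upper 64 bits of data, which is why zeroing them is wasted work in this use case.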

On aarch32, you are completely at the compiler's mercy on this. (That's why I write NEON routines in assembly.)

On aarch64, on the other hand, it's pretty much automatic, since the upper 64 bits aren't directly accessible anyway.
The compiler will emit a trn1 instruction for vcombine, though.
To sum it up, there is always overhead involved on aarch64, while it's unpredictable on aarch32. If your aarch32 routine is simple and short, and thus doesn't need many registers, chances are good that the compiler assigns the registers cleverly, but that is VERY unlikely otherwise.
BTW, on aarch64, if you initialize the lower 64 bits, the CPU automatically sets the upper 64 bits to zero. I don't know if it costs extra time though. It cost me several days until I found out what had been wrong all along. So annoying!!!
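The zeroing behaviour on aarch64 can be observed with a small check. The following is a minimal sketch, assuming GCC or Clang inline assembly (the %d0 operand modifier selects the D view of the register); the function name upper_half_zero_after_d_load is made up for illustration, and on targets other than aarch64 it measures nothing and returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns 1 if a 64-bit load into a D register left
 * the upper 64 bits of the enclosing Q register zero, 0 if it did not,
 * and -1 when not built for aarch64 (nothing is measured in that case). */
int upper_half_zero_after_d_load(const uint8_t *src) {
    assert(src != NULL);
#if defined(__aarch64__)
    uint8x16_t v = vdupq_n_u8(0xFF); /* start with all 128 bits set */
    /* Load 8 bytes through the D view of the register that holds v. */
    __asm__ volatile("ldr %d0, [%1]" : "+w"(v) : "r"(src) : "memory");
    /* Inspect the upper 64-bit lane of the full Q register. */
    return vgetq_lane_u64(vreinterpretq_u64_u8(v), 1) == 0;
#else
    (void)src;
    return -1;
#endif
}
```

On an aarch64 build this should return 1 for any valid 8-byte buffer, since a write to the D view clears bits 127:64 of the register.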