
ARM NEON intrinsics convert D (64-bit) register to low half of Q (128-bit) register, leaving upper half undefined

Asked 2017-10-24 12:36:32. Tags: c, arm, intrinsics, neon

I'd like to be able to essentially typecast a uint8x8_t into a uint8x16_t with no overhead, leaving the upper 64 bits undefined. This is useful if you only care about the bottom 64 bits but wish to use 128-bit instructions, for example:

    uint8x16_t data = (uint8x16_t)vld1_u8(src); // if you can somehow do this
    uint8x16_t shifted = vextq_u8(oldData, data, 2);

From my understanding of ARM assembly, this should be possible, as the load can be issued to a D register and the result then interpreted as a Q register.

Some ways I can think of to get this working:

data = vcombine_u8(vld1_u8(src), vdup_n_u8(0)); - the compiler seems to go to the effort of setting the upper half to 0, even though this is never necessary.
data = vld1q_u8(src); - doing a 128-bit load works (and is fine in my case), but is likely slower on processors with 64-bit NEON units?
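For concreteness, here is a minimal sketch of the first workaround feeding the vextq_u8 call from the example above. The helper name shift_in_two, the HAVE_NEON macro, and the scalar fallback are assumptions added only so the snippet compiles on non-ARM hosts as well; they are not part of any NEON API. It also illustrates why the upper half is irrelevant here: with a shift of 2, vextq_u8 only consumes the two lowest bytes of its second operand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, used only in this sketch */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out = bytes 2..15 of old_bytes followed by the
 * first 2 bytes of src. Only 8 bytes need to be readable at src. */
static void shift_in_two(const uint8_t old_bytes[16], const uint8_t *src,
                         uint8_t out[16])
{
    assert(old_bytes != NULL && src != NULL && out != NULL);
#if HAVE_NEON
    uint8x16_t old_data = vld1q_u8(old_bytes);
    /* 64-bit load widened to 128 bits. The upper half is zero-filled
     * only because vcombine_u8 needs some value for it. */
    uint8x16_t data = vcombine_u8(vld1_u8(src), vdup_n_u8(0));
    /* Takes lanes 2..15 of old_data, then lanes 0..1 of data. */
    uint8x16_t shifted = vextq_u8(old_data, data, 2);
    vst1q_u8(out, shifted);
#else
    /* Scalar fallback producing the same bytes on non-NEON targets. */
    memcpy(out, old_bytes + 2, 14);
    memcpy(out + 14, src, 2);
#endif
}
```

Whether the zeroing of the upper half actually costs an instruction depends on the compiler and target, so it is worth checking the generated assembly rather than assuming either way.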

I suppose there may be an icky case of partial dependencies in the CPU when only half a register is set like this, but I'd rather the compiler figure out the best approach here than be forced to use a 0 value.

Is there any way to do this?

1 answer

On aarch32, you are completely at the compiler's mercy on this. (That's why I write NEON routines in assembly.)

On aarch64, on the other hand, it's pretty much automatic, since the upper 64 bits aren't directly accessible anyway.

The compiler will emit a trn1 instruction for vcombine, though.

To sum up: there is always some overhead on aarch64, while on aarch32 it's unpredictable. If your aarch32 routine is short and simple, and so doesn't need many registers, chances are good that the compiler assigns the registers cleverly; otherwise that is VERY unlikely.

BTW, on aarch64, if you initialize the lower 64 bits, the CPU automatically sets the upper 64 bits to zero. I don't know whether that costs extra time, though. It did cost me several days before I found out what had been wrong all along. So annoying!!!