ARM NEON intrinsics: convert D (64-bit) register to low half of Q (128-bit) register, leaving upper half undefined

I'd like to essentially typecast a uint8x8_t into a uint8x16_t with no overhead, leaving the upper 64 bits undefined. This is useful if you only care about the bottom 64 bits but wish to use 128-bit instructions, for example:

    uint8x16_t data = (uint8x16_t)vld1_u8(src); // if you can somehow do this
    uint8x16_t shifted = vextq_u8(oldData, data, 2);

From my understanding of ARM assembly, this should be possible, as the load can be issued to a D register and the result then interpreted as a Q register.

Some ways I can think of to get this working:

  • data = vcombine_u8(vld1_u8(src), vdup_n_u8(0)); - the compiler seems to go to the effort of setting the upper half to 0, even though this is never necessary
  • data = vld1q_u8(src); - doing a 128-bit load works (and is fine in my case), but is likely slower on processors with 64-bit NEON units?
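
For illustration, the first workaround can be wrapped up roughly as below. This is only a sketch: `load_low64` and `shift_in_two` are hypothetical names that do not appear in the question, and the scalar branch exists only so the snippet also builds on targets without NEON. The zero upper half is a placeholder, since the intrinsics used here offer no way to say "leave it undefined".

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: 64-bit load widened to a 128-bit vector.
 * Only the low 8 lanes are meaningful; the upper 8 are filled with
 * zeros because vcombine_u8 needs some value for the high half. */
static inline uint8x16_t load_low64(const uint8_t *src)
{
    return vcombine_u8(vld1_u8(src), vdup_n_u8(0));
}

/* Drops the first two bytes of old_data and appends the first two
 * bytes of src, i.e. vextq_u8(old, data, 2). */
void shift_in_two(const uint8_t old_data[16], const uint8_t src[8],
                  uint8_t out[16])
{
    uint8x16_t old  = vld1q_u8(old_data);
    uint8x16_t data = load_low64(src);
    vst1q_u8(out, vextq_u8(old, data, 2));
}

#else

/* Scalar fallback with the same result, for non-NEON targets. */
void shift_in_two(const uint8_t old_data[16], const uint8_t src[8],
                  uint8_t out[16])
{
    memcpy(out, old_data + 2, 14);  /* lanes 2..15 of old_data */
    memcpy(out + 14, src, 2);       /* lanes 0..1 of the new data */
}

#endif
```

Only the two lowest lanes of `data` reach the result here, which is why the contents of the upper half do not matter in this use case. Swapping `load_low64` for a plain `vld1q_u8(src)` gives the second workaround, provided 16 bytes are readable at `src`.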

I suppose there may be an icky case of partial dependencies in the CPU when only half a register is set like this, but I'd rather the compiler figure out the best approach here than be forced to use a 0 value.

Is there any way to do this?

On aarch32, you are completely at the compiler's mercy on this. (That's why I write NEON routines in assembly.)

On aarch64, on the other hand, it's pretty much automatic, since the upper 64 bits aren't directly accessible anyway.

The compiler will emit a trn1 instruction for vcombine, though.

To sum up, there is always overhead involved on aarch64, while it's unpredictable on aarch32. If your aarch32 routine is simple and short, so that not many registers are needed, chances are good that the compiler assigns the registers cleverly; otherwise it is VERY unlikely.

BTW, on aarch64, if you initialize the lower 64 bits of a register, the CPU automatically sets the upper 64 bits to zero. I don't know whether that costs extra time, though. It cost me several days to find out what had been going wrong all along. So annoying!!!
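
The zeroing behaviour can be observed with a small experiment along these lines. It is a sketch under assumptions: `load_d_register` is a made-up name, the aarch64 branch relies on GCC/Clang extended inline assembly with the `%d0` operand modifier, a little-endian target is assumed for the byte order, and the other branch merely models the same result in scalar code so the snippet builds elsewhere.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical experiment: fill a Q register with 0xFF, then write
 * only its D (low 64-bit) view and store all 128 bits back. */
void load_d_register(const uint8_t src[8], uint8_t out[16])
{
    uint8x16_t q = vdupq_n_u8(0xFF);   /* every lane starts as 0xFF */
    /* %d0 prints operand 0 under its D-register name, so this is a
     * plain 64-bit load such as "ldr d0, [x1]". */
    __asm__("ldr %d0, [%1]" : "+w"(q) : "r"(src) : "memory");
    vst1q_u8(out, q);                  /* upper 8 bytes come out as 0 */
}

#else

/* Scalar model of the same result for other targets. */
void load_d_register(const uint8_t src[8], uint8_t out[16])
{
    memcpy(out, src, 8);     /* low half: the loaded data */
    memset(out + 8, 0, 8);   /* high half: zeroed by the D-register write */
}

#endif
```

If the upper half were preserved instead, bytes 8 to 15 of `out` would still read 0xFF after the call.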

