Writing assembly code in C++

Question

I have the following code in C++:

inline void armMultiply(const float* __restrict__ src1,
                        const float* __restrict__ src2,
                        float* __restrict__ dst)
{
    __asm volatile(
                 "vld1.f32 {q0}, [%[src1]:128]!      \n\t"
                 :
                 :[dst] "r" (dst), [src1] "r" (src1), [src2] "r" (src2)
                 );
}

Why do I get the error "vector register expected"?

Answer 1

You're getting this error because your inline assembly is written for 32-bit ARM, but you're compiling for 64-bit ARM (with clang; with gcc you would have gotten a different error).

(Inline) assembly differs between 32-bit and 64-bit ARM, so you need to guard it with e.g. #if defined(__ARM_NEON__) && !defined(__aarch64__), or, if you want different assembly for 64-bit and 32-bit, with #ifdef __aarch64__ .. #elif defined(__ARM_NEON__), etc.

As others commented, unless you really need to hand-tune the generated assembly, intrinsics can be just as good (and in some cases better than what you write yourself). You can e.g. do the two vld1_f32 calls, one vmul_f32 and one vst1_f32 via intrinsics just fine (or the q-suffixed vld1q_f32, vmulq_f32 and vst1q_f32 for the 128-bit registers used in the question).
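A minimal sketch of the intrinsics route, assuming four floats per call as in the question's q0 load. The function name multiply4 and the scalar fallback are illustrative additions, not part of the original post; the fallback only exists so the sketch also builds on targets without NEON.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (name not from the original post): multiplies four
// floats from src1 by four floats from src2 and stores the products in dst.
inline void multiply4(const float* __restrict__ src1,
                      const float* __restrict__ src2,
                      float* __restrict__ dst)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // The same intrinsics compile for both 32-bit and 64-bit ARM,
    // so no per-architecture #ifdef is needed here.
    float32x4_t a = vld1q_f32(src1);   // load 4 floats from src1
    float32x4_t b = vld1q_f32(src2);   // load 4 floats from src2
    vst1q_f32(dst, vmulq_f32(a, b));   // multiply lane-wise and store
#else
    // Scalar fallback (an assumption for portability, not from the post).
    for (std::size_t i = 0; i < 4; ++i) {
        dst[i] = src1[i] * src2[i];
    }
#endif
}
```

Because the compiler picks the registers and schedules the loads itself, this version needs no constraint or clobber lists.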

EDIT:

The corresponding inline assembly line for loading into a SIMD register on 64-bit ARM would be:

"ld1 {v0.4s}, [%[src1]], #16      \n\t"

To support both, your function could look like this instead:

inline void armMultiply(const float* __restrict__ src1,
                        const float* __restrict__ src2,
                        float* __restrict__ dst)
{
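#ifdef __aarch64__
    // AArch64: load four floats into v0, then advance src1 by 16 bytes.
    __asm volatile(
                 "ld1 {v0.4s}, [%[src1]], #16      \n\t"
                 :[src1] "+r" (src1)   // read-write: the post-increment modifies it
                 :[dst] "r" (dst), [src2] "r" (src2)
                 :"v0", "memory"       // v0 is overwritten; memory is read
                 );
#elif defined(__ARM_NEON__)
    // 32-bit ARM: load four floats into q0, then advance src1 (writeback "!").
    __asm volatile(
                 "vld1.f32 {q0}, [%[src1]:128]!      \n\t"
                 :[src1] "+r" (src1)   // read-write: the writeback modifies it
                 :[dst] "r" (dst), [src2] "r" (src2)
                 :"q0", "memory"       // q0 is overwritten; memory is read
                 );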
#else
#error this requires neon
#endif
}

Answer 2

Assuming we're talking about GCC, the docs say that you should be using "w" ("Floating point or SIMD vector register") instead of "r" ("register operand is allowed provided that it is in a general register") as the constraint.

https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints

https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Simple-Constraints.html#Simple-Constraints
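A rough sketch of where the "w" constraint applies, assuming GCC or clang: it fits operands that are themselves vector values, whereas pointer operands used as base addresses (like src1 in the question) still live in general registers. The function name multiply4Asm and the non-ARM fallback are illustrative additions, not from the original post.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (name not from the original post): multiplies four
// floats lane-wise, using inline assembly with "w" operands on ARM.
inline void multiply4Asm(const float* a, const float* b, float* out)
{
#if defined(__aarch64__)
    float32x4_t va = vld1q_f32(a), vb = vld1q_f32(b), vr;
    // "w" asks the compiler to place each operand in a SIMD register;
    // %0, %1, %2 are printed as v<n>, so ".4s" selects four 32-bit lanes.
    __asm("fmul %0.4s, %1.4s, %2.4s" : "=w"(vr) : "w"(va), "w"(vb));
    vst1q_f32(out, vr);
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vld1q_f32(a), vb = vld1q_f32(b), vr;
    // On 32-bit ARM the %q modifier prints the operand as a q register.
    __asm("vmul.f32 %q0, %q1, %q2" : "=w"(vr) : "w"(va), "w"(vb));
    vst1q_f32(out, vr);
#else
    // Scalar fallback (an assumption so the sketch builds off-ARM).
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] * b[i];
    }
#endif
}
```

Letting the compiler allocate the vector registers this way avoids hard-coding q0 or v0 and the clobber entries that come with them.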

2 answers

Answer 1: 1, accepted, 2017-02-28 07:34:26

Answer 2: 0, 2017-02-27 22:28:58
