
ARM inline assembly code with error “impossible constraint in asm”

I am trying to optimize the following code, complex.cpp:

typedef struct {
    float re;
    float im;
} dcmplx;

dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
{
    int    i;
    dcmplx    z, xout;

    xout.re = xout.im = 0.0;
    asm volatile (
    "movs r3, #0\n\t"
    ".loop:\n\t"
    "vldr s11, [%[hat], #4]\n\t"
    "vldr s13, [%[hat]]\n\t"
    "vneg.f32 s11, s11\n\t"
    "vldr s15, [%[buf], #4]\n\t"
    "vldr s12, [%[buf]]\n\t"
    "vmul.f32 s14, s15, s13\n\t"
    "vmul.f32 s15, s11, s15\n\t"
    "adds %[hat], #8\n\t"
    "vmla.f32 s14, s11, s12\n\t"
    "vnmls.f32 s15, s12, s13\n\t"
    "adds %[buf], #8\n\t"
    "vadd.f32 s1, s1, s14\n\t"
    "vadd.f32 s0, s0, s15\n\t"
    "adds r3, r3, #1\n\t"
    "cmp r3, r0\n\t"
    "bne .loop\n\t"
    : "=r"(xout)
    : [hat]"r"(hat),[buf]"r"(buf) 
    : "s0","cc"
    );
    return xout;
}

When I compile it with "arm-linux-gnueabihf-g++ -c complex.cpp -o complex.o -mfpu=neon", I get the following error: impossible constraint in 'asm'.

When I comment out "=r"(xout), the compiler doesn't complain, but how can I get the result in register 's0' into xout?

Also, how does this work if r0 holds the return value but the return type is a more complicated structure, given that r0 is only a 32-bit register?

Here is the original C code:

dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
{
    int    i;
    dcmplx    z, xout;
    xout.re = xout.im = 0.0;
    for(int i = 0; i < len; i++) {
        z = BI_dcmul(BI_dconjg(hat[i]),buf[i]);
        xout = BI_dcadd(xout,z);
    }
    return xout;
}
dcmplx BI_dcmul(dcmplx x, dcmplx y)
{
    dcmplx    z;
    z.re = x.re * y.re - x.im * y.im;
    z.im = x.im * y.re + x.re * y.im;
    return z;
}
dcmplx BI_dconjg(dcmplx x)
{
    dcmplx    y;
    y.re = x.re;
    y.im = -x.im;
    return y;
}
dcmplx BI_dcadd(dcmplx x, dcmplx y)
{
    dcmplx    z;
    z.re = x.re + y.re;
    z.im = x.im + y.im;
    return z;
}

Your inline assembly code makes a number of mistakes:

  • It tries to use a 64-bit structure as an operand with a 32-bit output register ( "=r" ) constraint. This is what gives you the error.
  • It doesn't use that output operand anywhere.
  • It doesn't tell the compiler where the output actually is (S0/S1).
  • It doesn't tell the compiler that len is supposed to be an input.
  • It clobbers a number of registers (R3, S11, S12, S13, S14, S15) without telling the compiler.
  • It uses a label, .loop, that unnecessarily prevents the compiler from inlining your code in multiple places.
  • It doesn't actually appear to be the equivalent of the C++ code you've shown; it calculates something else instead.

I'm not going to explain how to fix all these mistakes, because you shouldn't be using inline assembly. You can write your code in C++ and let the compiler do the vectorization.

For example, compiling the following code, which is equivalent to your example C++ code, with GCC 4.9 and the -O3 -funsafe-math-optimizations options:

dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
{
    int    i;
    dcmplx xout;
    xout.re = xout.im = 0.0;
    for (i = 0; i < len; i++) {
        xout.re += hat[i].re * buf[i].re + hat[i].im * buf[i].im;
        xout.im += hat[i].re * buf[i].im - hat[i].im * buf[i].re;
    }
    return xout;
}

generates the following assembly as its inner loop:

.L97:
    add lr, lr, #1
    cmp ip, lr
    vld2.32 {d20-d23}, [r5]!
    vld2.32 {d24-d27}, [r4]!
    vmul.f32    q15, q12, q10
    vmul.f32    q14, q13, q10
    vmla.f32    q15, q13, q11
    vmls.f32    q14, q12, q11
    vadd.f32    q9, q9, q15
    vadd.f32    q8, q8, q14
    bhi .L97

Judging by your inline assembly code, the compiler likely generated better code than you would have come up with if you had tried to vectorize it yourself.
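The claim that the flattened loop is equivalent to the original helper-based code can be checked directly. The following is a minimal sketch, not part of the original post: the names ConvHelpers and ConvFlat are hypothetical, and the pointers are marked const purely for illustration. It runs both formulations on the same data so their results can be compared.

```cpp
#include <cassert>

struct dcmplx {
    float re;
    float im;
};

// Helpers as in the question, written compactly.
static dcmplx BI_dconjg(dcmplx x) { return {x.re, -x.im}; }
static dcmplx BI_dcadd(dcmplx x, dcmplx y) { return {x.re + y.re, x.im + y.im}; }
static dcmplx BI_dcmul(dcmplx x, dcmplx y)
{
    return {x.re * y.re - x.im * y.im, x.im * y.re + x.re * y.im};
}

// Hypothetical name: the original formulation, sum of conj(hat[i]) * buf[i].
dcmplx ConvHelpers(int len, const dcmplx *hat, const dcmplx *buf)
{
    dcmplx xout = {0.0f, 0.0f};
    for (int i = 0; i < len; i++) {
        xout = BI_dcadd(xout, BI_dcmul(BI_dconjg(hat[i]), buf[i]));
    }
    return xout;
}

// Hypothetical name: the flattened formulation with the helpers expanded by hand.
dcmplx ConvFlat(int len, const dcmplx *hat, const dcmplx *buf)
{
    dcmplx xout = {0.0f, 0.0f};
    for (int i = 0; i < len; i++) {
        xout.re += hat[i].re * buf[i].re + hat[i].im * buf[i].im;
        xout.im += hat[i].re * buf[i].im - hat[i].im * buf[i].re;
    }
    return xout;
}
```

For hat = {1+2i, 3+4i} and buf = {5+6i, 7+8i}, both functions should return 70-8i, since conj(1+2i)(5+6i) = 17-4i and conj(3+4i)(7+8i) = 53-4i.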

The -funsafe-math-optimizations option is necessary because the NEON instructions aren't fully IEEE 754 conformant. As the GCC documentation states:

If the selected floating-point hardware includes the NEON extension (e.g. -mfpu='neon'), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

I should also note that the compiler generates code almost as good as the above if you don't roll your own complex type, as in the following example:

#include <complex>
typedef std::complex<float> complex;
complex ComplexConv_std(int len, complex *hat, complex *buf)
{
    int    i;
    complex xout(0.0f, 0.0f); 
    for (i = 0; i < len; i++) {
        xout += std::conj(hat[i]) * buf[i];
    }
    return xout;
}

One advantage of using your own type, however, is that you can improve the code the compiler generates by making one small change to how you declare struct dcmplx:

typedef struct {
    float re;
    float im;
} __attribute__((aligned(8))) dcmplx;

Declaring that the type must be 8-byte (64-bit) aligned lets the compiler skip the run-time check for suitable alignment, along with the fallback to the slower scalar implementation that the check would otherwise guard.
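The effect of the alignment request can be checked at compile time. The following is a minimal sketch, not from the original answer: it uses the standard C++11 alignas specifier in place of the GCC-specific __attribute__ (swapped in for portability), and the names dcmplx_plain, dcmplx_aligned and is_aligned8 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Plain layout: alignment follows the float members (4 bytes on typical ABIs).
struct dcmplx_plain {
    float re;
    float im;
};

// Hypothetical aligned variant. alignas(8) is the standard C++11 way to make
// the request that __attribute__((aligned(8))) makes in GCC.
struct alignas(8) dcmplx_aligned {
    float re;
    float im;
};

// The size stays at two floats, so arrays remain densely packed.
static_assert(sizeof(dcmplx_aligned) == 8, "no padding expected");
// Every object, including every array element, starts on an 8-byte boundary.
static_assert(alignof(dcmplx_aligned) == 8, "8-byte alignment expected");

// Hypothetical helper: true when p sits on an 8-byte boundary.
inline bool is_aligned8(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 8 == 0;
}
```

Because the element size equals the alignment, every element of a dcmplx_aligned array is aligned, not only the first, and the compiler can rely on that for any pointer of this type.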

Now, hypothetically, let's say you were unsatisfied with how GCC vectorized your code and thought you could do better. Would this justify using inline assembly? No, the next thing to try is the ARM NEON intrinsics. Using intrinsics is just like normal C++ programming; you don't have to worry about a bunch of special rules you need to follow. For example, here's how I converted the vectorized assembly above into untested code that uses intrinsics:

#include <assert.h>
#include <stdint.h>
#include <arm_neon.h>
dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
{
    int    i;
    dcmplx xout;

    /* len must be a multiple of 4 and both arrays 8-byte aligned */
    assert(len % 4 == 0);
    assert(((uintptr_t) hat % 8) == 0);
    assert(((uintptr_t) buf % 8) == 0);

    /* the accumulators must start at zero */
    float32x4_t re = vdupq_n_f32(0.0f);
    float32x4_t im = vdupq_n_f32(0.0f);
    for (i = 0; i < len; i += 4) {
        /* de-interleave: val[0] = four real parts, val[1] = four imaginary parts */
        float32x4x2_t h = vld2q_f32(&hat[i].re);
        float32x4x2_t b = vld2q_f32(&buf[i].re);
        /* re += h.re * b.re + h.im * b.im */
        re = vaddq_f32(re, vmlaq_f32(vmulq_f32(h.val[0], b.val[0]),
                                     b.val[1], h.val[1]));
        /* im += h.re * b.im - h.im * b.re */
        im = vaddq_f32(im, vmlsq_f32(vmulq_f32(h.val[0], b.val[1]),
                                     b.val[0], h.val[1]));
    }
    /* horizontal sum of the four lanes of each accumulator */
    float32x2_t re_tmp = vadd_f32(vget_low_f32(re), vget_high_f32(re));
    float32x2_t im_tmp = vadd_f32(vget_low_f32(im), vget_high_f32(im));
    xout.re = vget_lane_f32(vpadd_f32(re_tmp, re_tmp), 0);
    xout.im = vget_lane_f32(vpadd_f32(im_tmp, im_tmp), 0);
    return xout;
}
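The asserts above restrict len to multiples of 4. The following is a minimal sketch, not from the original answer, of how a scalar tail loop could lift that restriction. The name ComplexConvTail is hypothetical, and the __ARM_NEON guard is an assumption added so the same file also builds on targets without NEON, where only the scalar loop runs. Like the code above, the NEON path should be treated as untested.

```cpp
#include <cassert>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

struct dcmplx {
    float re;
    float im;
};

// Hypothetical name: sums conj(hat[i]) * buf[i] for any len >= 0.
dcmplx ComplexConvTail(int len, const dcmplx *hat, const dcmplx *buf)
{
    dcmplx xout = {0.0f, 0.0f};
    int i = 0;
#if defined(__ARM_NEON)
    // Vector part: four complex elements per iteration.
    float32x4_t re = vdupq_n_f32(0.0f);
    float32x4_t im = vdupq_n_f32(0.0f);
    for (; i + 4 <= len; i += 4) {
        float32x4x2_t h = vld2q_f32(&hat[i].re);  // val[0] = re, val[1] = im
        float32x4x2_t b = vld2q_f32(&buf[i].re);
        re = vmlaq_f32(re, h.val[0], b.val[0]);   // re += h.re * b.re
        re = vmlaq_f32(re, h.val[1], b.val[1]);   // re += h.im * b.im
        im = vmlaq_f32(im, h.val[0], b.val[1]);   // im += h.re * b.im
        im = vmlsq_f32(im, h.val[1], b.val[0]);   // im -= h.im * b.re
    }
    // Horizontal sums seed the scalar accumulators.
    float32x2_t re2 = vadd_f32(vget_low_f32(re), vget_high_f32(re));
    float32x2_t im2 = vadd_f32(vget_low_f32(im), vget_high_f32(im));
    xout.re = vget_lane_f32(vpadd_f32(re2, re2), 0);
    xout.im = vget_lane_f32(vpadd_f32(im2, im2), 0);
#endif
    // Scalar tail: the remaining 0 to 3 elements, or everything without NEON.
    for (; i < len; i++) {
        xout.re += hat[i].re * buf[i].re + hat[i].im * buf[i].im;
        xout.im += hat[i].re * buf[i].im - hat[i].im * buf[i].re;
    }
    return xout;
}
```

With len = 5, a NEON build would process the first four elements in the vector loop and the last one in the tail. Summation order differs between the two paths, so results on general floating-point data may differ in the last bits.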

Finally, if this wasn't good enough and you needed to squeeze out every bit of performance you could, it still wouldn't be a good idea to use inline assembly. Your last resort should be regular assembly instead. Since you're rewriting most of the function in assembly, you might as well write it completely in assembly. That means you don't have to worry about telling the compiler about everything you're doing in the inline assembly. You only need to conform to the ARM ABI, which can be tricky enough, but is a lot easier than getting everything correct with inline assembly.
