Debugging iterative Fibonacci (hand-converted from C to RISC-V)

Question

I'm trying to learn assembly, so I tried to convert this C code to RISC-V assembly code:

int Fib_Iter(int x){

int Fib_f, Fib_s, temp, Result;
Fib_f = 0;
Fib_s = 1;
if (x == 0)
    Result = Fib_f;
else if (x == 1)
    Result = Fib_s;
else
{
    while (x >= 2)
    {
        temp = Fib_f + Fib_s;
        Fib_f = Fib_s;
        Fib_s = temp;
        x--;
    }
    Result = Fib_s;
}
return Result;
}

And this is my RISC-V assembly code:

   fib_iter:
    #temp in x5,
    #f_f in x6
    #f_s in x7
    #x in x12
    #res in x11
    #x28 =1
    #x29=2
    addi sp,sp,-16
    sw x5,0(sp)
    sw x6,4(sp)
    sw x7,8(sp)
    sw x11,12(sp)              #saving x5,x6,x7,x11 in the stack
    addi x28,x0,1             #adding 1 to x28 to use it in the 2ndif statment
    addi x29,x0,2              #adding 2 to x28 to use it in the 3rd if statment
    bne x12,x0,lb1             #first if statment to check if x==0
    add x11,x6,x0               #if x !=0 make result(x11) equal to (fb_f)x6
    lb1:
    bne x12,x28,lb2           #2nd if statement to check if x==1
    add x11,x7,x0            #if x !=1 make result(x11) equal to (fb_s)x6
    lb2:                      #if it equals to 1 start the loop
    loop:
    add x5,x6,x7             #just an add's 
    addi x6,x7,0
    addi x7,x5,0
    addi x12,x12,-1
    bge x12,x29,loop       #check if x >2 if yes go to the loop else 
    addi x11,x7,0          #result(x11)=x7
    lw x5,0(sp)            #load all the registers from the stack
    lw x6,4(sp)
    lw x7,8(sp)
    lw x11,12(sp)
    addi sp,sp,16         #return the stack pointer to its original place
    jalr x0,0(x1)

But I'm not getting the right value in register x11 when using the Venus simulator.

When I call it with the value 4, I get 4, but the right answer is 3.

Answer 1

You have pseudo code in C, which is great.

Comments:

There are some missing statements in the translation to assembly.
The if-then-else statements aren't right: only one of the then/else parts should be executed, and you have to tell the processor that by jumping over the else part at the end of the then part.
The register usage is off from the standard calling convention. We expect the first integer parameter in a0, so we would expect your x to be in a0. The return value should also be in a0. There's no need to save t0, t1, t2, a1: they are call-clobbered, just like t3 and t4 (which you aren't saving and don't have to).

If you want to have your own calling convention, that's fine, but you're returning a value in a1 and also restoring a1 from the stack, and those don't make sense together.

See comments inline:

    int Fib_Iter(int x) {   <------------- x is passed in a0
        int Fib_f, Fib_s, temp, Result;
        Fib_f = 0;          <------------- where is this line in the assembly??
        Fib_s = 1;          <------------- where is this line in the assembly??
        if (x == 0)
            Result = Fib_f;
        else if (x == 1)
            Result = Fib_s;
        else
        {
            while (x >= 2)
            {
                temp = Fib_f + Fib_s;
                Fib_f = Fib_s;
                Fib_s = temp;
                x--;
            }
            Result = Fib_s;
        }
        return Result;   <--------- return value goes in a0
    }

See comments inline:

   fib_iter:
    #temp in x5,
    #f_f in x6
    #f_s in x7
    #x in x12       <---- x should be found in a0
    #res in x11     <---- return value should be put in a0 (at the end of the function)
    #x28 =1
    #x29=2
    addi sp,sp,-16  <---- no stack space needed
    sw x5,0(sp)     <---- no need to save t0
    sw x6,4(sp)     <---- no need to save t1
    sw x7,8(sp)     <---- no need to save t2
    sw x11,12(sp)   <---- no need to save a1
    addi x28,x0,1
    addi x29,x0,2
    bne x12,x0,lb1
    add x11,x6,x0   <---- then part, good
                    <---- missing code to skip else part
                       after executing a then part the code
                       (should skip the else part)
                       should resume the logically next thing after the if
    lb1:
    bne x12,x28,lb2
    add x11,x7,x0
                    <---- missing code to skip else part
    lb2:
    loop:
    add x5,x6,x7
    addi x6,x7,0
    addi x7,x5,0
    addi x12,x12,-1
    bge x12,x29,loop
    addi x11,x7,0
    lw x5,0(sp)      <---- no need to reload t0
    lw x6,4(sp)      <---- no need to reload t1
    lw x7,8(sp)      <---- no need to reload t2
    lw x11,12(sp)    <---- no need to reload a1 (this also clobbers current a1 return value)
    addi sp,sp,16    <---- no stack space is needed
    jalr x0,0(x1)
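
As a rough sketch of what the fixed control flow looks like, here is the question's C rewritten with explicit gotos so that each line maps onto roughly one branch or ALU instruction. The function name fib_iter_goto and the label names are made up for this illustration; the point is the two "goto done" jumps that skip the else parts, plus the two initializations that were missing from the assembly.

```c
/* Hypothetical goto-style rewrite of Fib_Iter; names are illustrative only. */
int fib_iter_goto(int x)
{
    int fib_f = 0;              /* missing in the assembly: Fib_f = 0 */
    int fib_s = 1;              /* missing in the assembly: Fib_s = 1 */
    int temp, result;

    if (x != 0) goto check_one; /* like: bne x, zero, check_one */
    result = fib_f;             /* then part */
    goto done;                  /* skip the else parts */

check_one:
    if (x != 1) goto loop_test; /* like: bne x, one, loop_test */
    result = fib_s;             /* then part */
    goto done;                  /* skip the else part */

loop_test:
    if (x < 2) goto loop_end;   /* while (x >= 2) */
    temp = fib_f + fib_s;
    fib_f = fib_s;
    fib_s = temp;
    x--;
    goto loop_test;

loop_end:
    result = fib_s;

done:
    return result;
}
```

With this structure, calling it with 4 gives 3, as the question expects.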

Answer 2

Just for the record, your C can be simplified. You don't need to separately check for x == 0 and x == 1; you can do if (x < 2) return x; to return either 0 or 1.

(I assume you don't intend to handle negative inputs, so unsigned might have been a good choice. Your C returns 1 for negative x, reaching the loop but running 0 iterations, leaving Fib_s unmodified. But I assume that's not important behaviour.)
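
To make the negative-input difference concrete, here is a small side-by-side sketch. The names fib_iter_original and fib_iter_simplified are invented for this comparison; the first keeps the question's control flow, the second uses the if (x < 2) return x; shortcut.

```c
/* Control flow as in the question: negative x falls through to the
   else branch, runs the loop zero times, and returns Fib_s == 1. */
int fib_iter_original(int x)
{
    int fib_f = 0, fib_s = 1, temp, result;
    if (x == 0) {
        result = fib_f;
    } else if (x == 1) {
        result = fib_s;
    } else {
        while (x >= 2) {
            temp = fib_f + fib_s;
            fib_f = fib_s;
            fib_s = temp;
            x--;
        }
        result = fib_s;
    }
    return result;
}

/* Simplified early return: negative x is returned unchanged. */
int fib_iter_simplified(int x)
{
    if (x < 2)
        return x;
    int fib_f = 0, fib_s = 1;
    while (x >= 2) {
        int temp = fib_f + fib_s;
        fib_f = fib_s;
        fib_s = temp;
        x--;
    }
    return fib_s;
}
```

The two agree for every x >= 0 and differ only for negative x, where the original returns 1 and the simplified version returns x itself.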

A minimal implementation in asm can be much simpler than your version. This is a leaf function (no calls to other functions), so we can use call-clobbered ("temporary") registers for everything. I used the "ABI names" for registers to help keep track of which are traditionally used for arg-passing, and which are call-clobbered vs. call-preserved.

Actually I got good asm from clang, for this C:

int Fib_Iter(int x){
    if (x < 2) 
        return x;

    int Fib_f = 0, Fib_s = 1;
    while (x >= 2) {
        int temp = Fib_f + Fib_s;
        Fib_f = Fib_s;
        Fib_s = temp;
        x--;
    }
    return Fib_s;
}

Godbolt compiler explorer, RISC-V clang -O3

# clang -O3 output:
# arg:     int x      in  a0
# returns: int Fib(x) in a0

    Fib_Iter:
        addi    a1, zero, 2
        blt     a0, a1, .LBB0_3         # if(x<2) goto ret with x still in a0 as the retval
                # otherwise fall through to the rest and init vars
        mv      a3, zero                # Fib_f = 0
        addi    a2, a0, 1               #  tmpcount = x+1  compiler invented this
        addi    a0, zero, 1             # Fib_s = 1
          # x>=2 is known to be true on first iteration so a standard do{}while() loop structure works
.LBB0_2:                                # do{
        add     a4, zero, a0                # tmp = Fib_s
        addi    a2, a2, -1                  # tmpcount--
        add     a0, a0, a3                  # Fib_s += Fib_f
        add     a3, zero, a4                # Fib_f = tmp
        blt     a1, a2, .LBB0_2         # }while(2<tmpcount);
.LBB0_3:
        ret

The same logic should work for unsigned, avoiding the weirdness of returning a negative x. clang compiles it somewhat differently with unsigned types, but I don't think that's necessary.

The tmpcount = x+1 can probably be avoided by using ble (reversed operands for bge) instead of blt, so we can use 2 and x directly, saving another instruction.

Fibonacci unrolls very nicely: a += b; b += a; takes one instruction per step, not 3. Checking the branch condition at each step could actually be best for static code size, as well as much better for dynamic instruction count. (Related: an x86 asm answer that stores an array of Fibonacci values, including an unrolled version that only checks the branch condition once per loop, using clever startup to handle odd vs. even before entering the loop.)

(Of course, if you're optimizing for non-tiny n, evaluating Fibonacci(n) can be done in log2(n) shift and multiply / add steps, instead of n additions, using fancier math.)
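
One common form of that "fancier math" is the fast-doubling identities F(2k) = F(k) * (2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2. The sketch below is only one possible way to do it, not something taken from the answer: the name fib_fast_doubling, the 64-bit return type, and the fixed 32-step loop over the bits of n are all choices made for this example. Results are exact up to n = 93 and wrap modulo 2^64 beyond that.

```c
#include <stdint.h>

/* Fast-doubling Fibonacci: walks the bits of n from the top,
   keeping (a, b) = (F(k), F(k+1)) for the prefix k read so far. */
uint64_t fib_fast_doubling(uint32_t n)
{
    uint64_t a = 0;   /* F(0) */
    uint64_t b = 1;   /* F(1) */

    for (int bit = 31; bit >= 0; bit--) {
        /* Doubling step: k -> 2k. Leading zero bits leave (0, 1) unchanged. */
        uint64_t c = a * (2 * b - a);   /* F(2k)   */
        uint64_t d = a * a + b * b;     /* F(2k+1) */
        a = c;
        b = d;

        /* If this bit of n is set: k -> 2k + 1, one ordinary Fibonacci step. */
        if ((n >> bit) & 1u) {
            uint64_t t = a + b;
            a = b;
            b = t;
        }
    }
    return a;
}
```

A tuned version would skip the leading zero bits of n instead of always doing 32 steps; this one keeps the loop simple.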

This is with an unroll that just repeats the loop condition. The loop exit logic is non-trivial to verify for correctness, though, so it's shorter but not simpler.

# arg:   unsigned n  in  a0
# returns:    Fib(n) in  a0

fib_unrolled:
        addi    t2, zero, 2
        bltu    a0, t2, .Lsmall_n         # if(n<2) return n
                # otherwise fall through
        mv      t0, zero               # a=0
        addi    t1, zero, 1            # b=1
             # known: n>=2  (unsigned) before first iteration
.Lloop:                                # do{
        beq     a0, t2, .Lfirst_out         # if(n==2) return a+b;
        addi    a0, a0, -2                  # n-=2
        add     t0, t0, t1                  # a += b
        add     t1, t1, t0                  # b += a
        bgeu    a0, t2, .Lloop         # }while(n >= 2);

        mv      a0, t1
.Lsmall_n:
        ret

.Lfirst_out:
        add      a0, t0, t1      # add instead of mv so the beq can be earlier
        ret

I was able to get clang to almost exactly reproduce this from C source (with different register numbers, but exactly the same loop order), including putting the add a+b block after the regular fall-through ret. It schedules the instructions in the loop body better than I did, separating the two fib sequence additions, if we're assuming a dual-issue in-order pipeline. However, clang still insists on wasting an instruction loading a 1 constant by turning n >= 2 into n > 1; RISC-V has bgeu as well as bltu ( https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf ), and the code already has 2 in a register.

unsigned fib_unroll(unsigned n){
    if (n < 2) 
        return n;

    unsigned a = 0, b = 1;
    do {
        if (n == 2) return a+b;
        a += b;
        n -= 2;          // clang schedules this way, better for an in-order pipeline
        b += a;
    }while (n >= 2);
    return b;
}
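
Since the exit logic is the hard part to reason about, one pragmatic way to gain confidence is to compare the unrolled version against the plain loop over a range of inputs. The following is a minimal sketch of that idea; fib_ref and count_mismatches are hypothetical helper names, and fib_unroll is copied from the C above.

```c
/* Straightforward reference: n single steps. */
unsigned fib_ref(unsigned n)
{
    unsigned a = 0, b = 1;
    while (n--) {
        unsigned t = a + b;
        a = b;
        b = t;
    }
    return a;
}

/* The unrolled version from the answer. */
unsigned fib_unroll(unsigned n)
{
    if (n < 2)
        return n;

    unsigned a = 0, b = 1;
    do {
        if (n == 2) return a + b;
        a += b;
        n -= 2;
        b += a;
    } while (n >= 2);
    return b;
}

/* Counts inputs in [0, limit) where the two versions disagree. */
unsigned count_mismatches(unsigned limit)
{
    unsigned mismatches = 0;
    for (unsigned n = 0; n < limit; n++) {
        if (fib_ref(n) != fib_unroll(n))
            mismatches++;
    }
    return mismatches;
}
```

Calling count_mismatches(1000) should return 0; values past Fib(47) wrap around in 32-bit unsigned arithmetic, but both versions wrap the same way, so the comparison stays meaningful.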

A more clever unroll: one branch inside the loop. Init the two vars based on count being odd or even

If we want to always do an even number of additions, we can arrange to start with either 0, 1 or 1, 0, depending on n & 1. Starting with 1, 0 means we first do 1 += 0, basically skipping an iteration. (I originally came up with this trick for this x86 Fibonacci answer.)

unsigned fib_unroll2_simpler(unsigned n){
    if (n < 2) 
        return n;
    unsigned b = n&1;  // current
    unsigned a = b^1;  // prev
    // start with 0,1 or 1,0 to get either 1 or 2 after two additions
    do {
        n -= 2;
        a += b;
        b += a;
    }while (n >= 2);
    return b;
}

This has a long data dependency from n to the result, and for smallish n it does some basically wasted work. That's not great on wide out-of-order exec machines for small n. So it's interesting, but for a real use-case you might still want a table lookup for small-n starting points. clang does a very reasonable job, but wastes some instructions around the start:

fib_unroll2_simpler:
        addi    a2, zero, 2
        add     a1, zero, a0
        bgeu    a0, a2, .LBB0_2
        add     a0, zero, a1            # copy `n` back into a0 where it already was?!?
        ret
.LBB0_2:                                # the non-tiny n common case has a taken branch
        andi    a0, a1, 1
        xori    a2, a0, 1
        addi    a3, zero, 1          # constant 1 to compare against
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        addi    a1, a1, -2
        add     a2, a2, a0
        add     a0, a0, a2
        bltu    a3, a1, .LBB0_3      # }while(1<n); fails to reuse the 2 it already had in a2 earlier
        ret
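
The table-lookup idea mentioned before the clang output could look roughly like the sketch below. The table size of 17 entries and the names fib_table and fib_with_table are arbitrary choices for illustration: small n is answered directly from the table, and larger n starts the additions from the last two table entries instead of from 0, 1.

```c
#define FIB_TABLE_LEN 17

/* F(0) .. F(16) */
static const unsigned fib_table[FIB_TABLE_LEN] = {
    0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987
};

unsigned fib_with_table(unsigned n)
{
    if (n < FIB_TABLE_LEN)
        return fib_table[n];                  /* small n: no loop at all */

    unsigned a = fib_table[FIB_TABLE_LEN - 2];    /* F(15) */
    unsigned b = fib_table[FIB_TABLE_LEN - 1];    /* F(16) */

    /* Invariant: b == F(k) at the top of each iteration. */
    for (unsigned k = FIB_TABLE_LEN - 1; k < n; k++) {
        unsigned t = a + b;
        a = b;
        b = t;
    }
    return b;
}
```

This trades a little static data for a shorter dependency chain on small inputs; the loop could of course be replaced by one of the unrolled variants.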

Depending on the cost of branching, it might be better to branch into the middle of the loop to start things off. This also means we can always start with 1, 1 when we enter the loop, instead of spending an iteration adding zeros. But that makes n==2 a special case: we need to return 1, and can't do any additions of 1+1. But 1 is one of our special-case return values, so we can tweak that path to return n != 0 and let the rest of the function assume n >= 3.

With some further optimization to minimize instruction count for RISC-V (e.g. avoiding the need to construct a constant 2 in a register, to shorten the non-tiny-n common case), I came up with this. (A _v1 version is in the Godbolt link.)

unsigned fib_unroll2_branchy_v2(unsigned n){
    if (n <= 2) 
        return n!=0;  // 0 or 1
    n -= 3;     // check for n<=2 and copy n on a machine without FLAGS
    unsigned i = n&1;

    unsigned b = 1;
    //if (n==2) return b;  // already eliminated.
    unsigned a = 1;

    if (i == 0) goto odd_entry;   // n-=3 flips the low bit, so this detects odd n
    do{
        a += b;
odd_entry:
        i += 2;
        b += a;
    }while (i <= n);  // safe even for n near uint_max because we subtracted 3 first
    return b;
}

clang doesn't do an optimal job here, wasting some copy instructions in the loop that we conditionally jump into. (Compilers often have a hard time when you do that in C, but it's sometimes a useful trick for hand-written asm.) So instead, here's a hand-written version that doesn't suck as much:

fib_unroll2_branchy_v2:
        addi    t2, a0, -3            # n -= 3  (leaving a copy of the orig)
        bgtu    t2, a0, .Lsmall_n     # if( (n-3) > n) detect wrapping, i.e. n<=2

        andi    t0, t2, 1             # i = n&1
        addi    a0, zero, 1           # b = retval, replacing orig_n
        addi    a1, zero, 1           # a
        beqz    t0, .Lodd_entry       # even i means orig_n was odd

.Lloop:                              # do{
        add     a1, a1, a0            # a += b
.Lodd_entry:
        addi    t0, t0, 2             # i += 2
        add     a0, a0, a1            # b += a
        bleu    t0, t2, .Lloop       # }while(i <= n);
        ret

.Lsmall_n:
        snez    a0, a0                # return orig_n != 0 handles n<3
        ret

There may be a few optimizations I missed. In fib_unroll2_simpler (the branchless one), it would be nice to find some ILP (instead of basically one long dependency chain, apart from the n-=2 counter), or to get a jump-start on reaching the loop termination by doing fewer iterations, instead of turning the first half of the loop into a no-op. This version just needs the final result; it doesn't need to store every Fib value along the way into an array like my x86 answer did.

Even in the branchy_v2 version, initializing i feels like a longer dep chain than we'd really like, but it'll be fine on a not-super-wide pipeline.

Answer 1
3 2020-12-19 18:01:07

Answer 2
1 2020-12-20 00:50:25