简体   繁体   中英

ARM assembly: can’t find a register in class ‘GENERAL_REGS’ while reloading ‘asm’

I am trying to implement a function which multiplies 32-bit operand with 256-bit operand in ARM assembly on ARM Cortex-a8. The problem is I am running out of registers and I have no idea how I can reduce the number of used registers here. Here is my function:

typedef struct UN_256fe{

uint32_t uint32[8];

}UN_256fe;

typedef struct UN_288bite{

uint32_t uint32[9];

}UN_288bite;
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res){

asm (

        "umull          r3, r4, %9, %10;\n\t"
        "mov            %0, r3;         \n\t"/*res->uint32[0] = r3*/
        "umull          r3, r5, %9, %11;\n\t"
        "adds           r6, r3, r4;     \n\t"/*res->uint32[1] = r3 + r4*/
        "mov            %1, r6;         \n\t"
        "umull          r3, r4, %9, %12;\n\t"
        "adcs           r6, r5, r3;     \n\t"
        "mov            %2, r6;         \n\t"/*res->uint32[2] = r6*/
        "umull          r3, r5, %9, %13;\n\t"
        "adcs           r6, r3, r4;     \n\t"
        "mov            %3, r6;         \n\t"/*res->uint32[3] = r6*/
        "umull          r3, r4, %9, %14;\n\t"
        "adcs           r6, r3, r5;     \n\t"
        "mov            %4, r6;         \n\t"/*res->uint32[4] = r6*/
        "umull          r3, r5, %9, %15;\n\t"
        "adcs           r6, r3, r4;     \n\t"
        "mov            %5, r6;         \n\t"/*res->uint32[5] = r6*/
        "umull          r3, r4, %9, %16;\n\t"
        "adcs           r6, r3, r5;     \n\t"
        "mov            %6, r6;         \n\t"/*res->uint32[6] = r6*/
        "umull          r3, r5, %9, %17;\n\t"
        "adcs           r6, r3, r4;     \n\t"
        "mov            %7, r6;         \n\t"/*res->uint32[7] = r6*/
        "adc            r6, r5, #0 ;    \n\t"
        "mov            %8, r6;         \n\t"/*res->uint32[8] = r6*/

        : "=r"(res->uint32[8]), "=r"(res->uint32[7]), "=r"(res->uint32[6]), "=r"(res->uint32[5]), "=r"(res->uint32[4]),
           "=r"(res->uint32[3]), "=r"(res->uint32[2]), "=r"(res->uint32[1]), "=r"(res->uint32[0])
         : "r"(A), "r"(B->uint32[7]), "r"(B->uint32[6]), "r"(B->uint32[5]),
           "r"(B->uint32[4]), "r"(B->uint32[3]), "r"(B->uint32[2]), "r"(B->uint32[1]), "r"(B->uint32[0]), "r"(temp)
         : "r3", "r4", "r5", "r6", "cc", "memory");

}

EDIT-1: I updated my clobber list based on the first comment, but I still get the same error

A simple solution is to break this up and don't use 'clobber'. Declare the variables as 'tmp1', etc. Try not to use any mov statements; let the compiler do this if it has to. The compiler will use an algorithm to figure out the best 'flow' of information. If you use 'clobber', it can not reuse registers. They way it is now, you make it load all the memory first before the assembler executes. This is bad as you want memory/CPU ALU to pipeline.

void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res) 
{

  uint32_t mulhi1, mullo1;
  uint32_t mulhi2, mullo2;
  uint32_t tmp;

  asm("umull          %0, %1, %2, %3;\n\t"
       : "=r" (mullo1), "=r" (mulhi1)
       : "r"(A), "r"(B->uint32[7])
  );
  res->uint32[8] = mullo1; /* was 'mov %0, r3; */
  volatile asm("umull          %0, %1, %3, %4;\n\t"
      "adds           %2, %5, %6;     \n\t"/*res->uint32[1] = r3 + r4*/
     : "=r" (mullo2), "=r" (mulhi2), "=r" (tmp)
     : "r"(A), "r"(B->uint32[6]), "r" (mullo1), "r"(mulhi1)
     : "cc"
    );
  res->uint32[7] = tmp; /* was 'mov %1, r6; */
  /* ... etc */
}

The whole purpose of the 'gcc inline assembler' is not to code assembler directly in a 'C' file. It is to use the register allocation logic of the compiler AND do something that can not be easily done in 'C'. The use of carry logic in your case.

By not making it one huge 'asm' clause, the compiler can schedule the loads from memory as it needs new registers. It will also pipeline your 'UMULL' ALU activity with the load/store unit.

You should only use clobber if an instruction implicitly clobbers a specific register. You may also use something like,

register int *p1 asm ("r0");

and use that as an output. However, I don't know of any ARM instructions like this besides those that might alter the stack and your code doesn't use these and the carry of course.

GCC knows that memory changes if it is listed as an input/output, so you don't need a memory clobber. In fact it is detrimental as the memory clobber is a compiler memory barrier and this will cause memory to be written when the compiler might be able to schedule that for latter.


The moral is use gcc inline assembler to work with the compiler. If you code in assembler and you have huge routines, the register use can become complex and confusing. Typical assembler coders will keep only one thing in a register per routine, but that is not always the best use of registers. The compiler will shuffle the data around in a fairly smart way that is difficult to beat (and not very satisfying to hand code IMO) when the code size gets larger.

You might want to look at the GMP library which has lots of ways to efficiently tackle some of the same issues it looks like your code has.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM