ARM assembly memset replacement

Question

I'm new to ARM assembly, so bear with me. I'm writing a music visualization app for Android. I'm at the point where I want to work on optimizations, so right now I'm experimenting. Below is my attempt at an 8-bit memset, written as a hybrid of assembly and C.

Somewhere it is causing a crash. I am unable to attach gdb to the process because the app exits before gdb starts, so I am unable to step through the operations.

Does this look right? I've never quite wrapped my head around memory alignment, but I do know ARM is 4-byte aligned. I'm not sure whether that is a hint toward a solution. I think the hybrid approach of doing the bulk of the work in an assembly loop, then finishing off the remainder in C, takes care of any alignment issues. Am I correct in thinking this? I'm baffled about what's going wrong. This is very similar to my memcpy function, and my only issue with that one was that the clobber list was empty. Adding the registers to the clobber list fixed that function, but I can't figure out what I'm missing with this memset function.

Any hints?

/* Memset functions, 1 byte memset */
static void *mem_set8_arm (void *dest, int c, visual_size_t n)
{
    uint32_t *d = dest;
    uint8_t *dc = dest;
    uint32_t setflag32 =
        (c & 0xff) |
        ((c << 8) & 0xff00) |
        ((c << 16) & 0xff0000) |
        ((c << 24) & 0xff000000);
    uint8_t setflag8 = c & 0xff;

#if defined(VISUAL_ARCH_ARM)

    while (n >= 64) {
        __asm __volatile
        (
            "\n\t mov r4, %[flag]"
            "\n\t mov r5, r4"
            "\n\t mov r6, r4"
            "\n\t mov r7, r4"
            "\n\t stmia %[dst]!,{r4-r7}"
            "\n\t stmia %[dst]!,{r4-r7}"
        :: [dst] "r" (d), [flag] "r" (&setflag32) : "r4", "r4", "r6", "r7");

        d += 16;

        n -= 64;
    }

#endif /* VISUAL_ARCH_ARM */

    while (n >= 4) {
        *d++ = setflag32;
        n -= 4;
    }

    dc = (uint8_t *) d;

    while (n--)
        *dc++ = setflag8;

    return dest;
}

Answer 1

stmia with four registers writes 16 bytes, so doing it twice writes 32 bytes. You are adding 16 to a pointer to 32-bit values, which advances it by 64 bytes each time, so there will be holes.
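To make the arithmetic concrete, here is a small C sketch. The helper names `bytes_advanced` and `bytes_stored` are made up for illustration and are not part of the code in the question. It shows that `d += 16` on a `uint32_t *` moves 64 bytes, while two four-register stmia instructions only cover 32.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes that a uint32_t pointer moves when `words` is added to it.
 * Hypothetical helper; `words` must not exceed 32 (the size of the scratch array). */
static size_t bytes_advanced(size_t words)
{
    uint32_t buf[32];
    uint32_t *d = buf;
    uint32_t *e = d + words;   /* pointer arithmetic scales by sizeof(uint32_t) */
    return (size_t)((uint8_t *)e - (uint8_t *)d);
}

/* Bytes written by `stores` stmia instructions of `regs` 32-bit registers each. */
static size_t bytes_stored(size_t stores, size_t regs)
{
    return stores * regs * sizeof(uint32_t);
}
```

With these, `bytes_advanced(16)` is 64 and `bytes_stored(2, 4)` is 32, so a matching C-side step would be `d += 8` and `n -= 32`.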

Also, ARM does not have 32-bit immediates, but a lot of assemblers work around that by placing the constant in a data area behind the function and turning the mov into a PC-relative ldr. Check the generated assembler output to see whether that data field ended up in the middle of the instruction stream.

Also, you can just generate the 32 bit value in assembler:

mov r4, %[mask]
orr r4, r4, r4, lsl #16
orr r4, r4, r4, lsl #8

As this is an 8 bit immediate, it fits, and no ldr needs to be generated.
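For reference, the same replication can be written in plain C. This is only a sketch that mirrors the three instructions above one to one; the function name `replicate_byte` is an assumption for illustration.

```c
#include <stdint.h>

/* Copy the low byte of c into all four bytes of a 32-bit word. */
static uint32_t replicate_byte(int c)
{
    uint32_t v = (uint32_t)c & 0xffu;  /* mov r4, mask        -> 0x000000bb */
    v |= v << 16;                      /* orr ..., lsl #16    -> 0x00bb00bb */
    v |= v << 8;                       /* orr ..., lsl #8     -> 0xbbbbbbbb */
    return v;
}
```

For example, `replicate_byte(0xAB)` yields `0xABABABAB`.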

While you are at it, just pull the entire loop into assembler so you can reuse the address register. gcc is notoriously bad at optimizing routines containing inline assembler.
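A minimal sketch of what that could look like, under several assumptions: GCC/Clang-style extended asm, 32-bit ARM (non-Thumb) mode, and `size_t` standing in for `visual_size_t`. The name `mem_set8_sketch` is hypothetical. On any other target the asm block is compiled out and the plain C loops do all the work. Compared with the code in the question, the pointer and count are read-write operands (`"+r"`) because `stmia ...!` modifies the address register, the pattern is passed by value rather than by address, r5 is in the clobber list, and the count drops by 32 per pass. Leading bytes are written one at a time until the pointer is word aligned, since stmia expects a 4-byte-aligned address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; a sketch, not the library's actual routine. */
static void *mem_set8_sketch(void *dest, int c, size_t n)
{
    uint8_t *p = dest;
    uint8_t byte = (uint8_t)c;
    uint32_t pattern = byte * 0x01010101u;  /* byte copied into all four lanes */

    /* Head: single bytes until p is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }

    uint32_t *d = (uint32_t *)p;

#if defined(__arm__) && !defined(__thumb__)
    /* Bulk: the whole loop lives in one asm block, 32 bytes per pass. */
    if (n >= 32) {
        __asm__ __volatile__(
            "mov   r4, %[pat]\n\t"
            "mov   r5, r4\n\t"
            "mov   r6, r4\n\t"
            "mov   r7, r4\n"
            "1:\n\t"
            "stmia %[dst]!, {r4-r7}\n\t"
            "stmia %[dst]!, {r4-r7}\n\t"
            "sub   %[cnt], %[cnt], #32\n\t"
            "cmp   %[cnt], #32\n\t"
            "bhs   1b\n\t"
            : [dst] "+r" (d), [cnt] "+r" (n)
            : [pat] "r" (pattern)
            : "r4", "r5", "r6", "r7", "cc", "memory");
    }
#endif

    /* Remaining whole words, then remaining bytes. */
    while (n >= 4) {
        *d++ = pattern;
        n -= 4;
    }

    p = (uint8_t *)d;
    while (n--)
        *p++ = byte;

    return dest;
}
```

Calling `mem_set8_sketch(buf + 1, 0xAB, 101)` should fill exactly 101 bytes starting at `buf + 1` and leave the neighbouring bytes untouched, whichever path is compiled in.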

Answer 2

Is that a typo:

:: [dst] "r" (d), [flag] "r" (&setflag32) : "r4", "r4", "r6", "r7");

Didn't you mean "r4", "r5", "r6" ... there?

Will your self-made memset really be faster than the original memset?
