I have the following code:
for (short l = j; l < j + input->w_small; l = l + 4) {
    add_b = k * input->w_big + l;
    add_s = (k - i) * input->w_small + l - j;
    __asm__ __volatile__(
        "ldr %%r1, [%1];"
        "ldr %%r2, [%2];"
        "usada8 %0, %%r1, %%r2, %0;"
        : "+r" (sad)
        : "r" (input->pic_big + add_b), "r" (input->pic_small + add_s)
        : "r1", "r2"
    );
}
This is part of an image processing algorithm. The application runs in 29.24 seconds on an RPi 1 B and in 7.65 seconds on an RPi 2 B, a 3.82x speed-up. The question is: why? I understand that there is an architectural change between the two models, but I couldn't find any reference saying that USADA8 should be significantly faster on ARMv7. Any ideas?
PS: Don't get me wrong, I am perfectly happy with the results, just being curious :)
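For readers unfamiliar with the instruction, here is a minimal portable C sketch of what each loop iteration above computes: USADA8 adds the sum of absolute differences of the four byte lanes of its two source registers to an accumulator. The names usada8_ref and sad_row_ref are hypothetical helpers made up for illustration, not part of the original program, and the sketch assumes the row width is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference for USADA8 Rd, Rn, Rm, Ra:
   returns acc plus the sum of |a_lane - b_lane| over the four byte lanes. */
static uint32_t usada8_ref(uint32_t a, uint32_t b, uint32_t acc)
{
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        acc += (x > y) ? (x - y) : (y - x);
    }
    return acc;
}

/* Hypothetical helper: SAD over one row of `width` 8-bit pixels,
   four pixels per step, mirroring the structure of the loop above. */
static uint32_t sad_row_ref(const uint8_t *big, const uint8_t *small, size_t width)
{
    assert(width % 4 == 0); /* assumption: width is a multiple of 4 */
    uint32_t sad = 0;
    for (size_t l = 0; l < width; l += 4) {
        uint32_t a, b;
        memcpy(&a, big + l, sizeof a);   /* alignment-safe 32-bit load */
        memcpy(&b, small + l, sizeof b);
        sad = usada8_ref(a, b, sad);
    }
    return sad;
}
```

Timing a plain C version like this against the inline-assembly loop on both boards could help show whether the speed-up comes from the instruction itself or from the rest of the system, such as clock speed, caches, and memory.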
There may be many reasons, but the main ones are probably (according to this):