I want to convert the following function to NEON:
int dot4_c(unsigned char v0[4], unsigned char v1[4]){
    int r = 0;
    r = v0[0]*v1[0];
    r += v0[1]*v1[1];
    r += v0[2]*v1[2];
    r += v0[3]*v1[3];
    return r;
}
I think I am almost there, but there is an error somewhere because it does not give the right result:
int dot4_neon_hfp(unsigned char v0[4], unsigned char v1[4])
{
asm volatile (
"vld1.16 {d2, d3}, [%0] \n\t" //d2={x0,y0}, d3={z0, w0}
"vld1.16 {d4, d5}, [%1] \n\t" //d4={x1,y1}, d5={z1, w1}
"vcvt.32.u16 d2, d2 \n\t" //conversion
"vcvt.32.u16 d3, d3 \n\t"
"vcvt.32.u16 d4, d4 \n\t"
"vcvt.32.u16 d5, d5 \n\t"
"vmul.32 d0, d2, d4 \n\t" //d0= d2*d4
"vmla.32 d0, d3, d5 \n\t" //d0 = d0 + d3*d5
"vpadd.32 d0, d0 \n\t" //d0 = d[0] + d[1]
:: "r"(v0), "r"(v1) :
);
}
How can I get this working?
As mentioned, a full-vector NEON load reads at least 8 bytes (one D register) at a time. As long as the load doesn't go past the end of your buffer, you can load 8 bytes and ignore the extra ones. Here is how to do it with intrinsics from arm_neon.h (this is the body of the function):
uint8x8_t v0_vec, v1_vec;
uint16x8_t vproduct;
uint32x2_t vsum32;
v0_vec = vld1_u8(v0); // extra bytes will be ignored as long as you can safely read them
v1_vec = vld1_u8(v1);
// you didn't specify if the product of your vector fits in 8-bits, so I assume it needs to be widened to 16-bits
vproduct = vmull_u8(v0_vec, v1_vec);
vsum32 = vpaddl_u16(vget_low_u16(vproduct)); // pairwise add lower half (first 4 u16's)
return vget_lane_u32(vsum32, 0) + vget_lane_u32(vsum32, 1); // uint32x2_t has no .val member; extract each lane explicitly
If you absolutely can't load 8 bytes from your source pointers, you can manually load a 32-bit value into a NEON register (the 4 bytes) and then cast it to the proper intrinsic type.
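The 4-byte-only load described above could look roughly like the sketch below. The function name dot4_neon_4bytes is made up for illustration, and using memcpy plus vset_lane_u32 is just one way to get the word into a register (vld1_lane_u32 is another, if the pointer can be treated as a uint32_t pointer). The scalar branch is only there so the sketch also builds on non-ARM targets.

```c
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name not from the original post):
   dot product of two 4-byte vectors, reading exactly 4 bytes from each. */
int dot4_neon_4bytes(const unsigned char v0[4], const unsigned char v1[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32_t w0, w1;
    memcpy(&w0, v0, 4);  /* exactly 4 bytes, no alignment requirement */
    memcpy(&w1, v1, 4);

    /* Place each word in lane 0 of a zeroed D register, then view it as 8 bytes.
       Bytes 4..7 are zero, so they contribute nothing to the products. */
    uint8x8_t a = vreinterpret_u8_u32(vset_lane_u32(w0, vdup_n_u32(0), 0));
    uint8x8_t b = vreinterpret_u8_u32(vset_lane_u32(w1, vdup_n_u32(0), 0));

    uint16x8_t prod = vmull_u8(a, b);                  /* widening multiply: 8 x u16 */
    uint32x2_t sum2 = vpaddl_u16(vget_low_u16(prod));  /* {p0+p1, p2+p3} as u32 */
    return (int)(vget_lane_u32(sum2, 0) + vget_lane_u32(sum2, 1));
#else
    /* Scalar fallback for targets without NEON. */
    int r = 0;
    for (int i = 0; i < 4; ++i)
        r += v0[i] * v1[i];
    return r;
#endif
}
```

The largest possible result is 4 * 255 * 255 = 260100, so the 16-bit products and the 32-bit sums cannot overflow.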