vector shift using pointers

Question

I'm in the process of optimizing my code using SSE3. At one point in the code I have to shift all of the elements in a vector by one element:

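// v is a char* and N is the number of elements it points to
for (int i = N - 1; i > 0; i--) {
    v[i] = v[i-1];   // walk backward so each byte is read before it is overwritten
}
v[0] = 0;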

As far as I can tell, SSE doesn't support vector shifting, so I'll have to code this one from scratch.

But then I had an idea: what if I just decrement the pointer?

v = (v-1); 
v[0] = 0;

This way, the shift takes constant time and doesn't copy any elements at all.

I've already tested this and it works for my test program.
However, I'm not sure that this operation is safe.

Is this a really dumb idea?

Answer 1

SSE does support shifting: both bitwise shifting of the elements inside a vector and shifting of whole registers along byte boundaries.

Assuming your vector holds 16 uint8_t elements, the operation you are looking for is

psrldq xmm, 1      ;packed shift right logical double quad word

with the intrinsic

vec = _mm_srli_si128(vec, 1);   // shift by 1 byte
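To make the direction of the byte shift concrete, here is a minimal sketch using SSE2 intrinsics. The function names are hypothetical, chosen only for illustration. It assumes the 16 bytes are loaded from memory with `_mm_loadu_si128`, so array element 0 ends up in the lowest byte of the register. Under that assumption, `_mm_srli_si128` moves each element to the next lower index, while `_mm_slli_si128` is the one that produces the `v[i] = v[i-1]` pattern from the question.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_srli_si128, _mm_slli_si128 */
#include <stddef.h>
#include <stdint.h>

/* out[i] = in[i+1] for i < 15, out[15] = 0 (psrldq by one byte). */
void shift_toward_lower(const uint8_t in[16], uint8_t out[16])
{
    assert(in != NULL && out != NULL);
    __m128i vec = _mm_loadu_si128((const __m128i *)in);
    vec = _mm_srli_si128(vec, 1);
    _mm_storeu_si128((__m128i *)out, vec);
}

/* out[0] = 0, out[i] = in[i-1] for i >= 1 (pslldq by one byte). */
void shift_toward_higher(const uint8_t in[16], uint8_t out[16])
{
    assert(in != NULL && out != NULL);
    __m128i vec = _mm_loadu_si128((const __m128i *)in);
    vec = _mm_slli_si128(vec, 1);
    _mm_storeu_si128((__m128i *)out, vec);
}
```

Both shifts fill the vacated byte with zero, so no separate `v[0] = 0` step is needed within a single register.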

To your first question: As long as v is a pointer to char, decrementing or incrementing it works in practice on common platforms (strictly speaking, C and C++ only define pointer arithmetic that stays within an array or one past its end). Dereferencing it may not be safe; that depends on your program.

To your second question: Yes, it looks like a dumb idea. If you try to optimize with SSE and you perform some tasks with pointers to bytes, you are most likely doing something wrong, and you are asking for trouble if you try to load 16 bytes from v into an SSE register: either segfaults because of misalignment, or a performance penalty because the compiler is forced to use movdqu.

Answer 2

Simplest answer: instead of the loop you posted, use memmove(v+1, v, N-1). This is likely to run as fast as hand-coded assembly on any decent system, because it is hand-coded assembly, using the proper mix of movdqu/movdqa/movntdqa and loop unrolling.
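A minimal sketch of that `memmove` replacement, wrapped in a hypothetical helper (the name `shift_right_by_one` is not from the question). `memmove` is used rather than `memcpy` because the source and destination ranges overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shift the n bytes at v one position toward higher indices:
   v[i] becomes the old v[i-1] for i >= 1, and v[0] becomes 0.
   The old last byte is dropped. */
void shift_right_by_one(char *v, size_t n)
{
    assert(v != NULL);
    if (n == 0)
        return;
    memmove(v + 1, v, n - 1);  /* overlapping copy is well defined for memmove */
    v[0] = 0;
}
```

For example, applying it to the five bytes `{1, 2, 3, 4, 5}` leaves `{0, 1, 2, 3, 4}`.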

More complicated answer: I think, looking at the bigger picture, that it is very unlikely you actually need to shift the data. Much more likely, you need to access a neighboring element together with the current element, for example to do some kind of calculation on both v[i] and v[i-1].

If you are using SIMD code to do that, the standard technique is to (for example) load bytes 0..15 into xmm0, 16..31 into xmm1, and then shuffle both registers to end up with elements 1..16 in xmm2. Then you can do the calculation with xmm0 (here corresponding to vectorized v[i-1]) and xmm2 (vectorized v[i]). This is not "shift" in the sense of logical/arithmetic shift, but rather a SIMD lane shift.

Example: working with bytes in assembly

movdqa mem, xmm0 // load bytes 0..15
loop:
// increment mem by 16
movdqa mem, xmm1 // load bytes 16..31
movdqa xmm0, xmm2 // make a copy
movdqa xmm1, xmm3 // make a copy
psrldq xmm2, 1 // ends up with bytes 1..15 and a zero
pslldq xmm3, 15 // ends up with zeros and byte 16
por xmm2, xmm3 // ends up with bytes 1..16
// do something with xmm3 and xmm0 here, they contain bytes 1..16 and 0..15 respectively
// in other words, xmm3 is xmm0 lane-shifted by one byte
movdqa xmm1, xmm0 // use our copy of bytes 16..31 to continue the loop
// goto loop
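The same lane-shift idea can be sketched with SSE2 intrinsics. This is an illustration under assumptions, not the answerer's code: the helper names and the choice of computing the difference `v[i+1] - v[i]` (wrapping modulo 256) are hypothetical, and unaligned loads are used so the sketch works for any pointer.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Given lo = bytes 0..15 and hi = bytes 16..31, return bytes 1..16. */
static __m128i lane_shift_by_one(__m128i lo, __m128i hi)
{
    __m128i a = _mm_srli_si128(lo, 1);   /* bytes 1..15 followed by a zero */
    __m128i b = _mm_slli_si128(hi, 15);  /* fifteen zeros followed by byte 16 */
    return _mm_or_si128(a, b);
}

/* diff[i] = v[i+1] - v[i] for 0 <= i < n-1; diff must hold n-1 bytes. */
void adjacent_diff(const uint8_t *v, uint8_t *diff, size_t n)
{
    assert(v != NULL && diff != NULL);
    size_t i = 0;
    if (n >= 32) {
        __m128i cur = _mm_loadu_si128((const __m128i *)v);
        for (; i + 32 <= n; i += 16) {
            __m128i next = _mm_loadu_si128((const __m128i *)(v + i + 16));
            __m128i shifted = lane_shift_by_one(cur, next);  /* v[i+1..i+16] */
            _mm_storeu_si128((__m128i *)(diff + i), _mm_sub_epi8(shifted, cur));
            cur = next;  /* reuse the second block in the next iteration */
        }
    }
    for (; i + 1 < n; i++)  /* scalar tail */
        diff[i] = (uint8_t)(v[i + 1] - v[i]);
}
```

Each 16-byte block is loaded from memory only once, and the data itself is never moved in memory.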

Why not do this: "what if I just decrement the pointer ... v = (v-1);"

This is undefined behavior and may crash or corrupt the heap:

char* v = (char*)malloc(...);
v=(v-1);
v[0] = 0; // or any read or write of v[0]

If v points to somewhere in the middle of (not the beginning of) a block of allocated memory, then decrement will work fine, but you have to have a way of being sure that is always the case (for example, the memory is allocated in the same function that will use this trick).
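One way to guarantee that condition is to allocate spare bytes in front of the data. The sketch below assumes a hypothetical helper `alloc_with_headroom`; it returns a pointer into the middle of the block and hands back the base pointer separately, since only the base pointer may be passed to `free`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n usable bytes preceded by `headroom`
   spare bytes. Returns a pointer to the usable region, or NULL on failure.
   *base receives the pointer that must later be passed to free(). */
char *alloc_with_headroom(size_t n, size_t headroom, char **base)
{
    assert(base != NULL);
    *base = calloc(headroom + n, 1);  /* zero-initialized block */
    if (*base == NULL)
        return NULL;
    return *base + headroom;
}
```

With one byte of headroom, `v = v - 1; v[0] = 0;` stays inside the allocation exactly once; every further decrement needs another byte of headroom.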

Answer 3

Decrementing the pointer will first cause an out-of-bounds access on the 0th element, and it will misalign your vector. Vector operations expect data to be properly aligned to be performant. If the data is not aligned, the instruction scheduler has to split the read from memory into two fetches, losing you some performance.

SSE offers bit shift operations on whole vectors; see @hirschhornsalz's answer.
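To illustrate the alignment point, here is a small sketch with hypothetical helper names. It checks whether a pointer is 16-byte aligned and picks the matching SSE2 load: `_mm_load_si128` requires an aligned address, while `_mm_loadu_si128` accepts any address. A pointer that was 16-byte aligned before `v = v - 1` is no longer aligned afterwards.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p is a multiple of 16. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Load 16 bytes from p, using the aligned load only when it is valid. */
__m128i load_16_bytes(const uint8_t *p)
{
    assert(p != NULL);
    if (is_aligned_16(p))
        return _mm_load_si128((const __m128i *)p);   /* aligned load */
    return _mm_loadu_si128((const __m128i *)p);      /* unaligned load */
}
```

In real code the branch is usually avoided by keeping the data aligned in the first place, or by using the unaligned load everywhere.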

3 answers

solution1
4 ACCEPTED 2012-11-16 08:23:20

solution2
2 2012-11-16 09:41:50

solution3
0 2012-11-16 10:18:00
