
x86 assembly instruction optimisation

I'm trying to optimize a block of instructions in a loop, called thousands of times, which is the bottleneck in my algorithm.

This block of code computes the product of N 3x3 matrices (the iA array) with N 3-component vectors (the iV array) and stores the N results in the oV array. (N is not fixed and is usually between 3000 and 15000.)

Each matrix row and each vector is 128-bit aligned (4 floats) to exploit SSE optimisation (the 4th float value is ignored).
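For reference, a minimal sketch of how such aligned buffers might be allocated; the original code below only shows the casts, so this allocation scheme is an assumption:

  #include <xmmintrin.h>   // _mm_malloc / _mm_free (in <malloc.h> on MSVC)

  // N matrices of 3 rows x 4 floats, plus N input and N output vectors
  // of 4 floats each, all on 16-byte boundaries so movaps can be used.
  float* iA = (float*)_mm_malloc(N * 3 * 4 * sizeof(float), 16);
  float* iV = (float*)_mm_malloc(N * 4 * sizeof(float), 16);
  float* oV = (float*)_mm_malloc(N * 4 * sizeof(float), 16);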

C++ code:

  __m128* ip = (__m128*)iV;   // input vectors, one __m128 (4 floats) each
  __m128* op = (__m128*)oV;   // output vectors
  __m128* A  = (__m128*)iA;   // matrices, three __m128 rows each

  __m128 res1, res2, res3;
  int i;

  for (i = 0; i < N; i++)
  {
    // Masks 0x71/0x72/0x74: multiply elements 0..2 and place the dot
    // product in element 0, 1 or 2 of the result (other lanes zeroed).
    res1 = _mm_dp_ps(*A++, *ip, 0x71);
    res2 = _mm_dp_ps(*A++, *ip, 0x72);
    res3 = _mm_dp_ps(*A++, *ip++, 0x74);

    // OR the three single-lane results together into the output vector.
    *op++ = _mm_or_ps(res1, _mm_or_ps(res2, res3));
  }

The compiler generates these instructions:

000007FEE7DD4FE0  movaps      xmm2,xmmword ptr [rsi]               //load *ip into a register
000007FEE7DD4FE3  movaps      xmm1,xmmword ptr [rdi+10h]           //load second row of A
000007FEE7DD4FE7  movaps      xmm0,xmmword ptr [rdi+20h]           //load third row of A
000007FEE7DD4FEB  inc         r11d                                 //i++
000007FEE7DD4FEE  add         rbp,10h                              //op++
000007FEE7DD4FF2  add         rsi,10h                              //ip++
000007FEE7DD4FF6  dpps        xmm0,xmm2,74h                        //dot product of 3rd row of A with *ip
000007FEE7DD4FFC  dpps        xmm1,xmm2,72h                        //dot product of 2nd row of A with *ip
000007FEE7DD5002  orps        xmm0,xmm1                            //"merge" the two dot-product results
000007FEE7DD5005  movaps      xmm3,xmmword ptr [rdi]               //load first row of A
000007FEE7DD5008  add         rdi,30h                              //A += 3
000007FEE7DD500C  dpps        xmm3,xmm2,71h                        //dot product of 1st row of A with *ip
000007FEE7DD5012  orps        xmm0,xmm3                            //"merge" in the remaining result
000007FEE7DD5015  movaps      xmmword ptr [rbp-10h],xmm0           //store the result to *op
000007FEE7DD5019  cmp         r11d,dword ptr [rbx+28h]             //compare i against N
000007FEE7DD501D  jl          MyFunction+370h (7FEE7DD4FE0h)       //loop

I'm not very familiar with low-level optimisation, so the question is: do you see any possible optimisations if I write the assembly code myself?

For example, will it run faster if I replace:

add         rbp,10h
movaps      xmmword ptr [rbp-10h],xmm0

with

movaps      xmmword ptr [rbp],xmm0
add         rbp,10h

I have also read that the ADD instruction is faster than INC (reportedly because INC updates only some of the flags, which can create a partial-flags dependency)...

Calculating an indirect address with an offset, such as rbp-10h, is very cheap, because there is special hardware for this sort of calculation in the address generation unit (AGU).

There is, however, a dependency between the add rbp,10h and the store to [rbp-10h], which could in principle cause a problem, but I doubt it in this particular case, since there is a long distance between the add and its use. The compiler is probably placing the add that far up because it's "free" at that point: the processor is waiting for the data read earlier from memory to arrive in the xmm registers. In other words, any work we can stick between the reads of xmm0, xmm1 and xmm2 at the beginning of the loop and the dpps instructions that use them is beneficial, because the processor will be waiting for that data to "arrive" before it can compute the dpps results.

I've done lots of x86 assembly optimizations, and I can tell you it was a great learning experience. It taught me a lot about how compilers work as well, and the biggest thing I learned was that compilers are in general pretty good at what they do. I know that's a flippant comment, but it is true...

I also learned that optimizations you make can have a positive result on one processor family, and a negative result on another processor family. Things like pipelining, branch prediction, and processor cache play a huge role... so unless you're targeting a very specific hardware configuration, be careful about assumptions regarding improvements you make.

To your specific question about reordering the add to remove the rbp-10h offset... it looks like an obvious improvement, and you'd have to verify by reading the instruction manual, but I'd guess the -10h memory offset comes for free in that instruction. Moving the add may also throw off a pipelined instruction and actually cost a clock cycle. You'd have to experiment.

There are a few things you could do to the above code to improve it. Generally, using a value after it has been altered incurs a processor stall as it waits for the result. So these lines would incur a penalty:-

add         rbp,10h
movaps      xmmword ptr [rbp-10h],xmm0

but in the code snippet above, those two lines are quite far apart, so that isn't really an issue. As others have already said, the rbp-10h offset is 'free', in that the address calculation hardware handles it.

You could move the movaps xmm3,xmmword ptr [rdi] up a line and maybe rearrange a couple of other lines.

Would it be worth it?

NO

You'd be lucky to see any real performance gain from any of this because your algorithm is

<blink> memory bandwidth limited </blink>*

which means that the time taken to read the data from RAM into the CPU is greater than the time the CPU takes to do its processing. At worst, reading a memory address can involve a page fault and a disk read. The prefetch instructions won't help either; it's called 'Streaming SIMD Extensions' because it is optimised for streaming data into the CPU (the memory interface can handle four separate streams, IIRC).

If you were doing a lot of computation on a small set of data (an FFT, perhaps) then you could gain a lot from hand-crafting the assembler. But your algorithm is quite simple, so the CPU idles a lot of the time waiting for data to arrive. If you know the speed of your RAM, you can work out the maximum throughput of the algorithm and use that as a baseline to compare against what it currently achieves (you'll never reach the theoretical maximum, though).
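As a rough worked example (the bandwidth figure is an assumption; substitute a measured value for your machine):

  // Back-of-the-envelope lower bound on the loop's run time, assuming
  // a sustained memory bandwidth of 10 GB/s (a made-up figure).
  const double bytesPerIter = 3 * 16   // three matrix rows read
                            + 16       // one input vector read
                            + 16;      // one output vector written
  const double bandwidth    = 10.0e9;  // bytes per second (assumed)
  const int    N            = 15000;
  const double minSeconds   = N * bytesPerIter / bandwidth;
  // => about 0.12 ms per call; no amount of instruction shuffling
  //    can beat this if the data isn't already in cache.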

There are things you can do to minimise the memory stalling, but they are higher-level changes rather than fiddling with individual instructions (often, optimising the algorithm gets better results). The simplest is to double-buffer the input data. Divide the register set into two groups (possible here, as you're only using four of the SIMD registers); a sketch in intrinsics follows the pseudocode below:-

  load set 1
mainloop:
  load set 2
  do processing on set 1
  save set 1 result
  load set 1
  do processing on set 2
  save set 2 result
  goto mainloop
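Here is a minimal sketch of that idea in C++ intrinsics, unrolled by two so the loads for one vector are issued while the other is still being processed (untested; it assumes N is even, and the function name is made up):

  #include <smmintrin.h>   // SSE4.1, for _mm_dp_ps

  // Assumes the same 16-byte-aligned layout as the original code.
  // Handle a trailing odd element separately if N can be odd.
  void multiplyBatch(const float* iA, const float* iV, float* oV, int N)
  {
      const __m128* A  = (const __m128*)iA;   // 3 rows per matrix
      const __m128* ip = (const __m128*)iV;
      __m128*       op = (__m128*)oV;

      for (int i = 0; i < N; i += 2)
      {
          __m128 v0 = ip[i];       // issue both input loads up front...
          __m128 v1 = ip[i + 1];
          const __m128* M = A + 3 * i;

          // ...then compute while the second load is still in flight.
          op[i]     = _mm_or_ps(_mm_dp_ps(M[0], v0, 0x71),
                      _mm_or_ps(_mm_dp_ps(M[1], v0, 0x72),
                                _mm_dp_ps(M[2], v0, 0x74)));
          op[i + 1] = _mm_or_ps(_mm_dp_ps(M[3], v1, 0x71),
                      _mm_or_ps(_mm_dp_ps(M[4], v1, 0x72),
                                _mm_dp_ps(M[5], v1, 0x74)));
      }
  }

Whether this helps at all depends on how much of this overlap the out-of-order core already finds by itself; measure before and after.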

Hopefully that's given you some ideas. Even if it doesn't go much faster, it's still an interesting exercise and you can learn a lot from it.

  • RIP blink.
