
Optimizing C code without parallel programming

I write C code which contains

for (i = 1; i < 10000; i++)
    x[i] = array1[h][x[i] ^ x[i-1]];

And

for (i = 9999; i > 0; i--)
    x[i] = x[i-1] ^ array2[h][x[i]];

Notes:

1- array1 and array2 contain byte values

2- the second loop performs the inverse of the first loop

3- h is a byte value and is the same in both loops

My question is

The second loop is faster than the first, and I think I understand why: in the first loop, every value of x depends on the new value of the previous byte, i.e. to calculate x[2] you must first calculate x[1]. In the second loop, each byte depends on the old value of the previous byte, which already exists, i.e. to calculate x[9999] you need the old value of x[9998], not the new one, so there is no need to wait for any other calculation. How is this done in C, and what is it called? Does the C language automatically apply parallel programming to loops that are not sequential, without the user writing any parallel code?

The question is: why is the second loop faster than the first?

Thanks a lot

I am a beginner in C.

Sorry if this question is too easy.

Your first loop depends on the result of previous iterations. That means that, put simply, the processor can't start thinking about i=2 until it finishes i=1, because x[2] depends on x[1]. However, the second loop does not depend on the result of the previous iterations.

Enabling compiler optimizations by adding the -O3 flag (that's a capital 'o', not a zero) may speed up both loops and bring them closer to the same speed. There are 'manual' optimizations like loop vectorization or working with wider data types that you can still implement, but try the -O3 flag first. Look at your compiler's or IDE's documentation for "compiler flags" if you don't know how to do this.

That said, it looks kind of like you're implementing some sort of encryption. In fact, this code looks like a stripped down version of a cipher like RC4. If that is what you're doing, I have a few warnings for you:

1) If you're writing encryption for production code that you depend on for security, I suggest you use something from a well-known, well-tested library rather than writing your own; it will be faster and more secure.

2) If you're writing your own encryption algorithm for production code (rather than just "for fun"), please don't. Well-studied algorithms are more secure than anything any one person can design; you don't gain anything by rolling your own.

3) If you're writing or implementing an algorithm for fun, good on you! Have a look at some real-world implementations once you finish yours, you might find some good ideas.

Most modern processors can reorder instructions and execute them out of order, based only on the readiness of their source data. Think of a pool into which the front end pours the first ~50 iterations at a steady rate (probably faster than they can execute): how many can start executing in parallel, assuming you have multiple ALUs? In some cases you may even parallelize all your code, making you bounded only by the number of execution resources (which may be very high). EDIT: it's important to note that this becomes harder with complicated control flow (for example, if you had a bunch of if conditions in your loop, especially data-dependent ones), since the processor needs to predict them and flush younger instructions when it guesses wrong.

A good compiler can also add loop unrolling and vectorization on top of that, which further increases this parallelism and the execution bandwidth that can be extracted from the CPU.

Dan is completely right about the dependency (although it's not a simple "pipeline"). In the first loop, the x[i-1] read in each iteration is recognized as aliasing the x[i] written by the previous one (by the CPU's alias detection), making it a read-after-write scenario and forcing the processor to wait and forward the result. Spanning multiple iterations, this forms a long dependency chain: while you can see iteration N, you can't execute it until N-1 is done, which waits for N-2, and so on. By the way, this can get even nastier in cases that are complicated to forward, such as cache-line-split or page-split accesses.

The second loop also uses values from other cells, but there's an important difference: in program order, the value of x[i-1] is read (to calculate x[i]) first, and x[i-1] is only written in a later iteration. This turns the read-after-write into a write-after-read, which is much simpler, since loads are performed much earlier in the pipeline than stores. Now the processor is allowed to read all the values in advance (keeping them in internal registers) and run the calculations in parallel. The writes are buffered and done at leisure, as nothing depends on them.

EDIT: Another consideration in some cases is the memory access pattern, but here it looks like a simple stream over array x (stride 1), in either the positive or negative direction. Both are easily recognized, and the prefetcher should start fetching ahead, so most of these accesses should hit the cache. The array1/array2 accesses, on the other hand, are complicated because the index is determined by the result of a load; that will also stall the program a bit, but it's the same in both cases.

    for (i = 1; i < 10000; i++)
        x[i] = array1[h][x[i] ^ x[i-1]];

Each iteration of the for loop needs to get a value from array1. Whenever a value is accessed, the data around it, typically a cache line's worth, is read and stored in the caches. Cache line sizes can differ between L1 and L2 caches; I think they are 64 bytes and 128 bytes respectively. The next time you access the same data, or data near the previous value, you have a high probability of a cache hit, which speeds up your operation by an order of magnitude.

Now, in the above for loop, x[i] ^ x[i-1] may evaluate to array indexes that do not lie within the same cache line across consecutive iterations. Let's take the L1 cache as an example. In the first iteration of the for loop, array1[h][x[i] ^ x[i-1]] is accessed, which is in main memory. The 64 bytes of data surrounding this byte value are brought in and stored in an L1 cache line. In the next iteration, x[i] ^ x[i-1] may produce an index whose value is stored at a location not within the 64 bytes brought in by the first iteration. Hence there is a cache miss, and main memory is accessed again. This may happen many times during the execution of the for loop, resulting in poor performance.

Try to see what x[i] ^ x[i-1] evaluates to in each iteration. If the values are vastly different, then part of the slowness is due to the reason stated above.

The link below nicely explains this concept.

http://channel9.msdn.com/Events/Build/2013/4-329

In both cases, you should write unsigned char * aa = array1[h]; (or array2[h] for the second loop). Note that array1[h] already decays to a pointer to its first element, so no & is needed. There's no point in hoping the compiler will hoist that index operation out of the loop when you can do it yourself and be sure.

The two loops are doing different things:

Loop 1 computes x[i] ^ x[i-1] before indexing into aa, while Loop 2 indexes aa by x[i] first and applies ^ x[i-1] afterward.

Regardless, I would use pointers for x[i] and x[i-1] , and I would unroll the loop, so Loop 1 would look something like this:

unsigned char * aa = array1[h];
unsigned char * px = &x[1];
unsigned char * px1 = &x[0];
for (i = 1; i < 10; i++){
   *px = aa[ *px ^ *px1 ]; px++; px1++;
}
for ( ; i < 10000; i += 10 ){
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
   *px = aa[ *px ^ *px1 ]; px++; px1++;
}

An alternative would be to use a single p pointer, and use hard offsets, like this:

unsigned char * aa = array1[h];
unsigned char * px = &x[0];
for (i = 1; i < 10; i++){
   px[1] = aa[ px[1] ^ px[0] ]; px++;
}
for ( ; i < 10000; i += 10, px += 10 ){
   px[ 1] = aa[ px[ 1] ^ px[0] ];
   px[ 2] = aa[ px[ 2] ^ px[1] ];
   px[ 3] = aa[ px[ 3] ^ px[2] ];
   px[ 4] = aa[ px[ 4] ^ px[3] ];
   px[ 5] = aa[ px[ 5] ^ px[4] ];
   px[ 6] = aa[ px[ 6] ^ px[5] ];
   px[ 7] = aa[ px[ 7] ^ px[6] ];
   px[ 8] = aa[ px[ 8] ^ px[7] ];
   px[ 9] = aa[ px[ 9] ^ px[8] ];
   px[10] = aa[ px[10] ^ px[9] ];
}

I'm not sure which would be faster.

Again, some people will say the compiler's optimizer would do this for you, but there's no harm in helping it along.
