简体   繁体   中英

C++ Multithreading nested for loops

First off, I know very little about multithreading and I am having troubles finding how the best way to optimize this code, but multithreading seems the path I should be on.

double
applyFilter(struct Filter *filter, cs1300bmp *input, cs1300bmp *output)
{
    long long cycStart, cycStop;

    cycStart = rdtscll();

    output -> width = input -> width;
    output -> height = input -> height;

    int temp1 = output -> width;
    int temp2 = output -> height;

    int width=temp1-1;
    int height=temp2 -1;
    int getDivisorVar= filter -> getDivisor();  
    int t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;

    int keep0= filter -> get(0,0);
    int keep1= filter -> get(1,0);
    int keep2= filter -> get(2,0);
    int keep3= filter -> get(0,1);
    int keep4= filter -> get(1,1);
    int keep5= filter -> get(2,1);
    int keep6= filter -> get(0,2);
    int keep7= filter -> get(1,2);
    int keep8= filter -> get(2,2);


    //Declare variables before the loop
    int plane, row, col;    

    for (plane=0; plane < 3; plane++) {
        for(row=1; row < height ; row++) {
            for (col=1; col < width; col++) {

                t0 = (input -> color[plane][row - 1][col - 1]) * keep0;
                t1 = (input -> color[plane][row][col - 1]) * keep1;
                t2 = (input -> color[plane][row + 1][col - 1]) * keep2;
                t3 = (input -> color[plane][row - 1][col]) * keep3;
                t4 = (input -> color[plane][row][col]) * keep4;
                t5 = (input -> color[plane][row + 1][col]) * keep5;
                t6 = (input -> color[plane][row - 1][col + 1]) * keep6;
                t7 = (input -> color[plane][row][col + 1]) * keep7;
                t8 = (input -> color[plane][row + 1][col + 1]) * keep8;

                // NEW LINE HERE

                t9 = t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
                t9 = t9 / getDivisorVar;

                if ( t9 < 0 ) {
                    t9 = 0;
                }

                if ( t9  > 255 ) {
                    t9 = 255;
                } 

                output -> color[plane][row][col] = t9;
            } ....

All of this code most likely isn't necessary, but it does provide some context. So because the first of the 3 "for" loop only goes from 0-2 I was hoping there was a way I could thread the bottom two "for" loops to all be running at the same time for a different "plane" value. Is this even possible? And if so, would it actually make my program faster?

I would also look into OpenMP. It is a great library that allows for threading in a VERY simple manner using pragmas. OpenMP is compileable on many platforms, you just have to make sure yours supports it!

I have a set of code that had 8 levels of for loops, and it threaded it very nicely.

Yes, it's perfectly possible. In this case, you should event get away without worrying about access synchronisation (ie race conditions), as both threads would be operating on different sets of data.

This would definitely speed up your code on a multicore machine.

You might want to have a look at std::thread (if you're ok with c++ 11) for cross platform threading implementation (since you haven't specified your target platform). Or better with threading support library

You may also think about detecting number of cores and launch appropriate number of threads, as in threadcount = min(planes, cores) and provide each worker function with access to single plane's set of data.

It certainly looks like you could break this into thread and you would probably see a good speed increase. However, your compiler will already be trying to unroll the loop for you and gain parallelism by vectorizing instructions. Your gains may not be as much as you suspect, especially if you are saturating the memory bus with reads from disparate locations.

What you might consider is, if this is a 2d graphics operation, try and use OpenGL or similar as it will leverage hardware on your system, and that has some parallelism built into it.

Threaded version of the code will be slower than simple implementation. Because in the threaded version there will be much time spend on synchronisation. Also in threaded version you will have cache performance drawback.

Also it is high probability, that outer for loop with 3 passes will be unrolled by complier and will be executed in parallel.

You can try to make threaded version and compare performance. Anyway it will be usefull experience.

For a situation like this you could do worse than use a compiler that automatically turns for loops into threads.

With code like that the compiler can determine whether or not there is any inter-iteration data dependency. If not then it knows that it can safely split the for loop(s) across multiple threads, putting bog standard thread syncing at the end. Normally such a compiler is able to insert code that determines at run time whether the overhead of having the threads is going to be outweighed by the benefits.

The only thing is, do you have a compiler that does it? If so then its by far the easiest way to get the benefits of threads for straightforward, almost overt parallelism like this.

I know that Sun's C compiler does it (I think they were one of the earliest to do this. It might be on only the Solaris version of their compiler). I think that Intel's compiler can too. I have doubts about GCC (though I'd be very happy to be corrected on that point), and I'm not too sure about Microsoft's compiler.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM