OpenMP: Lack of Diminishing Returns with Higher Thread Count

My code has a loop that calls a Monte-Carlo function to evaluate a simple integral (y = x, from 0 to 1) for an increasing number of samples, and writes the total time and the integration value to a text file. The loop then increases the number of threads and repeats. Right now the time peaks at around 2.6 seconds at around 8 threads. The loop goes up to 64 threads, and I see no slowdown greater than 0.2 seconds, and sometimes even a speed-up.

For loop calling Monte-Carlo method, increment number of threads:

//this loop runs the main loop for 17 thread counts, stepping from 1 up to 64
    for (int j = 1; j <= 17; j++)
    {
        //tell user how many threads are running monte-carlo currently
        cout << "Program is running " << number_threads << " thread(s) currently." << endl;

        //reset values for new run
        num_of_samples = 1;
        integration_result = 0;

        //this for loop runs monte-carlo iteration_num times, doubling the sample count each round,
        //and writes the data to the text file
        for (int i = 1; i <= iteration_num; i++)
        {
            //call monte carlo function to perform integration and write values to text
            monteCarlo(num_of_samples, starting_x, end_x, number_threads);

            //increase num of samples for next test round
            num_of_samples = 2 * num_of_samples;
        } //end of second for loop

        //step number_threads: 1 -> 2, then +2 below 16, +4 from 16, +8 from 32
        if (number_threads == 1)
            number_threads = 2;
        else if (number_threads >= 32)
            number_threads += 8;
        else if (number_threads >= 16)
            number_threads += 4;
        else
            number_threads += 2;
    } //end of for loop

Parallel portion for Monte-Carlo:

int num_threads;
    double x, u, error_difference, fs = 0, integration_result = 0; //fs is a placeholder to hold added values of f(x)
    vector< vector<double>> dataHolder(number_threads, vector<double>(1)); //this vector will hold temp values of each thread

    //get start time for parallel block of code
    double start_time = omp_get_wtime();

    omp_set_dynamic(0);     // Explicitly disable dynamic teams
    omp_set_num_threads(number_threads); // Use number_threads threads for all subsequent parallel regions

#pragma omp parallel default(none) private(x, u) shared(std::cout, end_x, starting_x, num_of_samples, fs, number_threads, num_threads, dataHolder)
    {
        int i, id, nthrds;
        double temp = fs;

        //define thread id and num of threads
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();

        //initialize random seed
        srand(id * time(NULL) * 1000);

        //thread 0 records the actual team size for the summation below
        if(id == 0)
            num_threads = nthrds;

        //this for loop will calculate a temp value for fs for each thread
        for (int i = id; i < num_of_samples; i = i + nthrds)
        {
            //assign random number under integration from 0 to 1
            u = fRand(0, 1); //random number between 0 and 1
            x = starting_x + (end_x - starting_x) * u;

            //this line of code is from Monte_Carlo Method by Alex Godunov (February 2007)
            //evaluate the integrand at x and add it to this thread's local sum
            temp += function(x);
        }

        //place temp inside vector dataHolder
        dataHolder[id][0] = temp;

        //no thread will go beyond this barrier until task is complete
#pragma omp barrier

        //one thread will do this task
#pragma omp single
        {
            //add summations to calc fs
            for(i = 0, fs = 0.0; i < num_threads; i ++)
                fs += dataHolder[i][0];
        } //implicit barrier here, wait for all tasks to be done
    }//end of parallel block of code

After applying the same kind of parallelization to a simple Monte-Carlo walk with light scattering, I could clearly see the diminishing returns. I think they are absent here because the integration is so simple that each thread has very little work to do on its own, so the threading overhead stays relatively small. If anyone has other information that would be useful for this problem, please feel free to post it. Otherwise I will accept this as my answer.
