
MPI with C slower if more processes are used

I am learning MPI with C and wrote code based on the code presented at this link: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

In this code, a vector containing 1e8 values is summed. However, I observed that the run time gets longer as more processes are used. The code is below:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which splits a vector and sends pieces of it to other processes.
If the main vector does not split evenly among all processes, the leftover is passed to process id 1.
Process id 0 is the root process, so it does not count as a worker when passing information.

Each process calculates the partial sum of its vector piece and sends it back to the root process, which calculates the total sum.
Since the processes are independent, the printing order will be different on each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec1[vec_len];
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int num_2_send, num_2_recv, start_point, vec_size, rows_per_proc, leftover;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){
        // Root process: Define vector size, how to split vector and send information to workers
        vec_size = 1e8; // size of main vector

        for(i = 0; i < vec_size; i++){
            //vec1[i] = pow(-1.0,i+2)/(2.0*(i+1)-1.0); // defining main vector...  Correct answer for total sum = 0.78539816339
            vec1[i] = pow(i,2)+1.0; // defining main vector... 
            //printf("Main vector position %d: %f\n", i, vec1[i]); // uncomment if youwhish to print the main vector
        }

        rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
        rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // spliting and sending the values
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_send = rows_per_proc + leftover; // counting the amount of data to be sent.
                start_point = (an_id - 1)*num_2_send; // defining initial position in the main vector (data will be sent from here)
            }
            else{
                num_2_send = rows_per_proc;
                start_point = (an_id - 1)*num_2_send + leftover; // starting point for other processes if there is leftover.
            }
            
            ierr = MPI_Send(&num_2_send, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of how much data is going to each worker.
            ierr = MPI_Send(&vec1[start_point], num_2_send, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD); // sending pieces of the main vector.
        }

        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);

    }
    else{
        // Workers: define which operation will be carried out by each one
        ierr = MPI_Recv(&num_2_recv, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the information of how much data the worker must expect.
        ierr = MPI_Recv(&vec2, num_2_recv, MPI_DOUBLE, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving main vector pieces.
        
        partial_sum = 0;
        for(i=0; i < num_2_recv; i++){
            //printf("Position %d from worker id %d: %d\n", i, my_id, vec2[i]); // uncomment if youwhish to print position, id and value of splitted vector
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
        
    }

    ierr = MPI_Finalize();
    return 0;
}

Observation: compile it as

mpicc -o vector_sum vector_send.c -lm

and run it as:

time mpirun -n x vector_sum 

With x = 25 you will see that it takes more time to run than with x = 5.

Am I doing something wrong? I did not expect it to be slower, since the sum of each piece is independent. Or is it a problem with how the program sends the information to each process? It seems to me that the loop which sends the information to each process is the cause of this longer run time.

First of all, you need to use a reduction rather than individual send/receive pairs. Second, since you do a send and receive for each process one after another, the time will scale linearly with the number of processes, because each process only gets its data once the previous one has finished.

If you insist on using send and receive, you need to use the non-blocking versions.
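
As an illustration of that suggestion, here is a minimal sketch (not code from the question or the answer) of a non-blocking distribution: the root posts all of its MPI_Isend() calls up front and then waits on them with MPI_Waitall(), instead of blocking on one MPI_Send() per worker. The file name, tag values, vector length and the assumption that the vector splits evenly among the workers are illustrative choices, not taken from the original program.

/*
A minimal sketch of non-blocking distribution with MPI_Isend()/MPI_Waitall().
The names, sizes and tags here are illustrative assumptions, not taken from the question.

compile as: mpicc -o isend_sketch isend_sketch.c
run as: mpirun -n 4 ./isend_sketch   (needs at least 2 processes)
*/

#include<stdio.h>
#include<stdlib.h>
#include<mpi.h>

#define N 1000000 // illustrative vector length; assumed to divide evenly among the workers

int main(int argc, char* argv[]){
    int my_id, num_proc;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(num_proc < 2){ // this sketch needs a root plus at least one worker
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int workers = num_proc - 1;   // rank 0 only distributes and collects
    int chunk = N / workers;      // even split assumed for brevity

    if(my_id == 0){
        double *vec = malloc(N * sizeof(double));
        for(int i = 0; i < N; i++) vec[i] = i + 1.0; // some data to distribute

        // post every send at once instead of waiting for each one to complete
        MPI_Request *reqs = malloc(workers * sizeof(MPI_Request));
        for(int an_id = 1; an_id < num_proc; an_id++){
            MPI_Isend(&vec[(an_id - 1)*chunk], chunk, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD, &reqs[an_id - 1]);
        }
        MPI_Waitall(workers, reqs, MPI_STATUSES_IGNORE); // let all sends progress before continuing

        double sum = 0, partial_sum;
        for(int an_id = 1; an_id < num_proc; an_id++){
            MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial_sum;
        }
        printf("Total sum = %f.\n", sum);

        free(reqs);
        free(vec);
    }
    else{
        double *buf = malloc(chunk * sizeof(double));
        MPI_Recv(buf, chunk, MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double partial_sum = 0;
        for(int i = 0; i < chunk; i++) partial_sum += buf[i];

        MPI_Send(&partial_sum, 1, MPI_DOUBLE, 0, 4321, MPI_COMM_WORLD);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Whether the non-blocking sends actually overlap depends on the MPI implementation and the interconnect; the versions below, which generate the data locally or use MPI_Reduce(), avoid the distribution cost altogether.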

As suggested by Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet), I modified the code to generate the vector pieces in each process instead of passing them from the root process. It worked: now the elapsed time is shorter when more processes are used. I am posting the new code below:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculates the sum of a vector using parallel computation.
If the main vector does not split evenly among all processes, the leftover is handled by process id 1.
Process id 0 is the root process, so it does not count as a worker when passing information.

Each process will generate and calculate the partial sum of its vector piece and send it back to the root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different on each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.

Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){

        vec_size = 1e8; // defining main vector size

        rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
        rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // defining the number of data and position corresponding to main vector
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_gen = rows_per_proc + leftover; // counting the amount of data to be generated.
                start_point = (an_id - 1)*num_2_gen; // defining corresponding initial position in the main vector.
            }
            else{
                num_2_gen = rows_per_proc;
                start_point = (an_id - 1)*num_2_gen + leftover; // defining corresponding initial position in the main vector for other processes if there is leftover.
            }

            ierr = MPI_Send(&num_2_gen, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of how much data must be generated.
            ierr = MPI_Send(&start_point, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of the initial position in the main vector.
        }
        
        
        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);
        
    }
    else{
        ierr = MPI_Recv(&num_2_gen, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the information of how much data the worker must generate.
        ierr = MPI_Recv(&start_point, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the information of the initial position.

        // generate and sum vector pieces
        partial_sum = 0;
        for(i = start_point; i < start_point + num_2_gen; i++){
            vec2[i] = pow(i,2)+1.0;
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
               
    }

    ierr = MPI_Finalize();
    return 0;
    
}

In this new version, instead of passing pieces of the main vector, only the information on how to generate those pieces is passed to each process.

The new code using MPI_Reduce() is both faster and simpler than the previous one:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculates the sum of a vector using parallel computation.
If the main vector does not split evenly among all processes, the leftover is handled by process id 0.
Process id 0 is the root process; however, it also performs part of the calculation.

Each process will generate and calculate the partial sum of its vector piece. MPI_Reduce() is used to calculate the total sum.
Since the processes are independent, the printing order will be different on each run.

compile as: mpicc -o vector_sum vector_sum.c -lm
run as: time mpirun -n x vector_sum

x = number of processes; the root process also sums a piece. For example: if x = 3, the vector will be split in three.

Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process;
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    vec_size = 1e8; // defining the main vector size
    
    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    rows_per_proc = vec_size/num_proc; // getting the number of elements for each process.
    rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
    leftover = vec_size - num_proc*rows_per_proc; // counting the leftover.

    if(my_id == 0){
        num_2_gen = rows_per_proc + leftover; // if there is leftover, it is calculated in process 0
        start_point = my_id*num_2_gen; // the corresponding position on the main vector
    }
    else{
        num_2_gen = rows_per_proc;
        start_point = my_id*num_2_gen + leftover; // the corresponding position on the main vector
    }

    partial_sum = 0;
    for(i = start_point; i < start_point + num_2_gen; i++){
        vec2[i] = pow(i,2) + 1.0; // defining vector values
        partial_sum += vec2[i]; // calculating partial sum
    }

    printf("Partial sum of process id %d: %f.\n", my_id, partial_sum);

    MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, root_process, MPI_COMM_WORLD); // calculating total sum

    if(my_id == root_process){
        printf("Total sum is %f.\n", sum);
    }

    ierr = MPI_Finalize();
    return 0;
    
}
