Can't run an MPI program through the mpirun command of the OpenMPI implementation

I have an MPI program (C code for a school project) that I want to run on multiple nodes (2 nodes for now), but it doesn't work: it waits indefinitely without printing any text or error. I am trying to run it with the command mpirun -np 2 --host 192.168.0.1,192.168.0.2 ./mandelbrot_mpi_omp (the IP addresses are placeholders; the real ones are different and correct). I run it on both nodes, giving the IP addresses in the same order on both machines so that the first one is always the master with rank 0.

Here is the main function of the MPI program (just in case; I don't think this is why MPI isn't working across nodes, but I might be wrong):

int main(int argc, char* argv[]){
    int width = SCALE_X;
    int height = SCALE_Y;

    // MPI init & setup
    MPI_Init(&argc, &argv);

    int world_size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // calculate size of buffer according to server count
    int part_height = SCALE_Y/world_size;
    int buffer_size = (width+1)*(part_height+1)*3;

    // dynamically allocate arrays for image data according to server count
    send_buffer = calloc( buffer_size, sizeof(PIXEL));
    recv_buffer = calloc( buffer_size*world_size, sizeof(PIXEL));

    if(rank == 0) printf("MPI node count: %i\n", world_size);
    MPI_Barrier(MPI_COMM_WORLD);

    // OpenMP setup
    int cpu_count = omp_get_num_procs();
    omp_set_num_threads(cpu_count);
    printf("OpenMP cpu count on node %i: %i\n", rank, cpu_count);
    // omp_get_num_threads() returns 1 outside a parallel region, so query the maximum instead
    printf("OpenMP (max) thread count on node %i: %i\n", rank, omp_get_max_threads());
    MPI_Barrier(MPI_COMM_WORLD);

    // generate a part of mandelbrot set according to world size and rank of this server
    mandelbrot(rank, world_size, width, part_height);

    // gather parts of mandelbrot from all nodes
    MPI_Gather(send_buffer, (width)*(part_height)*3, MPI_CHAR, recv_buffer, (width)*(part_height)*3, MPI_CHAR, 0, MPI_COMM_WORLD);


    // save raster array of mandelbrot data to png file
    if(rank == 0) save_to_png(width, height);


    printf("Process %i finished.\n", rank);

    MPI_Finalize();

    return 0;
}
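
A side note on the snippet above, independent of the launch problem: part_height = SCALE_Y/world_size is an integer division, so when the image height is not a multiple of the process count, the last SCALE_Y % world_size rows are assigned to no rank. Below is a minimal sketch of one way to split the rows so that every row is covered. The names row_range_t and rows_for_rank are hypothetical and not part of the original program, and with uneven parts the gather step would presumably need MPI_Gatherv rather than MPI_Gather.

```c
#include <assert.h>

/* Hypothetical helper type: a half-open range of image rows
 * [first, first + count). Not part of the original program. */
typedef struct {
    int first;
    int count;
} row_range_t;

/*
 * Split `height` rows across `world_size` ranks as evenly as possible.
 * The first (height % world_size) ranks each get one extra row, so the
 * ranges are contiguous and together cover every row exactly once.
 */
row_range_t rows_for_rank(int rank, int world_size, int height)
{
    assert(world_size > 0 && rank >= 0 && rank < world_size && height >= 0);

    int base  = height / world_size;  /* rows that every rank gets */
    int extra = height % world_size;  /* leftover rows to hand out */

    row_range_t r;
    r.count = base + (rank < extra ? 1 : 0);
    /* Each earlier rank holds `base` rows, plus one for every extra
     * row that was handed out to a rank before this one. */
    r.first = rank * base + (rank < extra ? rank : extra);
    return r;
}
```

For example, with 10 rows and 3 processes, rows_for_rank gives ranks 0, 1 and 2 the rows 0-3, 4-6 and 7-9 respectively.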

I am running OpenMPI from the Debian repositories, and my OS is Debian 11 (on both machines).

I tried replacing the -np parameter with -n, with no effect. If I run two processes on the same machine with mpirun -np 2 --host 127.0.0.1,127.0.0.1 ./mandelbrot_mpi_omp, it works flawlessly: it launches two processes that do their job fine. If I stop the task on both computers with CTRL+Z (while it is waiting indefinitely and not actually running), it gives me an error:

ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   <hostname placeholder>
  target node:  <real ip here>

But those machines can communicate: I can ping them, and they can connect to each other with ssh. They have the same username and password.

What am I missing? Thanks in advance.

So the problem was that I couldn't log in via ssh without a password. Once I set up passwordless login to the other PCs by generating a pair of RSA keys on both machines, it works. As far as I can tell, mpirun starts its daemons on the remote nodes over ssh and cannot answer a password prompt, which would explain why it waited silently.
