
Creating a counter that stays synchronized across MPI processes

I have quite a bit of experience using the basic comm and group MPI2 methods, and do quite a bit of embarrassingly parallel simulation work using MPI. Up until now, I have structured my code to have a dispatch node and a bunch of worker nodes. The dispatch node has a list of parameter files that will be run with the simulator. It seeds each worker node with a parameter file. The worker nodes run their simulation, then request another parameter file, which the dispatch node provides. Once all parameter files have been run, the dispatch node shuts down each worker node before shutting itself down.

The parameter files are typically named "Par_N.txt" where N is the identifying integer (e.g. N = 1-1000). So I was thinking, if I could create a counter and have this counter synchronized across all of my nodes, I could eliminate the need for a dispatch node and make the system a bit simpler. As simple as this sounds in theory, in practice I suspect it is a bit more difficult, as I'd need to ensure the counter is locked while being changed, etc., and I thought there might be a built-in way for MPI to handle this. Any thoughts? Am I overthinking this?
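For context, the dispatch/worker scheme described above boils down to something like the following. This is only a minimal sketch, not the original code: the file count of 1000, the tag values, and the run_simulation() wrapper are assumptions made for illustration.

#include <mpi.h>
#include <stdio.h>

#define NFILES   1000
#define TAG_WORK 1
#define TAG_STOP 2

/* stand-in for the real simulator */
void run_simulation(const char *parfile) { (void)parfile; }

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* dispatch node */
        int next = 1, stopped = 0;
        MPI_Status status;

        /* seed every worker with its first parameter file (or a stop message) */
        for (int w = 1; w < size; w++) {
            if (next <= NFILES) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                stopped++;
            }
        }

        /* answer each request with the next file, then shut workers down */
        while (stopped < size - 1) {
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (next <= NFILES) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                               /* worker node */
        int n;
        MPI_Status status;

        MPI_Recv(&n, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        while (status.MPI_TAG == TAG_WORK) {
            char fname[32];
            snprintf(fname, sizeof(fname), "Par_%d.txt", n);
            run_simulation(fname);
            MPI_Send(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* ask for more */
            MPI_Recv(&n, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }
    }

    MPI_Finalize();
    return 0;
}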

Implementing a shared counter isn't trivial, but once you do it and have it in a library somewhere you can do a lot with it.

In the Using MPI-2 book, which you should have to hand if you're going to implement this stuff, one of the examples (the code is available online) is a shared counter. The "non-scalable" one should work well out to several dozen processes: the counter is an array of integers, one per rank (0..size-1), and then the "get next work item" operation consists of locking the window, reading everyone else's contribution to the counter (in this case, how many items they've taken), updating your own (++), closing the window, and calculating the total. This is all done with passive one-sided operations. (The better-scaling one just uses a tree rather than a 1-d array.)

So the usage would be: have, say, rank 0 host the counter, and everyone keeps doing work units and updating the counter to get the next one until there's no more work; then you wait at a barrier or something and finalize.

Once you have something like this working - using a shared value to get the next available work unit - you can generalize to more sophisticated approaches. So, as suszterpatt suggested, everyone taking "their share" of work units at the start works great, but what to do if some finish faster than others? The usual answer now is work-stealing: everyone keeps their list of work units in a deque, and when someone runs out of work, they steal work units from the other end of someone else's deque, until there's no more work left. This is really the completely-distributed version of master-worker, where there's no single master partitioning the work any more. Once you have a single shared counter working, you can make mutexes from those, and from that you can implement the deque (see the sketch just below). But if the simple shared counter works well enough, you may not need to go there.
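To make the mutex point concrete, here is one way a lock could be built on top of shared counters. This is only a rough sketch (it is not from the book or from the code below), expressed in terms of the create_counter()/increment_counter() routines defined in the update further down, and it uses the classic ticket-lock idea.

/* Rough sketch of a ticket-lock style mutex built from two shared counters
 * (see the counter code in the update below).  Acquire: take a ticket by
 * fetch-and-incrementing 'tickets', then poll 'serving' until it reaches
 * your ticket.  Release: increment 'serving'.  Note the polling loop is a
 * busy-wait over one-sided operations and may behave poorly on MPI
 * implementations without asynchronous progress. */
struct mpi_mutex_t {
    struct mpi_counter_t *tickets;   /* how many ranks have asked for the lock */
    struct mpi_counter_t *serving;   /* how many critical sections have finished */
};

struct mpi_mutex_t *create_mutex(int hostrank) {
    struct mpi_mutex_t *m = malloc(sizeof(struct mpi_mutex_t));
    m->tickets = create_counter(hostrank);
    m->serving = create_counter(hostrank);
    return m;
}

void mutex_acquire(struct mpi_mutex_t *m) {
    int my_ticket = increment_counter(m->tickets, 1);     /* tickets are 1, 2, 3, ... */
    while (increment_counter(m->serving, 0) < my_ticket - 1)
        ;                                                  /* spin until it's our turn */
}

void mutex_release(struct mpi_mutex_t *m) {
    increment_counter(m->serving, 1);
}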

Update: OK, so here's a hacky attempt at doing the shared counter - my version of the simple one in the MPI-2 book. It seems to work, but I wouldn't say anything much stronger than that (I haven't played with this stuff for a long time). There's a simple counter implementation (corresponding to the non-scaling version in the MPI-2 book) with two simple tests, one corresponding roughly to your work case; each worker updates the counter to get a work item, then does the "work" (sleeps for a random amount of time). At the end of each test, the counter data structure is printed out, which is the number of increments each rank has done.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

struct mpi_counter_t {
    MPI_Win win;
    int hostrank;    /* rank that owns the counter data */
    int myval;       /* this rank's running total of increments */
    int *data;       /* per-rank counts; allocated only on hostrank */
    int rank, size;
};

struct mpi_counter_t *create_counter(int hostrank) {
    struct mpi_counter_t *count;

    count = (struct mpi_counter_t *)malloc(sizeof(struct mpi_counter_t));
    count->hostrank = hostrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &(count->rank));
    MPI_Comm_size(MPI_COMM_WORLD, &(count->size));

    if (count->rank == hostrank) {
        MPI_Alloc_mem(count->size * sizeof(int), MPI_INFO_NULL, &(count->data));
        for (int i=0; i<count->size; i++) count->data[i] = 0;
        MPI_Win_create(count->data, count->size * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &(count->win));
    } else {
        count->data = NULL;
        MPI_Win_create(count->data, 0, 1,
                       MPI_INFO_NULL, MPI_COMM_WORLD, &(count->win));
    }
    count->myval = 0;

    return count;
}

int increment_counter(struct mpi_counter_t *count, int increment) {
    int *vals = (int *)malloc( count->size * sizeof(int) );
    int val;

    /* lock the window on the host rank, add our increment to our own slot,
       and read back everyone else's slot */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, count->hostrank, 0, count->win);

    for (int i=0; i<count->size; i++) {

        if (i == count->rank) {
            MPI_Accumulate(&increment, 1, MPI_INT, count->hostrank, i, 1, MPI_INT,
                           MPI_SUM, count->win);
        } else {
            MPI_Get(&vals[i], 1, MPI_INT, count->hostrank, i, 1, MPI_INT,
                    count->win);
        }
    }

    MPI_Win_unlock(count->hostrank, count->win);
    count->myval += increment;

    vals[count->rank] = count->myval;
    val = 0;
    for (int i=0; i<count->size; i++)
        val += vals[i];

    free(vals);
    return val;
}

void delete_counter(struct mpi_counter_t **count) {
    if ((*count)->rank == (*count)->hostrank) {
        MPI_Free_mem((*count)->data);
    }
    MPI_Win_free(&((*count)->win));
    free((*count));
    *count = NULL;

    return;
}

void print_counter(struct mpi_counter_t *count) {
    if (count->rank == count->hostrank) {
        for (int i=0; i<count->size; i++) {
            printf("%2d ", count->data[i]);
        }
        puts("");
    }
}

void test1() {
    struct mpi_counter_t *c;
    int rank;
    int result;

    c = create_counter(0);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    result = increment_counter(c, 1);
    printf("%d got counter %d\n", rank, result);

    MPI_Barrier(MPI_COMM_WORLD);
    print_counter(c);
    delete_counter(&c);
}


void test2() {
    const int WORKITEMS=50;

    struct mpi_counter_t *c;
    int rank;
    int result = 0;

    c = create_counter(0);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srandom(rank);

    while (result < WORKITEMS) {
        result = increment_counter(c, 1);
        if (result <= WORKITEMS) {
             printf("%d working on item %d...\n", rank, result);
             sleep(random() % 10);
         } else {
             printf("%d done\n", rank);
         }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    print_counter(c);
    delete_counter(&c);
}

int main(int argc, char **argv) {

    MPI_Init(&argc, &argv);

    test1();
    test2();

    MPI_Finalize();
    return 0;
}
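For the original question, the counter value maps directly onto a parameter-file index. A minimal sketch of what every rank could run instead of talking to a dispatch node, assuming the counter code above is in the same file and a hypothetical run_simulation() wraps the existing simulator:

/* Sketch: every rank pulls the next parameter-file index from the shared
 * counter until all files have been taken.  run_simulation() is a
 * hypothetical wrapper around the existing simulator, not part of the
 * code above. */
void run_simulation(const char *parfile);   /* provided elsewhere */

void run_all_parfiles(int nfiles) {
    struct mpi_counter_t *c = create_counter(0);   /* rank 0 hosts the counter */
    int n;

    while ((n = increment_counter(c, 1)) <= nfiles) {
        char fname[32];
        snprintf(fname, sizeof(fname), "Par_%d.txt", n);
        run_simulation(fname);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* make sure everyone is finished */
    delete_counter(&c);
}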

I can't think of any built-in mechanism to solve that problem; you'd have to implement it manually. Judging by your comments, you want to decentralize the program, in which case each process (or at least each group of processes) would have to keep its own value of the counter and keep it synchronized. This could probably be done with clever use of non-blocking sends/receives, but the semantics of those are not trivial.

Instead, I'd resolve the saturation issue by simply issuing several files at once to the worker processes. This would reduce network traffic and allow you to keep your simple single-dispatcher setup, as in the sketch below.
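A rough sketch of what "several files at once" could look like on the dispatcher side; BATCH, the tag, and the helper name are illustrative assumptions, not from the answer.

/* Sketch: reply to a worker's request with up to BATCH file indices at once;
 * an empty message (count 0) tells the worker there is no more work. */
#define BATCH 10

void send_batch(int *next, int nfiles, int dest, int tag) {
    int ids[BATCH];
    int count = 0;

    while (count < BATCH && *next <= nfiles)
        ids[count++] = (*next)++;
    MPI_Send(ids, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
}

On the worker side, MPI_Probe and MPI_Get_count can be used to find out how many indices arrived before posting the matching MPI_Recv.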

It seems like you are using your dispatch node to do dynamic load balancing (assigning work to processors as they become available). A shared counter that doesn't require all of the processors to stop will not do that. I would recommend staying with what you have now, or doing what suszterpatt suggests and sending batches of files out at a time.

It's not clear whether you need to go through the files in strict order or not. If not, why not just have each node i handle all files where N % total_workers == i, i.e. a cyclic distribution of the work?
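A minimal sketch of that cyclic distribution, assuming files numbered 1..nfiles and a hypothetical run_simulation() wrapping the simulator; no communication is needed at all.

/* Sketch: rank i processes every file whose index N satisfies
 * N % size == i. */
void run_cyclic(int nfiles) {
    int rank, size;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int n = 1; n <= nfiles; n++) {
        if (n % size != rank)
            continue;                      /* someone else's file */
        char fname[32];
        snprintf(fname, sizeof(fname), "Par_%d.txt", n);
        run_simulation(fname);             /* hypothetical wrapper */
    }
}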
