
MPI - Message Truncation in MPI_Recv

I am having a problem in one of my projects related to MPI development. I am implementing an RNA parsing algorithm with MPI, in which a master node starts parsing an input string based on some parsing rules and a parsing table (which contains the different states and their related actions). In the parsing table, each state has multiple actions that can be carried out in parallel, so I have to distribute these actions among different processes. To do that, I send the current state and the parsing info (the current parsing stack) to the nodes. Each node uses a separate thread to receive actions from other nodes while its main thread is busy parsing based on the actions already received. Following are the code snippets of the sender and receiver:

Sender code:

StackFlush(&snd_stack);
StackPush(&snd_stack, state_index);
StackPush(&snd_stack, current_ch);
StackPush(&snd_stack, actions_to_skip);
elements_in_stack = stack.top + 1;
for(int a=elements_in_stack-1;a>=0;a--)
    StackPush(&snd_stack, stack.contents[a]);
StackPush(&snd_stack, elements_in_stack);
elements_in_stack = parse_tree.top + 1;
for(int a=elements_in_stack-1;a>=0;a--)
    StackPush(&snd_stack, parse_tree.contents[a]);
StackPush(&snd_stack, elements_in_stack);
elements_in_stack = snd_stack.top+1;
MPI_Send(&elements_in_stack, 1, MPI_INT, (myrank + actions_to_skip) % mysize, MSG_ACTION_STACK_COUNT, MPI_COMM_WORLD);
MPI_Send(&snd_stack.contents[0], elements_in_stack, MPI_CHAR, (myrank + actions_to_skip) % mysize, MSG_ACTION_STACK, MPI_COMM_WORLD);

Receiver code:

MPI_Recv(&e_count, 1, MPI_INT, MPI_ANY_SOURCE, MSG_ACTION_STACK_COUNT, MPI_COMM_WORLD, &status);
if(e_count == 0){
    break;
}
while((bt_stack.top + e_count) >= bt_stack.maxSize - 1){usleep(500);}
pthread_mutex_lock(&mutex_bt_stack); //using mutex for accessing shared data among threads
MPI_Recv(&bt_stack.contents[bt_stack.top + 1], e_count, MPI_CHAR, status.MPI_SOURCE, MSG_ACTION_STACK, MPI_COMM_WORLD, &status);
bt_stack.top += e_count;
pthread_mutex_unlock(&mutex_bt_stack);

The program runs fine for small inputs that involve little communication. As the input size grows, communication increases, so the receiver gets many requests while it has processed only a few, and then it crashes with the following error:

Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(186) ……………………………………: MPI_Recv(buf=0x5b8d7b1, count=19, MPI_CHAR, src=3, tag=1, MPI_COMM_WORLD, status=0x41732100) failed
MPIDI_CH3U_Request_unpack_uebuf(625)L Message truncated; 21 bytes received but buffer size is 19
Rank 0 in job 73 hpc081_56549 caused collective abort of all ranks
exit status of rank 0: killed by signal 9.

I have also tried non-blocking MPI calls, but I still get similar errors.

I don't know what the rest of the code looks like, but here's an idea. Since there is a break, I'm assuming the receiver code is part of a loop or a switch statement. If that's the case, there is a mismatch between sends and receives when the element count becomes 0:

  1. The sender will send the element count and a zero-length message (the MPI_Send(&snd_stack.contents... line).
  2. There will be no matching receive for this second message because the receiver breaks out of the loop.
  3. The zero-length message will then match something else, possibly causing the error you see further down the line.
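
To make the three steps concrete, here is a minimal toy model in plain C. It is not MPI and not your code: ToyMailbox, toy_send_pair, toy_receiver_session and toy_demo are hypothetical names invented for this sketch. It only mimics the rule that messages from one sender with the same tag are matched in the order they were sent. It also assumes the receiver starts receiving again after the break, which the question does not confirm.

```c
#include <stddef.h>

/* Toy stand-in for one sender/receiver pair. NOT MPI: all names are
   hypothetical and exist only to illustrate in-order tag matching. */
enum { TAG_COUNT = 0, TAG_STACK = 1 };
enum { TOY_OK = 0, TOY_EMPTY = -1, TOY_TRUNCATED = -2 };
enum { MAX_MSGS = 16 };

typedef struct {
    int tag;
    int value;     /* announced count (TAG_COUNT) or payload length (TAG_STACK) */
    int consumed;  /* 1 once a receive has matched this message */
} ToyMsg;

typedef struct {
    ToyMsg msgs[MAX_MSGS];
    int n;
} ToyMailbox;

void toy_send(ToyMailbox *mb, int tag, int value) {
    if (mb->n < MAX_MSGS) {
        mb->msgs[mb->n].tag = tag;
        mb->msgs[mb->n].value = value;
        mb->msgs[mb->n].consumed = 0;
        mb->n++;
    }
}

/* Like the sender in the question: always a count, then a payload,
   even when the count is 0. */
void toy_send_pair(ToyMailbox *mb, int count) {
    toy_send(mb, TAG_COUNT, count);
    toy_send(mb, TAG_STACK, count);
}

/* Oldest unconsumed message with this tag, or NULL if there is none. */
ToyMsg *toy_match(ToyMailbox *mb, int tag) {
    for (int i = 0; i < mb->n; i++)
        if (!mb->msgs[i].consumed && mb->msgs[i].tag == tag)
            return &mb->msgs[i];
    return NULL;
}

int toy_recv_count(ToyMailbox *mb, int *count) {
    ToyMsg *m = toy_match(mb, TAG_COUNT);
    if (m == NULL) return TOY_EMPTY;
    m->consumed = 1;
    *count = m->value;
    return TOY_OK;
}

/* Returns the bytes received, TOY_EMPTY, or TOY_TRUNCATED when the
   matched message is longer than the buffer capacity. */
int toy_recv_payload(ToyMailbox *mb, int capacity) {
    ToyMsg *m = toy_match(mb, TAG_STACK);
    if (m == NULL) return TOY_EMPTY;
    if (m->value > capacity) return TOY_TRUNCATED;
    m->consumed = 1;
    return m->value;
}

/* One pass of the receiver loop. drain_on_zero == 0 breaks right after a
   zero count, as in the question; drain_on_zero == 1 first consumes the
   zero-length payload that belongs to that count. */
int toy_receiver_session(ToyMailbox *mb, int drain_on_zero) {
    for (;;) {
        int count;
        if (toy_recv_count(mb, &count) == TOY_EMPTY) return TOY_OK;
        if (count == 0) {
            if (drain_on_zero) (void)toy_recv_payload(mb, 0);
            return TOY_OK;
        }
        if (toy_recv_payload(mb, count) == TOY_TRUNCATED) return TOY_TRUNCATED;
    }
}

/* Sender queues pairs of size 0, 21 and 19; the receiver runs one session
   that ends at the zero count, then (by assumption) a second session. */
int toy_demo(int drain_on_zero) {
    ToyMailbox mb = { .n = 0 };
    toy_send_pair(&mb, 0);
    toy_send_pair(&mb, 21);
    toy_send_pair(&mb, 19);
    (void)toy_receiver_session(&mb, drain_on_zero);
    return toy_receiver_session(&mb, drain_on_zero);
}
```

In this model toy_demo(0) returns TOY_TRUNCATED: the stale zero-length payload shifts every later payload by one, so the 19-byte receive ends up matching the 21-byte message, the same shape as "21 bytes received but buffer size is 19". toy_demo(1) returns TOY_OK because every count stays paired with its own payload. In the real program the equivalent change would be either to skip the second MPI_Send when elements_in_stack is 0, or to post the matching zero-length MPI_Recv before the break. Another option, if it suits your design, is to drop the separate count message and size the receive with MPI_Probe and MPI_Get_count.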
