
MPI_SEND takes a huge part of virtual memory

While debugging my program on a large number of cores, I ran into a very strange "insufficient virtual memory" error. My investigation led me to the piece of code where the master sends small messages to each slave. I then wrote a small test program in which rank 0 (the master) simply sends 10 integers with MPI_SEND and all the slaves receive them with MPI_RECV. Comparing /proc/self/status before and after the MPI_SEND shows that the difference in memory size is huge! The most interesting thing (which crashes my program) is that this memory is not deallocated after MPI_Send and still takes up a huge amount of space.
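A stripped-down sketch of the test program (simplified here; the /proc/self/status parsing just pulls out VmSize for illustration):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Print the VmSize line from /proc/self/status for this rank. */
static void print_vmsize(const char *label, int rank)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f) return;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmSize:", 7) == 0)
            printf("[rank %d] %s: %s", rank, label, line);
    }
    fclose(f);
}

int main(int argc, char **argv)
{
    int rank, size, i;
    int buf[10] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    print_vmsize("before MPI_Send", rank);

    if (rank == 0) {
        /* Master sends 10 integers to every slave. */
        for (i = 1; i < size; i++)
            MPI_Send(buf, 10, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    print_vmsize("after MPI_Send", rank);

    MPI_Finalize();
    return 0;
}

The status dumps below are from rank 0 of such a run.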

Any ideas?

 System memory usage before MPI_Send, rank: 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   251400 kB                                                                                 
VmSize:   186628 kB                                                                                 
VmLck:        72 kB                                                                                  
VmHWM:      4068 kB                                                                                  
VmRSS:      4068 kB                                                                                  
VmData:    71076 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       148 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3                                                                                          

 System memory usage after MPI_Send, rank 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   456880 kB                                                                                 
VmSize:   456872 kB                                                                                 
VmLck:    257884 kB                                                                                  
VmHWM:    274612 kB                                                                                  
VmRSS:    274612 kB                                                                                  
VmData:   341320 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       676 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3        

This is expected behaviour from almost any MPI implementation that runs over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to that physical memory. Registering memory for use by the IB HCA is a very complex and hence very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used as a source or destination again. If the registered memory was heap memory, it is never returned to the operating system, which is why your data segment only grows in size.
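As a purely conceptual illustration of that caching behaviour (this is not Intel MPI's or any real library's code; register_with_hca() is a made-up stand-in for the slow driver call that pins the pages and programs the HCA's address translation):

#include <stdlib.h>

/* Hypothetical stand-in for the expensive registration call
 * (ibv_reg_mr() and friends in the verbs world). */
static void *register_with_hca(void *addr, size_t len)
{
    (void)len;
    return addr;
}

struct reg_entry { void *addr; size_t len; struct reg_entry *next; };
static struct reg_entry *reg_cache = NULL;

/* Register a buffer once and remember it forever: later transfers from the
 * same buffer skip the expensive registration, but nothing ever removes an
 * entry, so the pinned footprint can only grow. */
static void *get_registration(void *addr, size_t len)
{
    for (struct reg_entry *e = reg_cache; e; e = e->next)
        if (e->addr == addr && e->len >= len)
            return e->addr;                    /* cache hit: no new pinning */

    struct reg_entry *e = malloc(sizeof *e);   /* cache miss: pin and remember */
    e->addr = register_with_hca(addr, len);
    e->len  = len;
    e->next = reg_cache;
    reg_cache = e;
    return e->addr;
}

int main(void)
{
    static char buf[4096];
    get_registration(buf, sizeof buf);   /* first use: pages get pinned and cached */
    get_registration(buf, sizeof buf);   /* later uses: cache hit, nothing new pinned */
    return 0;
}

The important property is the cache-miss branch: memory is pinned and remembered, and no entry is ever removed, so the registered footprint only grows.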

Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs high memory overhead. Most people don't really think about this and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) that are allocated in the memory of the process, and those queues grow significantly with the number of processes. In some fully connected cases the amount of queue memory can be so large that no memory is actually left for the application.
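For example, a schematic two-rank pattern (simplified, not taken from any particular code base): keep one persistent buffer alive across many sends instead of allocating a fresh one per message, so the underlying registration happens at most once.

#include <mpi.h>
#include <stdlib.h>

/* Rank 0 streams many messages to rank 1 from a single, persistent send
 * buffer, so the library has to register (pin) that memory at most once.
 * The wasteful alternative would be to malloc()/free() a fresh buffer
 * inside the loop, which can make the library pin and cache new pages on
 * every iteration. */
int main(int argc, char **argv)
{
    const int niter = 100;
    const int count = 1 << 16;   /* large enough to stay off the eager path */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(count * sizeof *buf);   /* one persistent buffer */

    for (int i = 0; i < niter; i++) {
        if (rank == 0) {
            for (int j = 0; j < count; j++) buf[j] = i + j;   /* stand-in payload */
            MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}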

There are certain parameters that control the IB queues used by Intel MPI. The most important one in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of possible performance implications though. You can also try dynamic preallocated buffer sizes by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI initially registers small buffers and grows them later if needed. Note also that IMPI opens connections lazily, which is why you see the huge increase in used memory only after the call to MPI_Send.

If you are not using the DAPL transport, e.g. you are using the ofa transport instead, there is not much you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1. This should somewhat decrease the memory used. Enabling dynamic queue pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might also decrease memory usage if the communication graph of your program is not fully connected (a fully connected program is one in which each rank talks to all other ranks).
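A minimal sketch of these knobs is below. These variables are normally exported in the job script or passed with mpirun's -genv option before the job starts; whether setting them programmatically with setenv() before MPI_Init() is honoured depends on the Intel MPI version, so treat the snippet mainly as a summary of which variables to experiment with, not as the recommended way to set them.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* DAPL transport (the default): fewer and dynamically grown eager buffers. */
    setenv("I_MPI_DAPL_BUFFER_NUM", "8", 1);          /* default is 16 */
    setenv("I_MPI_DAPL_BUFFER_ENLARGEMENT", "1", 1);  /* start small, grow on demand */

    /* ofa transport instead of DAPL: shared XRC queues, lazy queue pairs. */
    setenv("I_MPI_OFA_USE_XRC", "1", 1);
    setenv("I_MPI_OFA_DYNAMIC_QPS", "1", 1);

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}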

Hristo's answer is mostly right, but since you are using small messages there is a bit of a difference. The messages end up on the eager path: they are first copied to an already-registered buffer, that buffer is used for the transfer, and the receiver copies the message out of an eager buffer on their end. Reusing buffers in your code will only help with large messages.

This is done precisely to avoid the slowness of registering the user-supplied buffer. For large messages the copy takes longer than the registration would, so the rendezvous protocol is used instead.

These eager buffers are somewhat wasteful. For example, they are 16 kB by default on Intel MPI with OF verbs. Unless message aggregation is used, each 10-int-sized message eats four 4 kB pages. But aggregation won't help when talking to multiple receivers anyway.
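A rough back-of-the-envelope estimate (assuming the defaults quoted in these answers, 16 preallocated buffers of 16 kB per connection, and ignoring how the pool is split between sends and receives, so order of magnitude only): a fully connected job needs roughly (nranks - 1) * 16 * 16 kB of eager-buffer memory per process. If the original job ran on the order of a thousand ranks, that alone would account for a jump of a couple of hundred MB like the one shown in the question.

#include <stdio.h>

/* Order-of-magnitude estimate of per-process eager-buffer memory for a fully
 * connected job, using the defaults quoted above (16 buffers of 16 kB per
 * connection). Real numbers depend on the transport and buffer pool layout. */
int main(void)
{
    const long bufs_per_conn = 16;   /* I_MPI_DAPL_BUFFER_NUM default      */
    const long buf_size_kb   = 16;   /* default eager buffer size, in kB   */

    for (long nranks = 64; nranks <= 4096; nranks *= 4) {
        long kb = (nranks - 1) * bufs_per_conn * buf_size_kb;
        printf("%5ld ranks: ~%5ld MB of eager buffers per process\n",
               nranks, kb / 1024);
    }
    return 0;
}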

So what to do? Reduce the size of the eager buffers. This is controlled by the eager/rendezvous threshold (the I_MPI_RDMA_EAGER_THRESHOLD environment variable). Try 2048 or even smaller. Note that this can result in a latency increase. Or change the I_MPI_DAPL_BUFFER_NUM variable to control the number of these buffers, or try the dynamic resizing feature that Hristo suggested. This assumes your IMPI is using DAPL (the default). If you are using OF verbs directly, the DAPL variables won't work.


Edit: The final solution for getting this to run was setting I_MPI_DAPL_UD=enable. I can speculate on the origin of the magic, but I don't have access to Intel's code to actually confirm this.

IB can have different transport modes, two of which are RC (Reliable Connected) and UD (Unreliable Datagram). RC requires an explicit connection between hosts (like TCP), and some memory is spent per connection. More importantly, each connection has those eager buffers tied to it, and this really adds up. This is what you get with Intel's default settings.

There is an optimization possible: sharing the eager buffers between connections (this is called SRQ, Shared Receive Queue). There is a further Mellanox-only extension called XRC (eXtended RC) that takes the queue sharing one step further, to the processes on the same node. By default Intel MPI accesses the IB device through DAPL, not directly through OF verbs. My guess is that this precludes these optimizations (I don't have experience with DAPL). It is possible to enable XRC support by setting I_MPI_FABRICS=shm:ofa and I_MPI_OFA_USE_XRC=1 (making Intel MPI use the OFA interface instead of DAPL).

When you switch to the UD transport you get a further optimization on top of buffer sharing: there is no longer a need to track connections. The buffer sharing is natural in this model: since there are no connections, all the internal buffers are in a shared pool, just like with SRQ. So there are further memory savings, but at a cost: datagram delivery can potentially fail, and it is up to the software, not the IB hardware, to handle retransmissions. This is all transparent to the application code using MPI, of course.
