Multiple threads decrease socket throughput on NUMA

Question

I benchmarked a Java program on a 16-core NUMA machine running Red Hat Linux. I measured the throughput of a Java DatagramSocket (for UDP) in terms of how many packets (64 bytes each) it was able to receive and send per second. The program consisted of a single socket and n threads listening on that socket. When a packet arrived, the receiving thread would copy the payload into a byte[] array, create a new DatagramPacket with that array, and send it straight back to where it came from. Think of it as a ping on the UDP layer.
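For reference, the setup described above could look roughly like the following. This is a minimal sketch, not the original benchmark code: the class name `UdpEchoSketch`, its method names, and the loopback round trip in `roundTrip` are assumptions added for illustration, and the lambdas need Java 8 or later (the benchmark itself ran on Java 7).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UdpEchoSketch {

    // Starts n daemon threads that all block in receive() on the SAME socket.
    static void startWorkers(DatagramSocket socket, int n) {
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> echoLoop(socket), "echo-" + i);
            t.setDaemon(true);
            t.start();
        }
    }

    // Receives a packet, copies its payload, and sends the copy back to the sender.
    static void echoLoop(DatagramSocket socket) {
        byte[] buffer = new byte[64]; // per-thread receive buffer (64-byte packets)
        DatagramPacket in = new DatagramPacket(buffer, buffer.length);
        try {
            while (true) {
                in.setLength(buffer.length); // receive() shrinks the length, so reset it
                socket.receive(in);
                byte[] payload = Arrays.copyOf(buffer, in.getLength());
                socket.send(new DatagramPacket(payload, payload.length, in.getSocketAddress()));
            }
        } catch (IOException e) {
            // The socket was closed (or failed): let this worker exit.
        }
    }

    // Hypothetical helper: one echo round trip over loopback against `workers` threads.
    static byte[] roundTrip(int workers, byte[] payload) {
        try (DatagramSocket server = new DatagramSocket(0, InetAddress.getLoopbackAddress());
             DatagramSocket client = new DatagramSocket()) {
            startWorkers(server, workers);
            client.setSoTimeout(2000); // do not hang forever if the echo is lost
            client.send(new DatagramPacket(payload, payload.length, server.getLocalSocketAddress()));
            DatagramPacket reply = new DatagramPacket(new byte[64], 64);
            client.receive(reply);
            return Arrays.copyOf(reply.getData(), reply.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] reply = roundTrip(2, "ping".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(reply, StandardCharsets.UTF_8));
    }
}
```

Note that every worker calls `receive` and `send` on one shared `DatagramSocket`, which is the configuration whose throughput dropped.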

I found that the Java DatagramSocket achieves a significantly lower throughput when using more than one thread, i.e. two or four. If I use only one thread to listen on the socket, I achieve a throughput of 122,000 packets per second, while more than one thread achieves only 65,000 packets per second. Now, I understand that a thread might be executed on any core of the NUMA machine and that memory accesses become expensive if the memory has to travel from one node to another. However, if I have two threads, only one should be executed on the "wrong" core, while the other should still achieve a very high throughput. Another possible explanation is a synchronization problem in the DatagramSocket, but these are only guesses. Does anybody have good insight into what the real explanation is?
I found that executing this program multiple times (in parallel) on multiple ports achieves a higher overall throughput. I started the program with one thread four times, and each instance used a socket on a separate port (5683, 5684, 5685 and 5686). The combined throughput of the four programs was 370,000 packets per second. In summary, using more than one thread on the same port decreases the throughput, while using more than one port with one thread each increases it. How can this be explained?

System specifications:

Hardware: 16 cores on 2 AMD Opteron(TM) Processor 6212 processors organized in 4 nodes with 32 GB RAM each. Frequency: 1.4 GHz, 2048 KB cache.

node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

The OS is Red Hat Enterprise Linux Workstation release 6.4 (Santiago) with kernel version 2.6.32-358.14.1.el6.x86_64. Java version "1.7.0_09", Java(TM) SE Runtime Environment (build 1.7.0_09-b05), Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode), and I used the -XX:+UseNUMA flag. Server and client are connected over 10 Gb Ethernet.

Answer 1

In general, you are most efficient when using only one thread. Making stuff parallel will inevitably introduce cost. The gain in throughput will only come when the additional amount of work you can do in parallel outweighs this cost.

Now, Amdahl's law illustrates the theoretical gain in throughput in relation to how much of your work can be parallelized and how much cannot. For example, if only 50% of your task is parallelizable, you can get at most a 2x increase in throughput regardless of how many threads you throw at the problem. Note that the chart you see inside the link ignores the cost of adding threads. In reality, native OS threads do add quite a bit of cost, especially when a lot of them are trying to access a shared resource.
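To put numbers on this, Amdahl's law gives the ideal speedup as S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of threads. The small sketch below just evaluates that formula; the class name `AmdahlSketch` is made up for illustration, and like the formula itself it ignores any per-thread overhead.

```java
public class AmdahlSketch {

    // Ideal speedup with n threads when a fraction p (0..1) of the work is parallelizable.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // With p = 0.5 the speedup approaches, but never reaches, 2x.
        for (int n : new int[] {1, 2, 4, 16, 1024}) {
            System.out.printf("n=%d speedup=%.3f%n", n, speedup(0.5, n));
        }
    }
}
```

With p = 0.5 the serial half of the work bounds the result below 2 no matter how large n gets, and any real synchronization cost pushes the measured number lower still.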

In your case, when you used only one socket, most of your work was not parallelizable. Hence using a single thread gave superior performance, and adding threads made it worse because of the costs they added. In your second experiment, you increased the work that can be parallelized by using more than one socket. Hence you gained in throughput despite adding some cost by using threads.
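As a rough illustration of the "one socket per thread" layout, here is a sketch that gives each echo thread its own socket on its own port inside a single JVM. The question ran four separate processes instead, so this is an assumed variant, not a reproduction of that experiment. The class `PerSocketEchoSketch` and its methods are hypothetical names, and the sockets are bound to loopback only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerSocketEchoSketch {

    // Opens one loopback socket per entry in ports (0 = any free port) and
    // starts exactly one echo thread per socket, so no socket is shared.
    static List<DatagramSocket> start(int... ports) {
        List<DatagramSocket> sockets = new ArrayList<>();
        try {
            for (int port : ports) {
                DatagramSocket socket = new DatagramSocket(port, InetAddress.getLoopbackAddress());
                sockets.add(socket);
                Thread t = new Thread(() -> echoLoop(socket), "echo-port-" + socket.getLocalPort());
                t.setDaemon(true);
                t.start();
            }
        } catch (IOException e) {
            sockets.forEach(DatagramSocket::close); // release anything already opened
            throw new UncheckedIOException(e);
        }
        return sockets;
    }

    // Same echo logic as a shared-socket server, but this thread owns its socket.
    static void echoLoop(DatagramSocket socket) {
        byte[] buffer = new byte[64];
        DatagramPacket in = new DatagramPacket(buffer, buffer.length);
        try {
            while (true) {
                in.setLength(buffer.length); // receive() shrinks the length, so reset it
                socket.receive(in);
                byte[] payload = Arrays.copyOf(buffer, in.getLength());
                socket.send(new DatagramPacket(payload, payload.length, in.getSocketAddress()));
            }
        } catch (IOException e) {
            // Socket closed: the worker exits.
        }
    }

    // Hypothetical client helper: sends payload to one server socket and returns the echo.
    static byte[] ping(DatagramSocket server, byte[] payload) {
        try (DatagramSocket client = new DatagramSocket()) {
            client.setSoTimeout(2000);
            client.send(new DatagramPacket(payload, payload.length, server.getLocalSocketAddress()));
            DatagramPacket reply = new DatagramPacket(new byte[64], 64);
            client.receive(reply);
            return Arrays.copyOf(reply.getData(), reply.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `PerSocketEchoSketch.start(5683, 5684, 5685, 5686)` would mirror the port numbers from the question. Whether a single JVM laid out this way matches the throughput of four separate processes would have to be measured.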

Answer 1
1 accepted 2013-09-24 08:34:44
