
Why does setting CPU affinity make threads run slower?

All,

I wrote a small test case for a multi-threaded producer/consumer model. My testbed is a low-performance PC (8 GB RAM, 4-core J1900 CPU). I isolated core 0 for the Linux kernel; cores 1-3 are otherwise unused. The producer thread runs on core 1: it allocates 5,000,000 small objects and puts them into a global queue. The consumer thread runs on core 2 and deallocates the objects it takes from the queue. But I found that if I do not set their CPU affinity (that is, both threads run on the same core 0), performance is better than with affinity set (8.76 s vs. 14.66 s). The results stay consistent across runs. Could someone explain why? Even if my premise ("setting CPU affinity can improve a multi-threaded process's performance") is incorrect, performance should at least not get worse. My code slice is listed below:

void producer() {
  Timestamp begin;

  for ( int i = 0; i < data_nb; ++i ) {
    Test* test = new Test(i, i+1);  // allocated on the producer thread
    queue.enqueue(test);
  }

  Timestamp end;
  TimeDuration td = end - begin;
  printf("producer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
}

void consumer() {
  Timestamp begin;

  do {  // spin-polls the queue; never blocks when the queue is empty
    Test* test = queue.dequeue();
    if ( test ) {
      nb.add(1); // nb is an atomic counter
      delete test;  // freed on the consumer thread
      test = nullptr;
    }
  } while ( nb.get() < data_nb );

  Timestamp end;
  TimeDuration td = end - begin;
  //printf("%d data consumed\n", nb.get());
  printf("consumer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
}
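
The affinity-setting code is not shown above; for reference, a typical way to pin a std::thread to a core on Linux is pthread_setaffinity_np. The helper below is a sketch of that approach (the name pin_to_core is hypothetical), not the code actually used in the test:

#include <pthread.h>  // glibc/Linux target assumed; compile with -pthread
#include <sched.h>
#include <thread>

// Hypothetical helper: restrict a std::thread to a single core.
void pin_to_core(std::thread& t, int core) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);
  pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

// e.g. pin_to_core(producer_thread, 1);  // producer on core 1
//      pin_to_core(consumer_thread, 2);  // consumer on core 2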

Gaining performance from CPU affinity is not as simple as pinning thread 1 to core 1 and thread 2 to core 2. This is a complex and heavily researched topic; I'll touch on the highlights.

First off we need to define 'performance'. Typically, we are interested in throughput, latency and/or scalability. Combining all three is a tricky architecture question that has received intense scrutiny in the telecom, finance and other industries.

Your case appears to be driven by the throughput metric: we want the sum of wall-clock time across threads to be minimal. Another way to state your question might be, "What are the factors that affect throughput in a multithreaded process?"

Here are some of the many factors:

  1. Algorithm complexity has the greatest influence. Big O, theta, and little o are all extremely useful in complex cases. The example is trivial, but this still matters. On the surface the problem is O(n): the time will be linear in the number of elements allocated/deallocated. Your question strikes at the heart of this matter because it shows that physical computers don't perfectly model ideal computers.
  2. CPU resources. Having more CPUs to throw at the problem can help if the problem can be parallelized. Your problem carries the underlying assumption that two threads will be better than one; if so, perhaps four would be better than two. Again, your actual results contradict the theoretical model.
  3. Queuing model. Understanding the queuing model is vitally important if performance gains are to be achieved. The example appears to be the classic single-producer/single-consumer model.
  4. Other resources. Depending on the problem, a variety of other resources can constrain performance: disk space, disk throughput, disk latency, network capacity, socket availability. The example doesn't appear to suffer here.
  5. Kernel dependencies. Moving to a lower level, performance can be dramatically impacted by the amount of kernel interaction. Generally, kernel calls require a context switch, which can be expensive if done constantly. Your example likely suffers this issue via the calls to new/delete.
  6. Serial access. If a resource requires serial access, it will bottleneck a parallel algorithm. Your example appears to have two such problems: new/delete and enqueue/dequeue.
  7. CPU cache. The comments mentioned CPU caching as a possibility. The L2/L3 caches can be a source of cache misses as well as false sharing. I doubt that this is the main problem in your example, but it could be a factor (see the small illustration after this list).
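
To make the false-sharing point in item 7 concrete, here is a small illustration, assuming a 64-byte cache line; the types are illustrative only and not from the code above:

#include <atomic>

struct HotPair {                   // both counters share one cache line:
  std::atomic<long> produced;      // written by the producer core...
  std::atomic<long> consumed;      // ...while the consumer core writes this,
};                                 // so each write invalidates the other core's copy

struct PaddedPair {                // alignas(64) gives each counter its own line
  alignas(64) std::atomic<long> produced;
  alignas(64) std::atomic<long> consumed;
};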

Applying these ideas to your example, I see some issues. I assume you have two separate threads running hot and in parallel, one producing (new) and the other consuming (delete).

The heap is serial. Calling new and delete from different threads is a known performance bottleneck. Several small-block parallel allocators are available, including Hoard.
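
One common mitigation, sketched here under the assumption that objects can be recycled and using a hypothetical second queue named free_queue of the same type as the example's queue, is to return consumed objects to the producer for reuse so that new and delete stay on one thread:

// Sketch only: recycle objects instead of crossing the heap between threads.
Test* producer_get() {
  Test* t = free_queue.dequeue();  // hypothetical recycling queue
  return t ? t : new Test(0, 0);   // allocate only when no object is available
}

void consumer_put(Test* t) {
  free_queue.enqueue(t);           // hand back for reuse instead of delete
}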

The queue is likely serial. The implementation is not shown, but the enqueue/dequeue is likely a point of serialization between the two threads. There are many examples of lock-free ring buffers that can be used between multiple threads.
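
As an illustration of the idea (a sketch, not a drop-in replacement for the queue in the question), a minimal single-producer/single-consumer ring buffer can pass elements between exactly two threads without locks:

#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer sketch. N must be a power of two.
// Safe only for exactly one producer thread and one consumer thread.
template <typename T, std::size_t N>
class SpscRing {
  T buf_[N];
  std::atomic<std::size_t> head_{0};  // next slot to read; written by consumer
  std::atomic<std::size_t> tail_{0};  // next slot to write; written by producer
public:
  bool enqueue(const T& v) {          // returns false when full
    std::size_t t = tail_.load(std::memory_order_relaxed);
    if (t - head_.load(std::memory_order_acquire) == N)
      return false;
    buf_[t & (N - 1)] = v;
    tail_.store(t + 1, std::memory_order_release);
    return true;
  }
  bool dequeue(T& v) {                // returns false when empty
    std::size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire))
      return false;
    v = buf_[h & (N - 1)];
    head_.store(h + 1, std::memory_order_release);
    return true;
  }
};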

Thread starvation. In the example, if the producer is slower than the consumer, the consumer will just be idling much of the time. This is part of the queuing theory that must be considered when crafting a high-performance algorithm.
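
One way to stop an idle consumer from burning its core, sketched here with a plain mutex-guarded std::queue rather than the queue type from the question, is to block on a condition variable until work arrives:

#include <condition_variable>
#include <mutex>
#include <queue>

struct Test;  // the element type from the question

std::mutex m;
std::condition_variable cv;
std::queue<Test*> q;

void put(Test* t) {
  { std::lock_guard<std::mutex> lk(m); q.push(t); }
  cv.notify_one();                          // wake a sleeping consumer
}

Test* take() {
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, [] { return !q.empty(); });   // sleep instead of spin-polling
  Test* t = q.front();
  q.pop();
  return t;
}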

With all that background, we can now conclude that thread affinity isn't likely to matter until the serialization and starvation problems are solved. The two threads may in fact run slower, as they contend with one another for the shared resource or simply waste CPU time idling. As a result, overall throughput goes down and wall-clock time goes up.

There is huge demand in industry for engineers who understand these kinds of algorithms. Educating yourself is likely to be a profitable venture.
