Why does setting CPU affinity make threads run slower?
Hi all.
I wrote a small case to test a multi-threaded producer/consumer model. My testbed is a low-performance PC (8 GB RAM, quad-core J1900 CPU). I isolated core 0 for the Linux kernel; cores 1-3 are otherwise unused. The producer thread runs on core 1, allocates 5,000,000 small objects, and puts them into a global queue. The consumer thread runs on core 2 and deallocates the objects taken from the queue. But I found that if I do not set their CPU affinity (that is, both threads run on core 0), the time performance is better than with CPU affinity set (8.76 s vs. 14.66 s). The results are similar across repeated runs. Could someone explain the reason? If my premise ("setting CPU affinity can improve a multi-threaded process's performance") is incorrect, performance should at least not get worse. My code is listed below:
    void producer() {
        Timestamp begin;
        for ( int i = 0; i < data_nb; ++i ) {
            Test* test = new Test(i, i+1);
            queue.enqueue(test);
        }
        Timestamp end;
        TimeDuration td = end - begin;
        printf("producer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
    }

    void consumer() {
        Timestamp begin;
        do {
            Test* test = queue.dequeue();
            if ( test ) {
                nb.add(1); // nb is an atomic counter
                delete test;
                test = nullptr;
            }
        } while ( nb.get() < data_nb );
        Timestamp end;
        TimeDuration td = end - begin;
        //printf("%d data consumed\n", nb.get());
        printf("consumer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
    }
Gaining performance from CPU affinity is not as simple as pushing thread 1 to core 1 and thread 2 to core 2. This is a complex and heavily researched topic; I'll touch on the highlights.
First off, we need to define 'performance'. Typically, we are interested in throughput, latency, and/or scalability. Combining all three is a tricky architecture question that has received intense scrutiny in the telecom, finance, and other industries.
Your case appears to be driven by the throughput metric: we want the sum of wall-clock time across threads to be minimal. Another way to state your question might be, "What are the factors that affect throughput in a multi-threaded process?"
Here are some of the many factors, applied to your example. I assume you have two separate threads running hot and in parallel: one thread producing (new) and the other consuming (delete).
The heap is serial. Calling new and delete from different threads is a known performance bottleneck. Several small-block parallel allocators are available, including Hoard.
The queue is likely serial. The implementation is not shown, but enqueue/dequeue is likely a point of serialization between the two threads. There are many examples of lock-free ring buffers that can be used between multiple threads.
Thread starvation. In the example, if the producer is slower than the consumer, the consumer will just be idling much of the time. This is part of the queueing theory that must be considered when crafting a high-performance algorithm.
With all that background, we can conclude that thread affinity isn't likely to matter until the serialization and starvation problems are solved. The two threads may in fact run slower, because they contend with one another for the shared resource or just waste CPU time idling. As a result, overall throughput goes down and wall-clock time goes up.
There is huge demand in industry for engineers who understand these kinds of algorithms. Educating yourself is likely to be a profitable venture.