
Increasing the thread count from 2 to 3 doesn't increase the speed-up in Mandelbrot

I am using an Intel i5 processor with 4 cores and 4 threads. Currently I am working on simulating the Mandelbrot set using pthreads and ISPC (Intel SPMD Program Compiler). When I use two threads to compute the Mandelbrot set image, based on task division, i.e. spatial decomposition of the image, I see a 1.9x speed-up; with 3 threads I see only a 1.65x speed-up, and with 4 threads the speed-up saturates at 2.4x. Since the i5 has 4 threads, a 4x speed-up is expected given ample parallelism in the program (using pthreads). Why does the speed-up decline when using 3 threads? What are the reasons I don't see the expected speed-up? What are the ways to get a higher speed-up with ample parallelism in the case of Mandelbrot?

Note: I am compiling with gcc and using pthreads as the threading API. The division is based on spatial decomposition of the image. I am not using any locks or semaphores.

Wikipedia link for the Mandelbrot set: http://en.wikipedia.org/wiki/Mandelbrot_set

GitHub link for the ISPC documentation: http://ispc.github.io/

In case you find these questions irrelevant, please redirect me to appropriate sources. Thank you for your time.

Spatial decomposition does not know how expensive each pixel is to compute. One pixel may require 400 iterations while its neighboring pixel completes in just 2 iterations. Because of this, you should use a load-balancing algorithm; the easiest is to atomically increment a work-item index shared by all participating threads.

Here is my solution for load-balancing a Mandelbrot-like unknown workload:

atomic_index = 0
launch 3 threads
   in each thread:
       chunk = 100-150 pixels, or maybe bigger
       loop:
           my_index = atomic_index.fetch_add(chunk)
           if my_index >= total
               break                    // no work left
           for (j = 0 to chunk)
               i = my_index + j
               if i >= total
                   break
               compute_mandelbrot(i % width, i / width)

If you don't want to write a load-balancer, you can still fake load-balancing by simply launching 100 threads and letting the CPU/OS scheduler keep the cores busy until the Mandelbrot computation finishes. I tried this too, but the atomic load-balancer had ~10-15% higher performance because it does not create/destroy so many threads.

With dedicated threads that are kept alive, the performance gain from the load-balancer would be a bit higher still (perhaps ~100 microseconds saved per thread).

Atomic load-balancing is performance-aware. If you write an iterative load-balancer instead, you lose performance in the first few iterations (rendering the first few frames/zooms) until a fair balance is reached, whereas atomic load-balancing achieves balance within the first frame. The advantage of an iterative load-balancer is that it does not require threads to communicate through an atomic variable, which may be faster for some algorithms, but Mandelbrot rendering performs better with atomic load-balancing.

In your case, it probably worked like this:

2 threads:

"15 milliseconds run-time"
- - - - - - - - - \
- - - - - - - - -  =>
- - - - - - - - -  => 
- - - - - - - - -  => thread 1's spatial share (15 milliseconds)
- x x x - - - - -  =>
- - x x x x - - - / 
- - x x x x - - - \  
- x x x - - - - -  =>
- - - - - - - - -  =>
- - - - - - - - -  => thread 2's spatial share (15 milliseconds)
- - - - - - - - -  => 
- - - - - - - - - /
  • first thread: half empty, half Mandelbrot surface
  • second thread: half Mandelbrot surface, half empty

Both threads do work!

3 threads:

"25 ms run-time"
- - - - - - - - - \
- - - - - - - - -  =>
- - - - - - - - -  => thread 1's spatial share (5 milliseconds)
- - - - - - - - - /
- x x x - - - - - \
- - x x x x - - -  =>
- - x x x x - - -  => thread 2: (25 ms) Bottleneck!
- x x x - - - - - /
- - - - - - - - - \
- - - - - - - - - =>
- - - - - - - - - =>  thread 3's spatial share (5 milliseconds)
- - - - - - - - - /
  • first thread: empty pixels
  • second thread: Mandelbrot surface with a lot of depth. Expensive!
  • third thread: empty pixels

Only 1 thread does work!

So when you added a fourth thread, you doubled the number of threads working on the expensive interior again:

"12 milliseconds run-time"
- - - - - - - - - \ 
- - - - - - - - -  => thread 1: 3 milliseconds
- - - - - - - - - /  
- - - - - - - - - \
- x x x - - - - -  => thread 2: 12 milliseconds
- - x x x x - - - /
- - x x x x - - - \
- x x x - - - - -  => thread 3: 12 milliseconds
- - - - - - - - - /
- - - - - - - - - \
- - - - - - - - -  => thread 4: 3 milliseconds
- - - - - - - - - /
  • first thread: empty
  • second thread: half of the Mandelbrot surface
  • third thread: half of the Mandelbrot surface
  • fourth thread: empty

with the extra thread working on the other empty area for a minor speed-up.

You can also do the spatial distribution on 2D chunks, but it is still not free of slowdowns caused by the unknown work per pixel. You can partially solve this by sampling the pixels at the 4 corner points of each 2D chunk. This gives you an idea of how expensive the whole chunk is to compute. For example, if the 4 sampled corner pixels take different numbers of iterations, the interior of the chunk will quite possibly behave similarly, so you can compute a "cost" estimate per 2D chunk, sort all chunks by their cost values, and start processing them one by one from the highest cost. Once all of the highest-cost chunks are placed on all cores, the cores stay busier than with simple spatial work distribution, and without any atomic messaging. But this has an overhead of computing 4 pixels per chunk, multiplied by the total number of chunks.

On an FX8150 3.6GHz CPU (8 cores), at 2000x2000 pixels with 35 max iterations per pixel, rendering yielded these running times:

  • 8 threads with equal distribution: 30+ milliseconds
  • 32 threads with equal distribution: 21 milliseconds
  • 128 threads with equal distribution: ~25 milliseconds
  • 8 threads with atomic load-balancing: 18 milliseconds

with SIMD (AVX) usage through the GCC compiler's auto-vectorization. This was partly because the CPU has only 4 real FPUs shared by 8 cores, which makes it look like only 4 threads sharing the rendering workload, but load balancing still improved it considerably.

There are also more efficient work-distribution algorithms, but they do not map well to the x86 architecture. For example, you can:

  • do the first iteration of all pixels
  • enqueue second iterations into a queue, but only for pixels that still require a second iteration
  • do the second iterations and enqueue the third iterations
  • repeat until the queue receives zero iterations

This way, each new queue can be processed with a simple equal distribution over n threads. But x86 has no hardware acceleration for this workflow and would bottleneck on the enqueue steps. It probably works much better on GPUs, since they have hardware acceleration for synchronization primitives that x86 lacks.
