
Performance of parallel OpenMP code with and without compiler optimization (Sun CC)

I am working on a project where we were asked to write simple OpenMP code to parallelize a program that works with differential equations. We were also asked to test the performance of the code with and without compiler optimizations. I'm working with the Sun CC compiler, so for the optimized version I used the options

-xopenmp -fast

and for the non-optimized version

-xopenmp=noopt

Not surprisingly, the running time with compiler optimization on was much lower than in the other case. What surprises me is that the scaling performance is much better for the non-optimized version. Here, by performance I mean the speed-up coefficient, that is, the ratio of the running time of the program on 1 processor to its running time on M processors.
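To make the speed-up coefficient concrete, here is a minimal sketch of how it and the related parallel efficiency could be computed from measured wall-clock times. The function names and the timings are invented for illustration; they are not measurements from the program in question.

```python
def speedup(t_serial, t_parallel):
    """Speed-up S(M) = T(1) / T(M)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Parallel efficiency E(M) = S(M) / M; 1.0 means perfect scaling."""
    return speedup(t_serial, t_parallel) / cores


# Hypothetical wall-clock times in seconds, keyed by thread count.
timings = {1: 80.0, 2: 40.0, 4: 25.0}

for m, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, m)
    print(f"M={m}: speed-up {s:.2f}, efficiency {e:.2f}")
```

Comparing efficiency rather than raw running time is what lets a slow build look like it "scales better" than a fast one.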

It was hinted that this could be because the optimized version is memory-bound, while the non-optimized version is CPU-bound. I am not sure how this "boundness" could influence the scaling capability of my code. Do you have any suggestions?

On most multi-processor systems, multiple CPU cores share a single path to memory. A given output binary has a certain inherent computational intensity (calculations per byte accessed) per thread. Once you run the code on enough cores that their combined operation rate requires more memory bandwidth than the system can supply, it stops scaling with additional cores. For a good way to reason about this kind of issue, look up the 'roofline model'.
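As a rough sketch of the roofline idea, the attainable operation rate can be estimated as the smaller of the compute roof (cores times per-core rate) and the memory roof (shared bandwidth times computational intensity). The machine and kernel numbers below are made up for illustration, and the model ignores caches and other real-world effects.

```python
def attainable_rate(cores, peak_per_core, mem_bandwidth, intensity):
    """Simplified roofline estimate in operations per second.

    cores * peak_per_core      -> compute roof (ops/s)
    mem_bandwidth * intensity  -> memory roof (bytes/s * ops/byte = ops/s)
    """
    return min(cores * peak_per_core, mem_bandwidth * intensity)


# Hypothetical values, chosen only to show the shape of the curve.
PEAK_PER_CORE = 2.0e9   # ops/s that one core can sustain
MEM_BANDWIDTH = 10.0e9  # bytes/s, shared by all cores
INTENSITY = 0.5         # ops per byte accessed

for n in (1, 2, 4, 8):
    rate = attainable_rate(n, PEAK_PER_CORE, MEM_BANDWIDTH, INTENSITY)
    print(f"{n} cores: {rate / 1e9:.1f} Gops/s")
```

With these assumed numbers the rate grows linearly up to 2 cores and is capped at the memory roof from 4 cores onward, which is the "stops scaling" behaviour described above.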

There are two changes I'd expect to see from enabling optimization. One is that the computational intensity should increase somewhat, if the optimizer applies any sort of loop blocking to reduce memory accesses. The other is that the raw operation rate should increase, thanks to better identification of vectorization opportunities and the subsequent instruction selection and scheduling. These two things should have opposite effects on scaling efficiency: higher intensity delays the point at which memory bandwidth runs out, while a higher per-core operation rate brings it closer. The latter clearly dominates in your case.
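The two opposing effects can be illustrated by estimating the core count at which the memory roof is reached, which is bandwidth times intensity divided by the per-core operation rate. This is only a sketch: the per-core rates and intensities assigned to the two builds are hypothetical, not values measured with `-fast` or `-xopenmp=noopt`.

```python
def saturation_cores(peak_per_core, mem_bandwidth, intensity):
    """Core count at which the compute roof meets the memory roof.

    Beyond this many cores, extra cores add no throughput in the
    simplified roofline model.
    """
    return mem_bandwidth * intensity / peak_per_core


MEM_BANDWIDTH = 10.0e9  # bytes/s, hypothetical shared bandwidth

# Hypothetical non-optimized build: slow per core, low intensity.
unopt = saturation_cores(0.5e9, MEM_BANDWIDTH, 0.5)

# Hypothetical optimized build: intensity doubles (e.g. loop blocking),
# but the per-core rate rises eightfold (e.g. vectorization).
opt = saturation_cores(4.0e9, MEM_BANDWIDTH, 1.0)

print(f"non-optimized saturates at about {unopt:.1f} cores")
print(f"optimized saturates at about {opt:.1f} cores")
```

Under these assumptions the optimized build runs out of memory bandwidth at far fewer cores, so its speed-up curve flattens earlier even though its absolute running time is lower.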
