
Why is Numpy with Ryzen Threadripper so much slower than Xeon?

I know that Numpy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so people usually suggest using OpenBLAS on AMD, right?

I use the following test code:

import numpy as np

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)  # 2000 x 4000 random matrix
    np.linalg.eigh(X @ X.T)          # eigendecomposition of the symmetric 2000 x 2000 matrix X @ X.T

%timeit testfunc(0)

I have tested this code using different CPUs:

  • On Intel Xeon E5-1650 v3, this code performs in 0.7s using 6 out of 12 cores.
  • On AMD Ryzen 5 2600, this code performs in 1.45s using all 12 cores.
  • On AMD Ryzen Threadripper 3970X, this code performs in 1.55s using all 64 cores.

I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for Numpy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']). The CPU core usage was determined by observing top in a Linux shell:

  • For the Intel Xeon E5-1650 v3 CPU (6 physical cores), it shows 12 cores (6 idling).
  • For the AMD Ryzen 5 2600 CPU (6 physical cores), it shows 12 cores (none idling).
  • For the AMD Ryzen Threadripper 3970X CPU (32 physical cores), it shows 64 cores (none idling).

The above observations give rise to the following questions:

  1. Is it normal that linear algebra on up-to-date AMD CPUs using OpenBLAS is that much slower than on a six-year-old Intel Xeon? (also addressed in Update 3)
  2. Judging by the observations of the CPU load, it looks like Numpy utilizes the multi-core environment in all three cases. How can it be that the Threadripper is even slower than the Ryzen 5, even though it has almost six times as many physical cores? (also see Update 3)
  3. Is there anything that can be done to speed up the computations on the Threadripper? (partially answered in Update 2)

Update 1: The OpenBLAS version is 0.3.6. I read somewhere that upgrading to a newer version might help; however, with OpenBLAS updated to 0.3.10, the performance of testfunc is still 1.55s on the AMD Ryzen Threadripper 3970X.
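To double-check which BLAS library (and which version of it) the running NumPy actually loads, here is a minimal sketch using the threadpoolctl package; threadpoolctl is an assumption on my side and is not part of the original setup:

import numpy  # importing numpy loads the BLAS runtime into the process
from threadpoolctl import threadpool_info  # assumes `pip install threadpoolctl`

# List every BLAS/OpenMP thread pool loaded in this process, including the
# internal API (openblas or mkl), its version string, and the thread count.
for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("version"), "threads:", pool.get("num_threads"))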


Update 2: Using the MKL backend for Numpy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time for testfunc on the AMD Ryzen Threadripper 3970X to only 0.52s, which is actually more or less satisfying. FTR, setting this variable via ~/.profile did not work for me on Ubuntu 20.04. Also, setting the variable from within Jupyter did not work. So instead I put it into ~/.bashrc, which works now. Anyway, performing 35% faster than an old Intel Xeon, is this all we get, or can we get more out of it?
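For anyone debugging why the variable is not picked up: a minimal check (my own addition, not from the original post) is to print what the Python process actually sees before NumPy, and therefore MKL, is imported, since MKL presumably reads the variable when the library is loaded:

import os

# Must print '5'; None means the shell or Jupyter kernel never received the variable.
print(os.environ.get("MKL_DEBUG_CPU_TYPE"))

import numpy as np  # import NumPy (and thereby MKL) only after the variable is confirmed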


Update 3: I play around with the number of threads used by MKL/OpenBLAS (a sketch of one way to pin the thread count follows the conclusions below):

[Table: run times by number of threads, for MKL vs. OpenBLAS]

The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test:

  • The single-core performance of the Threadripper using OpenBLAS is a bit better than the single-core performance of the Xeon (11% faster); however, its single-core performance is even better when using MKL (34% faster).
  • The multi-core performance of the Threadripper using OpenBLAS is ridiculously worse than the multi-core performance of the Xeon. What is going on here?
  • The Threadripper performs overall better than the Xeon when MKL is used (26% to 38% faster than the Xeon). The overall best performance is achieved by the Threadripper using 16 threads and MKL (36% faster than the Xeon).
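For reference, here is a sketch of one way to pin the number of BLAS threads per run without restarting the interpreter, using the threadpoolctl package (an assumption on my side; the table above may just as well have been produced via MKL_NUM_THREADS / OMP_NUM_THREADS):

import timeit
import numpy as np
from threadpoolctl import threadpool_limits  # assumes threadpoolctl is installed

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)

# Cap the BLAS thread pool at different sizes and report the best of 3 runs.
for n_threads in (1, 2, 4, 8, 16, 32, 64):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t = min(timeit.repeat(lambda: testfunc(0), number=1, repeat=3))
    print(f"{n_threads:>2} threads: {t:.2f} s")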

Update 4: Just for clarification. No, I do not think that either (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which strongly contradicts the numbers I observed. According to my numbers, OpenBLAS performs ridiculously worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it does not explain why OpenBLAS is that dead slow. Nor does it explain why, even with MKL and MKL_DEBUG_CPU_TYPE=5, the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.

As of 2021, Intel unfortunately removed MKL_DEBUG_CPU_TYPE to prevent people on AMD from using the workaround presented in the accepted answer. This means that the workaround no longer works with current MKL releases, and AMD users have to either switch to OpenBLAS or stick with an older MKL version (2019, as shown below).

To use the workaround, follow this method:

  1. Create a conda environment in which NumPy is built against MKL 2019.
  2. Activate the environment.
  3. Set MKL_DEBUG_CPU_TYPE=5.

The commands for the above steps:

  1. conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl
  2. conda activate my_env
  3. conda env config vars set MKL_DEBUG_CPU_TYPE=5

And that's it!
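Note that variables set via conda env config vars only take effect after the environment is re-activated. A quick sanity check (my own addition, not part of the original answer) inside the activated environment:

import os
import numpy as np

print(os.environ.get("MKL_DEBUG_CPU_TYPE"))  # should print '5'
np.show_config()  # the listed libraries should include mkl_rt rather than openblas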

I think this should help:

"The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND,OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set." “图表中的最佳结果是 TR 3960x 在环境 var MKL_DEBUG_CPU_TYPE=5 的情况下使用 MKL。它明显优于仅来自 MKL 的低优化代码路径。而且,OpenBLAS 的效果几乎与 MKL_DEBUG_CPU_TYPE=5 的 MKL 一样好放。” https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/ https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications- 1637/

How to set up: 'Make the setting permanent by entering MKL_DEBUG_CPU_TYPE=5 into the System Environment Variables. This has several advantages, one of them being that it applies to all instances of Matlab and not just the one opened using the .bat file.' https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/?sort=new

Wouldn't it make sense to try using an optimized BLIS library from AMD?

Maybe I am missing (misunderstanding) something, but I would assume you could use BLIS instead of OpenBLAS. The only potential problem could be that AMD BLIS is optimized for AMD EPYC (but you're using Ryzen). I'm VERY curious about the results, since I'm in the process of buying a server for work, and am considering AMD EPYC and Intel Xeon.

Here are the respective AMD BLIS libraries: https://developer.amd.com/amd-aocl/
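To make results from different builds (OpenBLAS, MKL, or an AMD BLIS-linked NumPy) directly comparable, a small self-contained timing script along these lines could be run once per environment; it simply reuses testfunc from the question, everything else is my own sketch:

import timeit
import numpy as np

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)

np.show_config()  # shows which BLAS this NumPy build is linked against
best = min(timeit.repeat(lambda: testfunc(0), number=1, repeat=5))
print(f"best of 5 runs: {best:.2f} s")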
