Why is Numpy with Ryzen Threadripper so much slower than Xeon?
I know that Numpy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so usually people suggest using OpenBLAS on AMD, right?

I use the following test code:
import numpy as np

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)

%timeit testfunc(0)
I have tested this code using different CPUs:
I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for Numpy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']). The CPU core usage was determined by observing top in a Linux shell:
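For reference, the backend check can be scripted rather than read by eye. A small sketch that captures the output of np.show_config() and scans it for the backend name (string matching on the printed config is an assumption, but it holds for typical MKL and OpenBLAS builds):

```python
# Sketch: detect the linked BLAS backend by scanning np.show_config() output.
import io
from contextlib import redirect_stdout

import numpy as np

buf = io.StringIO()
with redirect_stdout(buf):
    np.show_config()  # prints the build/link configuration to stdout
config_text = buf.getvalue().lower()

if "mkl" in config_text:
    backend = "mkl"
elif "openblas" in config_text:
    backend = "openblas"
else:
    backend = "unknown"
print(f"Detected BLAS backend: {backend}")
```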
The above observations give rise to the following questions:
Update 1: The OpenBLAS version is 0.3.6. I read somewhere that upgrading to a newer version might help; however, with OpenBLAS updated to 0.3.10, the performance for testfunc is still 1.55 s on the AMD Ryzen Threadripper 3970X.
Update 2: Using the MKL backend for Numpy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time for testfunc on the AMD Ryzen Threadripper 3970X to only 0.52 s, which is actually more or less satisfying. FTR, setting this variable via ~/.profile did not work for me on Ubuntu 20.04, and neither did setting it from within Jupyter. So instead I put it into ~/.bashrc, which works now. Anyway, performing 35% faster than an old Intel Xeon, is this all we get, or can we get more out of it?
Update 3: I played around with the number of threads used by MKL/OpenBLAS:

The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test:
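A thread sweep like the one above can be scripted. The sketch below launches a fresh interpreter per thread count, because OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are only read when NumPy first initializes its BLAS; the short thread list is just for illustration:

```python
# Sketch: time the benchmark under different BLAS thread counts, one
# fresh Python process per setting so the env vars take effect.
import os
import subprocess
import sys

BENCH = """
import time, numpy as np
np.random.seed(0)
X = np.random.randn(2000, 4000)
t0 = time.time()
np.linalg.eigh(X @ X.T)
print(round(time.time() - t0, 3))
"""

timings = {}
for n in (1, 2, 4):  # extend to 8, 16, 32 ... on larger machines
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n), MKL_NUM_THREADS=str(n))
    result = subprocess.run([sys.executable, "-c", BENCH], env=env,
                            capture_output=True, text=True, check=True)
    timings[n] = float(result.stdout)
    print(f"{n:2d} threads: {timings[n]} s")
```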
Update 4: Just for clarification: no, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which strongly contradicts the numbers I observed. According to my numbers, OpenBLAS performs ridiculously worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it explains neither why OpenBLAS is that dead slow, nor why, even with MKL and MKL_DEBUG_CPU_TYPE=5, the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.
As of 2021, Intel has unfortunately removed MKL_DEBUG_CPU_TYPE to prevent people on AMD from using the workaround presented in the accepted answer. This means that the workaround no longer works with current MKL releases, and AMD users have to either switch to OpenBLAS or keep using an old MKL version.
To use the workaround, follow this method:

1. Create a conda environment with conda's and NumPy's MKL=2019.
2. Set MKL_DEBUG_CPU_TYPE=5.

The commands for the above steps:

conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl
conda activate my_env
conda env config vars set MKL_DEBUG_CPU_TYPE=5

And that's it!
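As a quick sanity check (my own addition, not part of the original recipe): conda env config vars only take effect after the environment is re-activated, so it is worth confirming from inside Python that the variable actually reached the process:

```python
# Sketch: verify MKL_DEBUG_CPU_TYPE is visible to the current process.
import os

value = os.environ.get("MKL_DEBUG_CPU_TYPE")
if value == "5":
    print("MKL_DEBUG_CPU_TYPE=5 is active")
else:
    print(f"MKL_DEBUG_CPU_TYPE is {value!r}; re-activate the conda env")
```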
I think this should help:

"The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND, OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set." https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/
How to set up: "Make the setting permanent by entering MKL_DEBUG_CPU_TYPE=5 into the System Environment Variables. This has several advantages, one of them being that it applies to all instances of Matlab and not just the one opened using the .bat file" https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/?sort=new
Wouldn't it make sense to try using an optimized BLIS library from AMD?

Maybe I am missing (misunderstanding) something, but I would assume you could use BLIS instead of OpenBLAS. The only potential problem could be that AMD BLIS is optimized for AMD EPYC (but you're using Ryzen). I'm VERY curious about the results, since I'm in the process of buying a server for work and am considering AMD EPYC and Intel Xeon.

Here are the respective AMD BLIS libraries: https://developer.amd.com/amd-aocl/