

Numba bytecode generation for generic x64 processors? Rather than 1st run compiling a SLOW @njit(cache=True) argument

I have a pretty large project converted to Numba, and run #1 with @nb.njit(cache=True, parallel=True, nogil=True) is slow (about 15 seconds vs. 0.2-1 seconds on later runs, after compiling). I realize it's compiling machine code optimized for the specific PC I'm running it on, but since the code is distributed to a large audience, I don't want it to take forever compiling the first run after we deploy our model. What is not covered in the documentation is a "generic x64" cache=True method. I don't care if the code is a little slower on a PC that doesn't have my specific processor; I only care that the initial and subsequent runtimes are quick, and I'd prefer that they don't differ by a huge margin if I distribute a cache file for the @njit functions at deployment.

Does anyone know if such a "generic" x64 implementation is possible using Numba? Or are we stuck with a slow run #1 and fast runs thereafter?

Please comment if you want more details; basically it's a roughly 50-line function that gets JIT-compiled via Numba and afterwards runs quite fast in parallel with no GIL. But I'm willing to give up some extreme optimization if the code can work in a generic form across multiple processors. Where I work, the PCs can vary quite a bit in how advanced they are.

I looked briefly at AOT (ahead-of-time) compilation of Numba functions, but in this case these functions have so many variables being altered that I think it would take me weeks to properly annotate the functions so they compile without a Numba dependency. I really don't have the time for AOT; it would make more sense to just rewrite the whole algorithm in Cython, but that's more like C/C++ and more time-consuming than I want to devote to this project. Unfortunately there is not (to my knowledge) a Numba -> Cython compiler project out there already. Maybe there will be in the future (which would be outstanding), but I don't know of such a project currently.


Unfortunately, you have mainly listed all the currently available options. Numba functions can be cached, and the signature can be specified so as to perform an eager compilation (compilation at the time of the function definition) instead of a lazy one (compilation during the first execution). Note that the cache=True flag is only meant to skip compilation when it has already been done on the same platform before, not to share the code between multiple machines. AFAIK, the internal JIT used by Numba (llvmlite) does not support that. In fact, doing that is exactly the purpose of AOT compilation. That being said, AOT compilation requires the signatures to be provided (this is mandatory whatever approach/tool is used, as long as the function is compiled ahead of time) and it has quite strong limitations (e.g. currently there is no support for parallel code or fastmath). Keep in mind that Numba's main use case is just-in-time compilation, not ahead-of-time compilation.

Regarding your use case, using Cython appears to make much more sense: the functions are pre-compiled once for some generic platforms, and the compiled binaries can be provided directly to users without the need for recompilation on the target machine.


I don't care if the code is a little slower on a PC that doesn't have my specific processor.

Well, regarding your code, using a "generic" x86-64 build can be much slower. The main reason lies in the use of SIMD instructions. Indeed, all x86-64 processors support the SSE2 instruction set, which provides basic 128-bit SIMD registers working on integers and floating-point numbers. For about a decade, x86-64 processors have supported the 256-bit AVX instruction set, which significantly speeds up floating-point computations. For at least 7 years, almost all mainstream x86-64 processors have supported the AVX2 instruction set, which mainly speeds up integer computations (although it also improves floating-point code thanks to new features). For nearly a decade, the FMA instruction set has been able to speed up code using fused multiply-adds by a factor of 2. Recent Intel processors support the 512-bit AVX-512 instruction set, which not only doubles the number of items that can be computed per instruction but also adds many useful features. In the end, SIMD-friendly code can be up to an order of magnitude faster with the newer instruction sets than with the baseline "generic" SSE2 instruction set. Compilers (e.g. GCC, Clang, ICC) are meant to generate portable code by default, and thus they only use SSE2 by default. Note that NumPy already uses such "new" features to speed up a lot of functions (see sorts, argmin/argmax, log/exp, etc.).

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original source.

 