
Can you GPU accelerate a simple math equation in python such as: y = 1/x and how do you do it?

Can I use the cores in my GPU to accelerate this problem and speed it up? If so, how do I do it? At around 10 trillion iterations, a single thread on my CPU just can't do it, which is why I want to accelerate it with my GPU. I am also interested in seeing any multi-threaded CPU answers, but I really want to see it done on the GPU. Ideally I'd like the answers to be as simple as possible.

My code:

y=0

for x in range(1, 10000000000):
    y += 1/x

print(y)

Yes, this operation can be done on a GPU using a basic parallel reduction. In fact, it is well suited to GPUs since it is mainly embarrassingly parallel and makes heavy use of floating-point/integer operations.

Note that the behaviour of such a basic series is AFAIK well known analytically (as pointed out by @jfaccioni), and thus you should prefer analytical solutions, which are generally far cheaper to compute. Note also that client-side GPUs are not great at efficiently computing 64-bit floating-point (FP) numbers, so you should generally use 32-bit ones to see a speed-up, at the expense of lower precision. That being said, server-side GPUs can compute 64-bit FP numbers efficiently, so the best solution depends on the hardware you actually have.
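For reference, here is a minimal sketch of that analytical shortcut: the partial sums of this series (the harmonic numbers) are well approximated by ln(n) + γ, where γ is the Euler-Mascheroni constant, so the whole computation collapses to constant time. The helper name `harmonic_approx` is just an illustration, not part of the original answer.

import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic_approx(n):
    # H_n = sum(1/x for x in range(1, n + 1)) ~ ln(n) + gamma + 1/(2n)
    return math.log(n) + EULER_GAMMA + 1.0 / (2.0 * n)

# The question's loop sums x = 1 .. 9_999_999_999
print(harmonic_approx(10_000_000_000 - 1))  # ~ 23.603...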

Nvidia GPUs are usually programmed with CUDA, which is pretty low-level compared to basic pure-Python code. There are Python wrappers and higher-level libraries, but in this case most are not efficient since they cause unnecessary memory loads/stores or other overheads. AFAIK, PyCUDA and Numba are likely the best tools for this so far. If your GPU is not an Nvidia GPU, then you can use libraries based on OpenCL (as CUDA is not really well supported on non-Nvidia GPUs yet).

Numba supports high-level reductions, so it can be done very easily (note that Numba uses CUDA internally, so you need an Nvidia GPU):

from numba import cuda

# "out" is an array with 1 FP item that must be initialized to 0.0
@cuda.jit
def vec_add(out):
    # Compute the global thread index from the block and thread indices
    x = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bdx = cuda.blockDim.x
    i = bx * bdx + x
    # Each thread atomically adds its term 1/(i+1) to the single output cell
    if i < 10_000_000_000 - 1:
        cuda.atomic.add(out, 0, 1.0 / (i + 1))

This is only the GPU kernel, not the whole code. For more information about how to run it, please read the documentation. In general, one needs to take care of data transfers, allocations, kernel dependencies, streams, etc. Keep in mind that GPUs are hard to program (efficiently). This kernel is simple but clearly not optimal, especially on old GPUs without hardware atomic acceleration units. To write a faster kernel, you need to perform a local reduction using an inner loop (see the sketch below). Also note that C++ is much better for writing efficient kernel code, especially with libraries like CUB (based on CUDA), which supports iterators and high-level, flexible, efficient primitives.
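For completeness, below is a hedged sketch of what the host side and a slightly better kernel could look like with Numba: a grid-stride loop lets each thread accumulate a private partial sum (the local reduction mentioned above) and issue a single atomic add, and the host code handles allocation, the kernel launch and the copy back. The kernel name `partial_inv_sum`, the grid configuration and the reduced term count are illustrative assumptions, not taken from the original answer.

import numpy as np
from numba import cuda

@cuda.jit
def partial_inv_sum(out, n):
    i = cuda.grid(1)           # global thread index
    stride = cuda.gridsize(1)  # total number of threads launched
    acc = 0.0
    # Grid-stride loop: each thread sums every `stride`-th term privately.
    x = i + 1
    while x <= n:
        acc += 1.0 / x
        x += stride
    # One atomic add per thread instead of one per term.
    cuda.atomic.add(out, 0, acc)

# Host side: allocate/transfer the output, launch the kernel, copy back.
N_TERMS = 100_000_000          # illustrative size; the question uses ~10**10
THREADS_PER_BLOCK = 256
BLOCKS = 1024                  # enough blocks to keep the GPU busy; tune it

out = cuda.to_device(np.zeros(1, dtype=np.float64))
partial_inv_sum[BLOCKS, THREADS_PER_BLOCK](out, N_TERMS)
print(out.copy_to_host()[0])

With this structure the number of atomic operations is bounded by the number of threads (here 256 * 1024) rather than by the number of terms, which is what makes it much cheaper than the simple one-atomic-per-term kernel above.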


Note that Numba can also be used to implement fast parallel CPU codes. Here is an example:

import numba as nb

# fastmath and parallel=True let Numba spread the prange loop (and its
# reduction on y) across all CPU cores and vectorize the inner work.
@nb.njit('float64(int64)', fastmath=True, parallel=True)
def compute(limit):
    y = 0.0
    for x in nb.prange(1, limit):
        y += 1 / x
    return y

print(compute(10000000000))

This takes only 0.6 seconds on my 10-core CPU machine. CPU code has the benefit of being simpler, easier to maintain, more portable and more flexible, despite possibly being slower.

(Nvidia CUDA) GPUs can be used with the proper modules installed. However, a standard multiprocessing solution (not multithreading, since this is a computation-only task) is easy enough to achieve, and reasonably efficient:

import multiprocessing as mp

def inv_sum(start):
    # Partial sum over one chunk of one million consecutive terms
    return sum(1/x for x in range(start, start + 1000000))

def main():
    # One worker per CPU core; each chunk is summed independently
    pool = mp.Pool(mp.cpu_count())
    result = sum(pool.map(inv_sum, range(1, 1000000000, 1000000)))
    print(result)

if __name__ == "__main__":
    main()

I didn't dare to test run it on one trillion, but one billion on my 8-core i5 laptop runs in about 20 seconds.
