

Optimising Monte-Carlo algorithm | Reduce operation on GPU & Eigenvalues problem | Many-body problem

This problem resembles a typical many-body problem, but with some extra calculations.

I am working on a generalized Metropolis Monte-Carlo algorithm for modelling a large number of arbitrary quantum systems (magnetic ions, for example) that interact classically with each other. But that actually doesn't matter for the question.

There are more than 100,000 interacting objects, each of which can be described by a coordinate and a set of parameters describing its current state, r_i and s_i.

These can be represented in C++/CUDA as float4 vectors.
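
For illustration only, a minimal sketch of how such a layout could be allocated on the device (d_r, d_s and N follow the code further below; h_r and h_s are hypothetical host arrays):

// Positions r_i and states s_i, one float4 per object.
float4 *d_r = nullptr, *d_s = nullptr;
cudaMalloc(&d_r, N * sizeof(float4));
cudaMalloc(&d_s, N * sizeof(float4));
// Fill from (hypothetical) host arrays h_r, h_s:
cudaMemcpy(d_r, h_r, N * sizeof(float4), cudaMemcpyHostToDevice);
cudaMemcpy(d_s, h_s, N * sizeof(float4), cudaMemcpyHostToDevice);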

To update such a system with the Metropolis Monte-Carlo method, we need to randomly sample one object from the whole set, calculate the interaction function f(r_j - r_i, s_j) for it, substitute the result into some matrix, and find the eigenvectors of that matrix, from which a new state is calculated.

The interaction is additive as usual, i.e. the total interaction is the sum over all pairs.

Formally, this can be decomposed into the following steps:

  1. Generate a random index i.
  2. Calculate the interaction function f(r_j - r_i, s_j) for all possible pairs.
  3. Sum it up. The result will be a vector F.
  4. Multiply it by some tensor and add another one: h = h + dot(F, t). Some basic linear algebra.
  5. Find the eigenvectors and eigenvalues, choose one vector V_k based on some simple algorithm, and write it back into the array s_j of all objects' states.

The big question is which parts of this can be computed in CUDA kernels.

I am quite new to CUDA programming. So far I have ended up with the following algorithm:

//a good random generator
std::uniform_int_distribution<std::mt19937::result_type> random_sampler(0, N-1);

for(int i=0; i<a_lot; ++i) {
  //sample a number of object
  nextObject = random_sampler(rng);

  //call the kernel to calculate the interaction, reduce it across threads, and write the new state back into the d_s array
  CUDACalcAndReduce<THREADS><<<blocksPerGrid, THREADS>>>(d_r, d_s, d_sum, newState, nextObject, previousObject, N);

  //copy the sum
  cudaMemcpy(buf, d_sum, sizeof(float)*4*blocksPerGrid, cudaMemcpyDeviceToHost);

  //manually reduce the rest of the sum
  total = buf[0];
  for (int b=1; b<blocksPerGrid; ++b) {   // 'b' to avoid shadowing the outer loop index
    total += buf[b];
  }

  //find eigenvalues and etc. and determine a new state of the object 
  //just linear algebra with complex numbers
  newState = calcNewState(total);

  //a new state will be written by CUDA function on the next iteration

  //remember the previous number of the object
  previousObject = nextObject;

}
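
For reference, here is a rough sketch of what a kernel along the lines of CUDACalcAndReduce could look like. The interaction function below is only a placeholder for the real, application-specific f(r_j - r_i, s_j), and the write-back of the previously sampled object's new state (mentioned in the comment above) is omitted:

// Placeholder for the application-specific pair interaction f(r_j - r_i, s_j).
__device__ float4 pairInteraction(float4 dr, float4 s) {
    float inv_r2 = 1.0f / (dr.x*dr.x + dr.y*dr.y + dr.z*dr.z + 1e-12f);
    return make_float4(s.x*inv_r2, s.y*inv_r2, s.z*inv_r2, s.w*inv_r2);
}

// Sketch of the interaction + per-block reduction kernel.
// THREADS must be a power of two; blockSums must hold blocksPerGrid float4 values.
template<int THREADS>
__global__ void CalcAndReduceSketch(const float4* r, const float4* s, float4* blockSums,
                                    int sampled, int N) {
    __shared__ float4 sh[THREADS];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    float4 ri  = r[sampled];
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);

    // Grid-stride loop over all partners j != sampled.
    for (int j = idx; j < N; j += gridDim.x * blockDim.x) {
        if (j == sampled) continue;
        float4 dr = make_float4(r[j].x - ri.x, r[j].y - ri.y, r[j].z - ri.z, 0.f);
        float4 fj = pairInteraction(dr, s[j]);
        acc.x += fj.x; acc.y += fj.y; acc.z += fj.z; acc.w += fj.w;
    }

    // Standard shared-memory block reduction; one partial sum per block.
    sh[tid] = acc;
    __syncthreads();
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sh[tid].x += sh[tid + stride].x;  sh[tid].y += sh[tid + stride].y;
            sh[tid].z += sh[tid + stride].z;  sh[tid].w += sh[tid + stride].w;
        }
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sh[0];
}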

The problem is the continuous transfer of data between the CPU and GPU, and the actual number of bytes transferred, blocksPerGrid*4*sizeof(float), is sometimes just a few bytes. I optimized the CUDA code following the guide from NVIDIA, and it is now limited by the bus speed between the CPU and GPU. I guess switching to pinned memory will not make any difference, since the number of transferred bytes is so low.

I used the Nvidia Visual Profiler and it shows the following (profiler timeline screenshot):

Most of the time is wasted on transferring the data to the CPU. As the inset shows, the speed is 57.143 MB/s and the size is only 64 B!
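
For scale: 64 B at 57.143 MB/s is about 64 / 57.1e6 s ≈ 1.1 µs per iteration, which suggests the cost is dominated by per-transfer latency rather than by bandwidth.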

The question is: is it worth moving the logic of the eigenvalue algorithm into a CUDA kernel?

Then there would be no data transfer between the CPU and GPU. The problem with this algorithm is that only one object can be updated per iteration, which means I could run the eigensolver on only one CUDA core. ;( Will that be so much slower than my CPU that it eliminates the advantage of keeping the data in GPU RAM?

The matrix size for the eigensolver does not exceed 10x10 complex numbers. I've heard that cuBLAS can be run entirely from CUDA kernels without calling CPU functions, but I am not sure how that is implemented.
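
If the eigensolve were moved on-device, one structural option (sketch only) is a single-block kernel that finishes the reduction over the per-block partial sums, assembles the 10x10 Hermitian matrix, and lets one thread diagonalize it. The eigensolver itself is left here as a hypothetical device helper (e.g. cyclic Jacobi sweeps), not a real library call:

#include <cuComplex.h>   // cuFloatComplex

// Hypothetical device helper (declaration only): eigen-decomposition of a
// 10x10 complex Hermitian matrix, e.g. via cyclic Jacobi sweeps.
__device__ void heev10(cuFloatComplex H[10][10], float eig[10], cuFloatComplex vec[10][10]);

__global__ void finishStepOnDevice(const float4* blockSums, int numBlocks,
                                   float4* s, int sampled) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // 1) finish the reduction over the per-block partial sums
        float4 F = make_float4(0.f, 0.f, 0.f, 0.f);
        for (int b = 0; b < numBlocks; ++b) {
            F.x += blockSums[b].x;  F.y += blockSums[b].y;
            F.z += blockSums[b].z;  F.w += blockSums[b].w;
        }

        // 2) h = h + dot(F, t) and assembly of the 10x10 Hermitian matrix (model-specific, omitted)
        cuFloatComplex H[10][10], V[10][10];
        float w[10];
        // ... fill H from F, h and the model parameters ...

        // 3) diagonalize with a single thread; for a fixed 10x10 matrix this is cheap
        heev10(H, w, V);

        // 4) pick an eigenvector according to the Monte-Carlo rule and update the state
        // s[sampled] = ...;
    }
}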

UPD-1 As mentioned in the comment section: for each iteration we only need to diagonalize one 10x10 complex Hermitian matrix, which depends on the total calculated interaction function f. In general, we are not allowed to compute a new sum of f before we have updated the state of the sampled object based on the eigenvectors and eigenvalues of the 10x10 matrix.

Due to the stochastic nature of the Monte-Carlo approach, we need all 10 eigenvectors to pick a new state for the sampled object.

However, the double-buffering idea suggested in the comments could work out if we calculate the total sum of f for the next, j-th, iteration without the contribution of the i-th sampled object and then add that contribution later. I need to test it carefully in action...
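
As a sketch of the idea in the notation above: for the next sampled object j, F_j = [sum over k != i, j of f(r_k - r_j, s_k)] + f(r_i - r_j, s_i_new). The large first sum does not depend on the pending update of object i and can be computed concurrently; only the single correction term has to wait for the new state s_i_new.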

UPD-2 The specs are:

  • CPU: 4-core Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  • GPU: GTX960

Quite outdated, but I might get access to a better system. However, switching to a GTX1660 SUPER did not affect the performance, which means that the PCIe bus is the bottleneck. ;)

The question is: is it worth moving the logic of the eigenvalue algorithm into a CUDA kernel?

It depends on the system. Old CPU + new GPU? Both new? Both old?

Generally, a single CUDA thread is a lot slower than a single CPU thread, because the CUDA compiler does not vectorize the loops within a thread while the host C++ compiler does. So you need to use 10-100 CUDA threads to make the comparison fair.

For the optimizations:

According to the image, it currently loses about 1 microsecond per step as a serial part of the overall algorithm. 1 microsecond is not much compared to the usual kernel-launch latency from the CPU, but it is big when the GPU is launching the kernel itself (dynamic parallelism).

The CUDA graph feature lets the whole sequence of steps (kernels) be re-launched automatically and complete quicker if the steps do not depend on the CPU. It is intended for "graph"-like workloads where one kernel leads to multiple kernels that later join in another kernel, and so on.
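
For completeness, a minimal sketch of capturing such a fixed per-step kernel sequence into a CUDA graph via stream capture and replaying it. The two step kernels are hypothetical stand-ins; note that anything changing between steps (e.g. the sampled index) has to be read from device memory or patched via cudaGraphExecKernelNodeSetParams, because captured kernel arguments are baked into the graph:

// Hypothetical per-step kernels; in a replayed graph they must read per-step data
// (e.g. the sampled index) from device memory rather than from kernel arguments.
__global__ void stepInteractionAndReduce();
__global__ void stepEigensolveAndUpdate();

cudaStream_t stream;
cudaStreamCreate(&stream);

// Record one iteration's kernel sequence into a graph.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
stepInteractionAndReduce<<<blocksPerGrid, THREADS, 0, stream>>>();
stepEigensolveAndUpdate<<<1, 32, 0, stream>>>();
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
// CUDA 11 signature; CUDA 12 uses cudaGraphInstantiate(&graphExec, graph, 0).
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay the whole iteration with a single launch call per step.
for (int step = 0; step < a_lot; ++step) {
    cudaGraphLaunch(graphExec, stream);
}
cudaStreamSynchronize(stream);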

The CUDA dynamic parallelism feature lets a kernel's CUDA threads launch new kernels. This has much better launch timings than launching from the CPU, because there is no waiting for synchronization between the driver and the host.

The copying for the sampling part could be done in chunks of 100-1000 elements at a time, consumed by the CUDA side over 100-1000 steps, if all parts run in CUDA.

If I were to write it, I would do it like this:

  • launch a parent loop kernel (only 1 CUDA thread)

  • start a loop inside that kernel

  • do the real (child) kernel launches within that loop

  • since every iteration has to be serial, sync before continuing with the next iteration

  • end the parent after a chunk of 100-1000 steps is complete and get new random data from the CPU (a sketch of this pattern follows below)
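
A rough sketch of that parent-kernel pattern. The child kernels are hypothetical stand-ins for the interaction+reduction step and the eigensolve/state-update step, and the device-side cudaDeviceSynchronize() shown here is the pre-CUDA-12 dynamic-parallelism model (CUDA 12 removes it in favour of tail launches):

// Hypothetical child kernels (declarations only, for the sketch):
__global__ void calcAndReduceChild(const float4* r, const float4* s, float4* sum,
                                   int obj, int N);
__global__ void eigenUpdateChild(const float4* sum, int numBlocks, float4* s, int obj);

// Parent kernel: one thread drives a whole chunk of Monte-Carlo steps on the GPU.
// Requires relocatable device code (nvcc -rdc=true) and linking against cudadevrt.
__global__ void mcChunkParent(const float4* d_r, float4* d_s, float4* d_sum,
                              const int* d_sampledIdx, int chunkSize,
                              int numBlocks, int threads, int N) {
    for (int step = 0; step < chunkSize; ++step) {
        int obj = d_sampledIdx[step];   // random indices pre-generated on the host

        calcAndReduceChild<<<numBlocks, threads>>>(d_r, d_s, d_sum, obj, N);
        cudaDeviceSynchronize();        // wait for the reduction (pre-CUDA-12 CDP)

        eigenUpdateChild<<<1, 32>>>(d_sum, numBlocks, d_s, obj);
        cudaDeviceSynchronize();        // serialize before the next step
    }
}

// Host side: one launch per chunk of e.g. 1000 steps instead of one per step, e.g.
// mcChunkParent<<<1, 1>>>(d_r, d_s, d_sum, d_sampledIdx, 1000, blocksPerGrid, THREADS, N);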

When the parent kernel ends, it shows up in the profiler as a single long-running kernel launch, without any CPU-side inefficiencies.

On top of the time saved by not synchronizing so often, the performance of the 10x10 matrix part and of the other kernel part would be consistent, because they always run on the same hardware rather than on a different CPU and GPU.

Since random-number generation is always an input to the system, it can at least be double-buffered to hide the CPU-to-GPU copy latency behind the computation. IIRC, random-number generation is much cheaper than sending data over the PCIe bridge, so this would mostly hide the slowness of the data transfer.
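
A minimal sketch of that double buffering for the random indices: pinned host buffers plus cudaMemcpyAsync on two streams, so the upload of chunk k+1 overlaps with the GPU consuming chunk k (random_sampler, rng, a_lot, blocksPerGrid and THREADS are from the code above; the consuming kernel is only indicated):

const int CHUNK = 1000;
const int numChunks = a_lot / CHUNK;
int *h_idx[2], *d_idx[2];
cudaStream_t streams[2];
for (int b = 0; b < 2; ++b) {
    cudaMallocHost(&h_idx[b], CHUNK * sizeof(int));   // pinned, required for true async copies
    cudaMalloc(&d_idx[b], CHUNK * sizeof(int));
    cudaStreamCreate(&streams[b]);
}

auto fillChunk = [&](int buf) {
    for (int k = 0; k < CHUNK; ++k) h_idx[buf][k] = (int)random_sampler(rng);
    cudaMemcpyAsync(d_idx[buf], h_idx[buf], CHUNK * sizeof(int),
                    cudaMemcpyHostToDevice, streams[buf]);
};

fillChunk(0);
for (int chunk = 0, buf = 0; chunk < numChunks; ++chunk, buf ^= 1) {
    if (chunk + 1 < numChunks) fillChunk(buf ^ 1);   // prepare the next buffer early
    cudaStreamSynchronize(streams[buf]);             // indices for this chunk are on the GPU
    // consume d_idx[buf] for CHUNK steps, e.g. with a chunked/parent kernel:
    // mcChunkParent<<<1, 1, 0, streams[buf]>>>(d_r, d_s, d_sum, d_idx[buf], CHUNK, blocksPerGrid, THREADS, N);
}
cudaDeviceSynchronize();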


If it is a massively parallel experiment, like running the executable N times, you can still launch around 10 executable instances at once and let them keep the GPU busy with good efficiency. This is not practical if too much memory is required per instance. Most GPUs, except very old ones, can run tens of kernels in parallel as long as none of them fully occupies all of the GPU's resources.
