
Mathematica/CUDA reduce execution time

I'm writing a simple Monte Carlo simulation for particle transport. My approach is to write a kernel for CUDA and execute it as a Mathematica function.

Kernel:

#include "curand_kernel.h"
#include "math.h"

extern "C" __global__ void monteCarlo(Real_t *transmission, mint seed, mint pathN) {
curandState rngState;

int index = threadIdx.x + blockIdx.x*blockDim.x;

curand_init(seed, index, 0, &rngState);

if (index < pathN) {
    //-------------start one packet run----------------------

    float packetWeight = 1.0;
    int m = 0;

    while(packetWeight > 0.0){

        //MONTE CARLO CODE

        // Test: still in the sample?
            if(z_coordinate > sampleThickness){
                packetWeight = 0;
                z_coordinate = sampleThickness;
                transmission[index]=1;
            }
        }
    }
    //-------------end one packet run------------------------
}
}

Mathematica code:

Needs["CUDALink`"];
cudaBM = CUDAFunctionLoad[code, 
"monteCarlo", {{_Real, "Output"}, _Integer, _Integer}, 256, 
"UnmangleCode" -> False];


pathN = 100000;
result = 0;  (*count for transmitted particles*)
For[j = 0, j < 10, j++,
   buffer = CUDAMemoryAllocate["Float", 100000];
   cudaBM[buffer, 1490, pathN];
   resultOneRun = Total[CUDAMemoryGet[buffer]];
   result = result + resultOneRun;
];

Everything seems to work so far, but the speed improvement compared to the pure C code without CUDA is negligible. I have two problems:

  1. The curand_init() function is executed by all threads at the beginning of every simulation step -> can I call this function once for all threads?
  2. The kernel returns a very large array of reals (100 000) to Mathematica. I know that the bottleneck of CUDA is the channel bandwidth between GPU and CPU. I need only the sum of all elements of the list, so it would be more efficient to calculate the sum of the list elements on the GPU and send only one real number to the CPU.

1) If you need to execute curand_init() once for all threads, can you just do that on the CPU and pass the result as an argument to CUDA?

2) How about a "__device__ float sumTotal" function which sums and returns your values? Have you copied as much of the *transmission data as possible into a shared memory buffer?
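A minimal sketch of that idea, not the poster's code: the kernel name sumTransmission and the single-float output totalOut are made up for illustration, Real_t and mint are the CUDALink types from the question's kernel, and the block size of 256 matches the CUDAFunctionLoad call above. Each block reduces its slice of transmission in shared memory, then one thread per block atomically adds the partial sum into totalOut, so the host reads back a single number instead of the 100 000-element array.

extern "C" __global__ void sumTransmission(Real_t *transmission, float *totalOut, mint pathN) {
    __shared__ float partial[256];                     // assumes a block size of 256

    int tid   = threadIdx.x;
    int index = threadIdx.x + blockIdx.x*blockDim.x;

    // Load one value per thread, padding out-of-range threads with 0.
    partial[tid] = (index < pathN) ? (float)transmission[index] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x/2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomicAdd per block instead of one host transfer per element.
    if (tid == 0)
        atomicAdd(totalOut, partial[0]);
}

Here totalOut would be a one-element buffer that is zeroed before the launch; the transmission array then never has to leave the GPU.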

As per the CURAND docs, "Calls to curand_init() are slower than calls to curand() or curand_uniform(). Large offsets to curand_init() take more time than smaller offsets. It is much faster to save and restore random generator state than to recalculate the starting state repeatedly."
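Following that advice, here is a sketch of the save/restore pattern (the setup kernel setupRng and the states buffer are assumptions, not part of the original code; Real_t and mint are as in the question's kernel): curand_init() runs once in a separate kernel, the per-thread states live in global memory, and the Monte Carlo kernel only loads and stores them on each launch.

#include "curand_kernel.h"

extern "C" __global__ void setupRng(curandState *states, mint seed, mint pathN) {
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    if (index < pathN)
        curand_init(seed, index, 0, &states[index]);   // expensive, but done only once
}

extern "C" __global__ void monteCarlo(Real_t *transmission, curandState *states, mint pathN) {
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    if (index >= pathN) return;

    curandState rngState = states[index];              // restore previously saved state

    // ... Monte Carlo loop drawing numbers with curand_uniform(&rngState) ...

    states[index] = rngState;                          // save state for the next launch
}

The states buffer is allocated once on the GPU and passed to both kernels on every call from Mathematica, so the slow initialization is paid only on the first launch.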

http://docs.nvidia.com/cuda/curand/index.html#topic_1_3_4

Also, please look into this thread for more details: CUDA program causes nvidia driver to crash
