Mathematica/CUDA reduce execution time
I'm writing a simple Monte Carlo simulation for particle transport. My approach is to write a CUDA kernel and execute it as a Mathematica function.
Kernel:
#include "curand_kernel.h"
#include "math.h"

extern "C" __global__ void monteCarlo(Real_t *transmission, mint seed, mint pathN) {
    curandState rngState;
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    curand_init(seed, index, 0, &rngState);
    if (index < pathN) {
        //-------------start one packet run----------------------
        float packetWeight = 1.0;
        int m = 0;
        while(packetWeight > 0.0){
            //MONTE CARLO CODE
            // Test: still in the sample?
            if(z_coordinate > sampleThickness){
                packetWeight = 0;
                z_coordinate = sampleThickness;
                transmission[index]=1;
            }
        }
        //-------------end one packet run------------------------
    }
}
Mathematica code:
Needs["CUDALink`"];
cudaBM = CUDAFunctionLoad[code,
   "monteCarlo", {{_Real, "Output"}, _Integer, _Integer}, 256,
   "UnmangleCode" -> False];
pathN = 100000;
result = 0; (*count for transmitted particles*)
For[j = 0, j < 10, j++,
  buffer = CUDAMemoryAllocate["Float", 100000];
  cudaBM[buffer, 1490, pathN];
  resultOneRun = Total[CUDAMemoryGet[buffer]];
  result = result + resultOneRun;
  ];
Everything seems to work so far, but the speed improvement compared to the pure C code without CUDA is negligible. I have two problems:
1) If curand_init() only needs to run once for all threads, can you do that on the CPU and pass the result to the CUDA kernel as an argument?
2) How about a "__device__ float sumTotal" function that sums and returns your values? Have you copied as much of the *transmission data as possible into a shared memory buffer?
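To make the second suggestion concrete, here is a minimal sketch of summing the per-path flags on the device with a shared memory buffer, so that only one number has to come back to the host. It is written against the plain CUDA runtime rather than CUDALink, and the names sumTransmission and sumOnDevice are made up for illustration; they are not from the original post. It assumes a block size that is a power of two and a GPU that supports atomicAdd on float. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds up transmission[0..pathN-1] on the device.
__global__ void sumTransmission(const float *transmission, int pathN, float *total)
{
    extern __shared__ float partial[];   // one slot per thread in the block
    int tid = threadIdx.x;
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    // Threads past the end of the array contribute zero.
    partial[tid] = (index < pathN) ? transmission[index] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // One atomic add per block instead of one per thread.
    if (tid == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host helper: copies the flags to the GPU and returns their sum.
float sumOnDevice(const float *hostData, int pathN)
{
    float *dData = nullptr;
    float *dTotal = nullptr;
    float total = 0.0f;

    cudaMalloc(&dData, pathN * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dData, hostData, pathN * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));

    const int threads = 256;
    const int blocks = (pathN + threads - 1) / threads;
    // Third launch parameter: bytes of dynamic shared memory per block.
    sumTransmission<<<blocks, threads, threads * sizeof(float)>>>(dData, pathN, dTotal);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dTotal);
    return total;
}
```

In the simulation itself the reduction could run inside the Monte Carlo kernel, so the full transmission array would never need to be copied back just to be totalled on the host.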
As per the CURAND docs, "Calls to curand_init() are slower than calls to curand() or curand_uniform(). Large offsets to curand_init() take more time than smaller offsets. It is much faster to save and restore random generator state than to recalculate the starting state repeatedly."
http://docs.nvidia.com/cuda/curand/index.html#topic_1_3_4
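The save-and-restore pattern from that quote might look roughly like the sketch below: one setup kernel calls curand_init() once per thread, and every later launch loads the stored state, draws from it, and writes it back. The kernel and helper names (setupStates, drawUniform, meanOverLaunches) are hypothetical, the single curand_uniform() call stands in for the real Monte Carlo step, and the host side uses the plain CUDA runtime rather than CUDALink. Error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Run once: each thread initialises its own generator state.
__global__ void setupStates(curandState *states, unsigned long long seed, int pathN)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < pathN) {
        // Same seed, a distinct subsequence per thread, zero offset.
        curand_init(seed, index, 0, &states[index]);
    }
}

// Run many times: restore the state, draw from it, save it again.
__global__ void drawUniform(curandState *states, float *out, int pathN)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < pathN) {
        curandState local = states[index];    // restore into a local copy
        out[index] = curand_uniform(&local);  // stand-in for the Monte Carlo step
        states[index] = local;                // save for the next launch
    }
}

// Hypothetical host helper: initialises the states once, launches the
// drawing kernel `launches` times, and returns the mean of all samples.
double meanOverLaunches(int pathN, int launches, unsigned long long seed)
{
    const int threads = 256;
    const int blocks = (pathN + threads - 1) / threads;

    curandState *dStates = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dStates, pathN * sizeof(curandState));
    cudaMalloc(&dOut, pathN * sizeof(float));

    // The expensive curand_init() calls happen only here.
    setupStates<<<blocks, threads>>>(dStates, seed, pathN);

    std::vector<float> host(pathN);
    double sum = 0.0;
    for (int run = 0; run < launches; ++run) {
        drawUniform<<<blocks, threads>>>(dStates, dOut, pathN);
        cudaMemcpy(host.data(), dOut, pathN * sizeof(float), cudaMemcpyDeviceToHost);
        for (float v : host) {
            sum += v;
        }
    }

    cudaFree(dStates);
    cudaFree(dOut);
    return sum / (static_cast<double>(pathN) * launches);
}
```

Because the states live in device memory between launches, each run continues the random sequence where the previous one stopped, instead of replaying the same numbers from the same seed.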
Also, please see this thread for more details: "CUDA program causes nvidia driver to crash".