
Optimizing a CUDA kernel with irregular memory accesses

I have the following CUDA kernel which seems very "tough" to optimize:

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai )
{
    for(int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx+=blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n-1);
        d_origx_remap[idx] = d_origx[index];
    }
}

// Parameters were defined before
int permute[loops] = {29165143, 3831769, 17603771, 9301169, 32350975, ...};
int n = 33554432;
int filter_size = 1783157;

for (int i = 0; i < loops; i++)
{
    DataLayoutTransformKernel<<<dimGrid, dimBlock, 0, stream[i]>>>(
        (cuDoubleComplex*)d_origx,
        (cuDoubleComplex*)d_origx_remap + i * filter_size,
        n, filter_size, permute[i]);
}

The purpose of the kernel is to reorder the data layout of d_origx[] from irregular to regular (d_origx_remap). The kernel is launched several times with different access strides (ai).

The challenge here is the irregular memory access pattern when referencing d_origx[index]. My idea was to use shared memory, but for this case it seems very hard to use shared memory to coalesce the global memory accesses.

Does anyone have suggestions on how to optimize this kernel?

The Trove library is a CUDA/C++ library with support for array-of-structures (AoS) accesses, and likely gives close to optimal performance for random AoS access. From the GitHub page it looks like Trove gets about a 2x speedup over the naive approach for 16-byte structures.

https://github.com/BryanCatanzaro/trove

(Chart from the Trove page: random access performance using Trove compared with the naive direct access approach.)
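Applied to this question's kernel, usage would look roughly like the sketch below. This is an assumption-laden illustration based on the `trove::coalesced_ptr` wrapper shown in the repository's README, not tested code; check the repo for the exact API and for the size constraints on the element type (cuDoubleComplex is 16 bytes, which the README's benchmarks cover):

```cuda
#include <trove/ptr.h>

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap,
                                          int n, int filter_size, int ai)
{
    // Hypothetical adaptation: wrap the raw pointers so Trove can
    // stage the 16-byte loads/stores through coalesced transactions.
    trove::coalesced_ptr<cuDoubleComplex> src(d_origx);
    trove::coalesced_ptr<cuDoubleComplex> dst(d_origx_remap);

    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < filter_size;
         idx += blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n - 1);
        dst[idx] = src[index];
    }
}
```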

I'm not sure you can do much to optimize your code.

There is no thread cooperation at all, so I would say that shared memory is not the way to go.

You may try changing

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai)

to

__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx, cuDoubleComplex* __restrict__ d_origx_remap, const int n, const int filter_size, const int ai)

i.e., using the const and __restrict__ keywords. In particular, __restrict__ enables nvcc to perform some optimizations; see Section B.2 of the CUDA C Programming Guide. On the Kepler architecture, loads from pointers marked const __restrict__ may be tagged by the compiler to go through the Read-Only Data Cache; see the Kepler architecture whitepaper.
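On devices of compute capability 3.5 and above, the read-only data cache path can also be requested explicitly with the `__ldg()` intrinsic (which has an overload for double2, the underlying type of cuDoubleComplex). A sketch of the kernel using it; the logic is unchanged, only the load path differs:

```cuda
__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx,
                                          cuDoubleComplex* __restrict__ d_origx_remap,
                                          const int n, const int filter_size, const int ai)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < filter_size;
         idx += blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n - 1);
        // __ldg() routes the load through the read-only (texture) data cache,
        // which can help with scattered reads that the L1 path handles poorly.
        d_origx_remap[idx] = __ldg(&d_origx[index]);
    }
}
```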
