Optimizing a CUDA kernel with irregular memory accesses

Question

I have the following CUDA kernel, which seems very "tough" to optimize:

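__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai )
{
    // Grid-stride loop: each thread handles idx, idx + gridSize, idx + 2*gridSize, ...
    for(int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx+=blockDim.x * gridDim.x)
    {
        // The mask acts as "mod n" only because n is a power of two.
        int index = (idx * ai) & (n-1);
        // Coalesced write, scattered (strided) read.
        d_origx_remap[idx] = d_origx[index];
    }
}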

//Parameters were defined before
int permute[loops] = {29165143,3831769,17603771,9301169,32350975, ...};
int n = 33554432;
int filter_size = 1783157;

for(int i=0; i<loops; i++)
{
    DataLayoutTransformKernel<<<dimGrid, dimBlock, 0, stream[i]>>>((cuDoubleComplex*) d_origx,(cuDoubleComplex*)d_origx_remap+i*filter_size, n, filter_size, permute[i]);

}

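The purpose of the kernel is to reorder the data layout of d_origx[] from irregular to regular (d_origx_remap). The kernel is launched several times, each time with a different access stride (ai).

The challenge here is the irregular memory access pattern when reading d_origx[index]. My idea was to use shared memory, but in this case it seems very hard to use shared memory to coalesce the global memory accesses.

Does anyone have suggestions on how to optimize this kernel?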
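To make the irregular access pattern described above concrete, here is a minimal host-only sketch that counts how many distinct memory segments one warp reads for a given stride. The function name segments_touched_by_warp and the 128-byte segment size are assumptions made for illustration; the real transaction granularity depends on the GPU architecture and cache configuration. The 16-byte element size matches cuDoubleComplex.

```cuda
#include <cstdint>
#include <set>

// Hypothetical helper (not part of the original post): counts the distinct
// memory segments read by one warp executing
//     d_origx[(idx * ai) & (n - 1)]
// for 32 consecutive values of idx.
// elem_bytes = 16 matches sizeof(cuDoubleComplex).
// segment_bytes = 128 is an assumed transaction size.
int segments_touched_by_warp(int warp_id, int ai, int n,
                             int elem_bytes = 16, int segment_bytes = 128)
{
    // n must be a power of two for the mask to act as "mod n".
    const std::uint32_t mask = static_cast<std::uint32_t>(n) - 1u;
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < 32; ++lane) {
        const std::uint32_t idx = static_cast<std::uint32_t>(warp_id) * 32u + lane;
        // Unsigned multiply wraps modulo 2^32, so the masked low bits are well defined.
        const std::uint32_t index = (idx * static_cast<std::uint32_t>(ai)) & mask;
        // Byte offset of the element divided by the segment size gives the segment id.
        segments.insert(static_cast<std::uint64_t>(index) * elem_bytes / segment_bytes);
    }
    return static_cast<int>(segments.size());
}
```

With a unit stride (ai = 1) a warp covers 32 * 16 = 512 contiguous bytes, that is, 4 segments under these assumptions. With the first stride from the question (ai = 29165143, n = 33554432) every lane lands in its own segment, so the warp touches 32 segments and uses only 16 bytes of each.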

Answer 1

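The Trove library is a CUDA/C++ library with array-of-structures (AoS) support, and it likely gives close to optimal performance for random AoS access. From the GitHub page, it looks like Trove achieves about a 2x speedup over the naive approach for 16-byte structures.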

https://github.com/BryanCatanzaro/trove

[Figure: random access performance using Trove compared with the naive direct-access approach]

Answer 2

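I'm not sure how much you can do to optimize your code.

There is no thread cooperation at all, so I would say that shared memory is not the way to go.

You may try changing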

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai)

to

__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx, cuDoubleComplex* __restrict__ d_origx_remap, const int n, const int filter_size, const int ai)

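that is, use the const and __restrict__ keywords. In particular, __restrict__ enables nvcc to perform some optimizations; see Section B.2 of the CUDA C Programming Guide. On the Kepler architecture, loads through pointers qualified with const and __restrict__ can be tagged by the compiler to go through the read-only data cache; see the Kepler architecture whitepaper.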
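As a minimal sketch of the suggestion above, the kernel below applies the const and __restrict__ qualifiers to the question's gather loop. The kernel name DataLayoutTransformKernelRestrict, the helper run_restrict_check, and the 64 x 256 launch configuration are hypothetical and chosen only for illustration. The unsigned index arithmetic is an extra assumption on my part, added to avoid signed overflow in idx * ai; it is not part of the original answer. Whether the qualifiers actually improve performance depends on the device and compiler, so this only checks correctness.

```cuda
#include <vector>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Same gather as the question's kernel, with the suggested qualifiers.
__global__ void DataLayoutTransformKernelRestrict(
    const cuDoubleComplex* __restrict__ d_origx,
    cuDoubleComplex* __restrict__ d_origx_remap,
    const int n, const int filter_size, const int ai)
{
    // n is assumed to be a power of two, as in the question.
    const unsigned int mask = static_cast<unsigned int>(n) - 1u;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size;
         idx += blockDim.x * gridDim.x)
    {
        // Unsigned multiply (assumption): wraps modulo 2^32, masked bits stay well defined.
        const unsigned int index =
            (static_cast<unsigned int>(idx) * static_cast<unsigned int>(ai)) & mask;
        d_origx_remap[idx] = d_origx[index];
    }
}

// Hypothetical harness: returns the number of mismatches against a CPU
// reference, or -1 if any CUDA call fails.
int run_restrict_check(int n, int filter_size, int ai)
{
    std::vector<cuDoubleComplex> h_in(n), h_out(filter_size);
    for (int i = 0; i < n; ++i) h_in[i] = make_cuDoubleComplex(i, -i);

    cuDoubleComplex* d_in = nullptr;
    cuDoubleComplex* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(cuDoubleComplex)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, filter_size * sizeof(cuDoubleComplex)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), n * sizeof(cuDoubleComplex),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        DataLayoutTransformKernelRestrict<<<64, 256>>>(d_in, d_out, n, filter_size, ai);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess) {
        // The blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(h_out.data(), d_out, filter_size * sizeof(cuDoubleComplex),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Compare against the same gather computed on the host.
    const unsigned int mask = static_cast<unsigned int>(n) - 1u;
    int mismatches = 0;
    for (int idx = 0; idx < filter_size; ++idx) {
        const unsigned int index =
            (static_cast<unsigned int>(idx) * static_cast<unsigned int>(ai)) & mask;
        if (cuCreal(h_out[idx]) != cuCreal(h_in[index]) ||
            cuCimag(h_out[idx]) != cuCimag(h_in[index])) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling run_restrict_check(1 << 20, 1000, 3831769) should report zero mismatches on a working device. Timing the two kernel variants with a profiler would be the way to see whether the read-only cache path helps for a particular stride.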

2 solutions

Solution 1
5 2013-12-11 06:54:48

Solution 2
1 2013-12-11 18:29:07
