Assigning a value runs extremely slowly in CUDA 11.6

Question

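I am testing a custom Z-buffer kernel in CUDA. In short: it checks whether X points lie inside Y polygons and returns, for each point, the ID of the frontmost polygon. The parallel part is that each point is tested against all polygons.

The whole process runs without errors; it does transfer the data to the device and returns the correct result to the host.

However, I found that the last line, values[i] = val;, takes a huge amount of time.

This is probably a silly question. I believe the way I assign the value in the kernel is wrong. Could you suggest the right way to do this?

Many thanks!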

To better understand the data structures in the kernel:

float* position_dist stores:

First, the x, y, z of all test points, in the order point0, point1, ... , point(X-1)
Then poly0.p0, dist0, poly0.p1, dist0, poly0.p2, dist0, poly0.p3, dist0, ... , poly(Y-1).p0, dist(Y-1), poly(Y-1).p1, dist(Y-1), poly(Y-1).p2, dist(Y-1), poly(Y-1).p3, dist(Y-1). So each polygon takes up 16 values at its offset.

int* values stores: -1 by default. It is updated to the polyID and returned to the host.
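The flat layout described above can be made explicit with a few index helpers. This is only a sketch: the helper names and the two constants are hypothetical and do not appear in the original code. It assumes the point section holds 3 floats per point and each polygon holds 4 vertices of (x, y, z, dist), which is what the description and the kernel's indexing suggest.

```cuda
#include <cassert>
#include <cstddef>

// Hypothetical constants describing the layout of position_dist.
constexpr int FLOATS_PER_POINT = 3;   // x, y, z
constexpr int FLOATS_PER_POLY  = 16;  // 4 vertices * (x, y, z, dist)

// Offset of the x component of test point `pointIdx`.
__host__ __device__ inline size_t pointOffset(size_t pointIdx)
{
    return pointIdx * FLOATS_PER_POINT;
}

// Offset of the x component of vertex `vertex` (0..3) of polygon `polyIdx`.
// `numPoints` is X, the number of test points stored before the polygons.
__host__ __device__ inline size_t polyVertexOffset(size_t numPoints, size_t polyIdx, int vertex)
{
    assert(vertex >= 0 && vertex < 4);
    return numPoints * FLOATS_PER_POINT + polyIdx * FLOATS_PER_POLY + (size_t)vertex * 4;
}

// Offset of the polygon's distance, which is repeated after every vertex;
// this returns the copy stored after vertex 0.
__host__ __device__ inline size_t polyDistOffset(size_t numPoints, size_t polyIdx)
{
    return polyVertexOffset(numPoints, polyIdx, 0) + 3;
}
```

With these definitions, the numPt argument of the kernel would correspond to X * FLOATS_PER_POINT and numPositions to Y * FLOATS_PER_POLY, if the reading of the layout above is right.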

CUDA_GLOBAL void computeOcclusion_kernel(float* position_dist, int* values, int numPt, int numPositions)
{
    uint i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < numPt && i % 3 == 0)
    {
        Point pt(position_dist[i + 0], position_dist[i + 1], position_dist[i + 2]);

        uint offset = numPt - i;

        float dist = 10000000;
        int val = -1;
        for (int o = i; o < i + numPositions; o += 16)
        {
            int j = o + offset;
            int polyID = (j - numPt) / 16;

            Point p0(position_dist[j + 0], position_dist[j + 1], position_dist[j + 2]);
            Point p1(position_dist[j + 4], position_dist[j + 5], position_dist[j + 6]);
            Point p2(position_dist[j + 8], position_dist[j + 9], position_dist[j + 10]);
            Point p3(position_dist[j + 12], position_dist[j + 13], position_dist[j + 14]);

            if (position_dist[j + 3] < dist)
            {
                if(inPoly(pt,p0,p1,p2,p3))
                {
                    val = polyID;
                    dist = position_dist[j + 3];
                }
            }
        }
        values[i] = val;
    }
}

Answer 1

Thanks to 阿金 for the comment. I restructured the kernel to use contiguous memory locations and dynamic shared memory.

It works fine now.

CUDA_GLOBAL void computeOcclusion_kernel(float* position_dist, int* values, int numPt, int numPositions)
{
    extern __shared__ int shared[];

    for (uint i = blockIdx.x * blockDim.x + threadIdx.x;
        i < numPt/ 3;
        i += blockDim.x * gridDim.x)
    {
        uint t = i * 3;
        Point pt(position_dist[t + 0], position_dist[t + 1], position_dist[t + 2]);

        uint offset = numPt - t;

        float dist = 10000000;
        shared[i] = -1;

        for (int o = t; o < t + numPositions; o += 16)
        {
            int j = o + offset;
            int polyID = (j - numPt) / 16;

            Point p0(position_dist[j + 0], position_dist[j + 1], position_dist[j + 2]);
            Point p1(position_dist[j + 4], position_dist[j + 5], position_dist[j + 6]);
            Point p2(position_dist[j + 8], position_dist[j + 9], position_dist[j + 10]);
            Point p3(position_dist[j + 12], position_dist[j + 13], position_dist[j + 14]);

            if (position_dist[j + 3] < dist)
            {
                if(inPoly(pt,p0,p1,p2,p3))
                {
                    shared[i] = polyID;
                    dist = position_dist[j + 3];
                }
            }
        }
        __syncthreads();
        values[i] = shared[i];
    }
}

The dynamic shared memory size also has to be declared in the kernel launch.

computeOcclusion_kernel <<< grid, block, NUM_BUFFER/3*sizeof(int) >>>...

However, as far as I understand, the dynamic shared memory size can be at most 48 KB. In that case my buffer can hold only 48 KB / sizeof(int) = 12,288 values.
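The per-block limit can be read from the device at run time instead of being assumed. The sketch below uses cudaDeviceGetAttribute with cudaDevAttrMaxSharedMemoryPerBlock; the function name maxSharedInts is hypothetical. Some architectures allow a kernel to opt in to more dynamic shared memory through cudaFuncSetAttribute, but that is still far below the megabyte range.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of ints that fit into the default per-block
// shared memory of `device`. Returns 0 if the query fails.
size_t maxSharedInts(int device)
{
    int bytes = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &bytes, cudaDevAttrMaxSharedMemoryPerBlock, device);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
        return 0;
    }
    return static_cast<size_t>(bytes) / sizeof(int);
}
```

Comparing the returned count with the number of entries the launch asks for, before launching, makes it obvious when the request cannot fit.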

That is far less than I need. I need something like: buffer_X * buffer_Y * 4 = 4 MB.

I will look into alternative strategies for doing this.
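One possible alternative, offered only as a sketch, avoids shared memory altogether: each thread keeps its current best polygon ID in a local variable and writes it to global memory once per point, so the result buffer is limited by global memory rather than by the 48 KB per-block budget. The names computeOcclusionSketch, inPolyPlaceholder, numPoints and numPolys are assumptions and not part of the original code. In particular, the thread does not show inPoly, so the stand-in below is a plain x/y bounding-box check and would have to be replaced by the real test. The sketch also assumes values holds one int per point.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

struct Point { float x, y, z; };

// Stand-in for the post's inPoly(), which is not shown in the thread.
// This version only checks the x/y bounding box of the four vertices.
__device__ bool inPolyPlaceholder(const Point& pt, const Point& p0, const Point& p1,
                                  const Point& p2, const Point& p3)
{
    float minX = fminf(fminf(p0.x, p1.x), fminf(p2.x, p3.x));
    float maxX = fmaxf(fmaxf(p0.x, p1.x), fmaxf(p2.x, p3.x));
    float minY = fminf(fminf(p0.y, p1.y), fminf(p2.y, p3.y));
    float maxY = fmaxf(fmaxf(p0.y, p1.y), fmaxf(p2.y, p3.y));
    return pt.x >= minX && pt.x <= maxX && pt.y >= minY && pt.y <= maxY;
}

// numPoints = X and numPolys = Y (counts of points and polygons, not floats).
__global__ void computeOcclusionSketch(const float* __restrict__ position_dist,
                                       int* values, int numPoints, int numPolys)
{
    // Polygons start right after the 3 * numPoints floats of the point section.
    const float* polys = position_dist + 3 * (size_t)numPoints;

    // Grid-stride loop: consecutive threads handle consecutive points,
    // so the final stores to values[] are contiguous within a warp.
    for (int p = blockIdx.x * blockDim.x + threadIdx.x;
         p < numPoints;
         p += blockDim.x * gridDim.x)
    {
        size_t base = 3 * (size_t)p;
        Point pt{position_dist[base], position_dist[base + 1], position_dist[base + 2]};

        float best = 10000000.0f;  // same sentinel distance as the original kernel
        int val = -1;              // stays -1 if no polygon contains the point

        for (int k = 0; k < numPolys; ++k)
        {
            const float* q = polys + 16 * (size_t)k;
            float d = q[3];        // polygon distance, stored after each vertex
            if (d < best)
            {
                Point p0{q[0], q[1], q[2]};
                Point p1{q[4], q[5], q[6]};
                Point p2{q[8], q[9], q[10]};
                Point p3{q[12], q[13], q[14]};
                if (inPolyPlaceholder(pt, p0, p1, p2, p3))
                {
                    val = k;
                    best = d;
                }
            }
        }
        values[p] = val;           // one global store per point
    }
}
```

A launch would then look like computeOcclusionSketch<<<grid, block>>>(d_position_dist, d_values, X, Y); with no third launch parameter, since no dynamic shared memory is requested. The device pointer names here are placeholders.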
