CUDA: repeated matrix multiplication inside the kernel code

The matrix multiplication function:

__global__ void gpu_matrix_mult(float *a, float *b, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * k + col];
        }
        c[row * k + col] = sum;
    }
}

The function is then called in the following loop:

int currentActivityCount = -1;

while (activityCount != currentActivityCount)
{
    if (currentActivityCount > -1)
    {
        cudaMemcpy(d_b, h_b_new, sizeof(float)*m*k, cudaMemcpyHostToDevice);
    }

    gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);

    cudaMemcpy(h_c, d_c, sizeof(float)*m*k, cudaMemcpyDeviceToHost);

    currentActivityCount = activityCount;
    activityCount = 0;

    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            if (h_c[i*k + j] >= 0.5)
            {
                activityCount++;

                h_b_new[i * k + j] = 1;
            }
            else
            {
                h_b_new[i * k + j] = 0;
            }
        }
    }

    during++;
    printf("Count of activity: %d During: %d\n", activityCount, during);
}

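My goal is to move this loop into the "gpu_matrix_mult" function so that data is transferred to and from the GPU only twice, once before the kernel call and once after it, rather than on every loop iteration. I have tried several approaches, but none of them worked. Is such a solution feasible?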

You could do something like this in the kernel:

__device__ int activityCount;
__global__ void gpu_matrix_mult(float *a, float *b0, float *b1, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b0[i * k + col];
        }
        c[row * k + col] = sum;
        if (sum >= 0.5f)
        {
            atomicAdd(&activityCount, 1);
            b1[row * k + col] = 1;  // thresholded value for this thread's element
        }
        else
        {
            b1[row * k + col] = 0;
        }
    }
}

// .............


int currentActivityCount = -1;
int activityCount_h = 0;
while (activityCount_h != currentActivityCount)
{
    if (currentActivityCount > -1)
    {
        float *tmp = d_b0;
        d_b0 = d_b1;
        d_b1 = tmp;
    }
    currentActivityCount = activityCount_h;
    activityCount_h = 0;
    cudaMemcpyToSymbol(activityCount, &activityCount_h, sizeof(int));
    gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b0, d_b1, d_c, m, n, k);
    cudaMemcpyFromSymbol(&activityCount_h, activityCount, sizeof(int));

    during++;
    printf("Count of activity: %d During: %d\n", activityCount_h, during);
}

[Obviously never compiled or run, use at your own risk]

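That is, the inner loop that computes activityCount can run on the device, inside the kernel, right after the matrix multiplication. This requires two b matrices in GPU memory, but updating them takes only a pointer swap on the host, which costs basically nothing. The memory transfer is reduced to a single integer, copied twice per outer loop iteration (once to reset the counter, once to read it back), which will be quite fast.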
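Since the device code above is untested, a host-side reference can help when checking its results. The following is a minimal sketch in plain C++, which should also compile as host code inside a .cu file. The names `step_reference` and `run_reference` are hypothetical and do not come from the original post. It assumes m == n, because the thresholded m x k result is fed back in as the n x k matrix b, which is what the loops above imply.

```cpp
#include <utility>
#include <vector>

// One iteration on the host: c = a * b, then threshold c into b_next.
// a is m x n, b is n x k, c and b_next are m x k (all row-major).
// Returns the number of entries of c that are >= 0.5.
int step_reference(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& b_next, std::vector<float>& c,
                   int m, int n, int k)
{
    int count = 0;
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < k; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) {
                sum += a[row * n + i] * b[i * k + col];
            }
            c[row * k + col] = sum;
            const bool active = sum >= 0.5f;
            b_next[row * k + col] = active ? 1.0f : 0.0f;
            if (active) {
                ++count;
            }
        }
    }
    return count;
}

// Repeats the step until the activity count stops changing, swapping the two
// b buffers between iterations (the host analogue of the pointer swap).
// Assumption: a is n x n and b is n x k, so b_next can serve as the next b.
// Returns the final count; *iterations (if non-null) receives the step count.
int run_reference(const std::vector<float>& a, std::vector<float> b,
                  int n, int k, int* iterations)
{
    std::vector<float> b_next(b.size());
    std::vector<float> c(b.size());
    int previous = -1;
    int current = 0;
    int steps = 0;
    while (current != previous) {
        if (previous > -1) {
            std::swap(b, b_next);  // last output becomes the next input
        }
        previous = current;
        current = step_reference(a, b, b_next, c, n, n, k);
        ++steps;
    }
    if (iterations) {
        *iterations = steps;
    }
    return current;
}
```

The count returned by `run_reference` can be compared with `activityCount_h` after the GPU loop finishes. Small differences are possible for sums that land very close to 0.5, since the GPU compiler may fuse the multiply and add into a single instruction with different rounding.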
