
Sum 3D matrix CUDA

I need to do calculation like: A[x][y] = sum{from z=0 till z=n}{B[x][y][z]+C[x][y][z]}, where matrix A has dimensions [height][width] and matrix B,C has dimensions [height][width][n]. 我需要进行如下计算:A [x] [y] = sum {从z = 0到z = n} {B [x] [y] [z] + C [x] [y] [z]},其中矩阵A的尺寸为[height] [width],矩阵B,C的尺寸为[height] [width] [n]。

The values are mapped to memory with something like:

index = 0;
// x varies fastest, so the linear index is: z*width*height + y*height + x
for (z = 0; z < n; ++z)
    for (y = 0; y < width; ++y)
        for (x = 0; x < height; ++x) {
            matrix[index] = value;
            index++;
        }

Q1: Is this CUDA kernel OK?

idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;

for(z=0; z<n; z++){
    A[idx*width+idy] += B[idx*width+idy+z*width*height] + C[idx*width+idy+z*width*height];
}

Q2: Is this a faster way to do the calculation?

idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
idz = blockIdx.z*blockDim.z + threadIdx.z;

int  stride_x = blockDim.x * gridDim.x;
int  stride_y = blockDim.y * gridDim.y;
int  stride_z = blockDim.z * gridDim.z;

while ( idx < height && idy < width && idz < n ) {
    atomicAdd( &(A[idx*width+idy]), B[idx*width+idy+idz*width*height] + C[idx*width+idy+idz*width*height] );
    idx += stride_x;
    idy += stride_y;
    idz += stride_z;
} 

The first kernel is OK, but the accesses to matrices B and C are not coalesced.
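A minimal sketch of a coalesced variant (my own illustration, not the poster's code): map threadIdx.x to the fastest-varying dimension of the fill loop above, so consecutive threads in a warp read consecutive addresses, and accumulate in a register so each output element has exactly one writer. The kernel name sumZ and the host-side setup of A, B, C, height, width, n are assumptions.

// Sketch: coalesced reads, one writer per output element (no atomics needed)
__global__ void sumZ(float *A, const float *B, const float *C,
                     int height, int width, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // fastest-varying index in memory
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= height || y >= width) return;

    float acc = 0.0f;
    for (int z = 0; z < n; ++z) {
        int i = z * width * height + y * height + x; // matches the fill loop above
        acc += B[i] + C[i];
    }
    A[y * height + x] = acc;
}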

As for the second kernel: there is a data race, because more than one thread can write to the address A[idx*width+idy]. You need additional synchronization, such as atomicAdd.

As for the general question: I think only experiments will show which is better; it depends on the typical matrix sizes you have. Remember that the maximum thread block size on Fermi is 1024, so if the matrices are large you get many thread blocks, and having many thread blocks is usually slower.

It's really simple in ArrayFire:

array A = randu(nx,ny,nz);
array B = sum(A,2); // sum along 3rd dimension
print(B);
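For the question's actual B + C case, the same pattern should apply (a sketch, assuming B and C are already ArrayFire arrays of size height x width x n):

array A = sum(B + C, 2); // element-wise add, then reduce along the 3rd (z) dimension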

Q1: Test it with matrices where you know the answer.

Remark: You might run into problems with very large matrices. Use a while loop with appropriate increments. CUDA by Example is, as usual, the reference book.

An example of implementing a nested loop can be found here: For nested loops with CUDA. A while loop is implemented there; a sketch of that pattern applied to this problem follows.
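A minimal sketch of that while-loop (grid-stride) idea applied to the z-summation, under the same layout assumptions as above; the name sumZ_strided is illustrative and not taken from the linked answer. Each (x, y) output element is handled by exactly one logical thread, which also strides over additional elements when the matrix is larger than the launched grid.

// Sketch: grid-stride while loops over x and y, register accumulation over z
__global__ void sumZ_strided(float *A, const float *B, const float *C,
                             int height, int width, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int stride_x = blockDim.x * gridDim.x;

    while (x < height) {                              // stride over x if the grid is too small
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int stride_y = blockDim.y * gridDim.y;
        while (y < width) {
            float acc = 0.0f;
            for (int z = 0; z < n; ++z) {
                int i = z * width * height + y * height + x;
                acc += B[i] + C[i];
            }
            A[y * height + x] = acc;                  // still one writer per output element
            y += stride_y;
        }
        x += stride_x;
    }
}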

marina.k is right about the race condition. That would favor approach one, as atomic operations tend to slow down the code.
