
How do we access the column of a 3D array in CUDA?

    mod = SourceModule("""
    __global__ void mat_ops(float *A, float *B)
    {
        /* formula to get unique thread index */
        int thrd = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;
        B[] = A[];
    }
    """)
    func = mod.get_function("mat_ops")
    func(A_k, B_k, grid=(3,1,1), block=(4,4,1))

I have two 3D arrays, float *A and float *B, each of size 4 x 4 x 3, in this PyCUDA kernel. What I am trying to do here is traverse the 3D array column by column, instead of row by row. I am making use of a 1D grid of 2D blocks. How do I do this?

To do this, you need to describe the layout of the array in memory to the CUDA kernel, and you need the correct indexing calculations in the kernel using the host-side supplied strides. A simple way to do this is to define a small helper class in CUDA which hides the bulk of the indexing and provides a simple indexing syntax. For example:

from pycuda import driver, gpuarray
from pycuda.compiler import SourceModule
import pycuda.autoinit
import numpy as np

mod=SourceModule("""

   struct stride3D
   {
       float* p;
       int s0, s1;

       __device__
       stride3D(float* _p, int _s0, int _s1) : p(_p), s0(_s0), s1(_s1) {};

       __device__
       float operator  () (int x, int y, int z) const { return p[x*s0 + y*s1 + z]; };

       __device__
       float& operator () (int x, int y, int z) { return p[x*s0 + y*s1 + z]; };
   };

   __global__ void mat_ops(float *A, int sA0, int sA1, float *B, int sB0, int sB1)
   {
       stride3D A3D(A, sA0, sA1);
       stride3D B3D(B, sB0, sB1);

       int xidx = blockIdx.x;
       int yidx = threadIdx.x;
       int zidx = threadIdx.y;

       B3D(xidx, yidx, zidx) = A3D(xidx, yidx, zidx);
   }    
   """)

A = 1 + np.arange(0, 4*4*3, dtype=np.float32).reshape(4,4,3)
B = np.zeros((5,5,5), dtype=np.float32)
A_k = gpuarray.to_gpu(A)
B_k = gpuarray.to_gpu(B)

astrides = np.array(A.strides, dtype=np.int32) // A.itemsize
bstrides = np.array(B.strides, dtype=np.int32) // B.itemsize

func = mod.get_function("mat_ops")
func(A_k, astrides[0], astrides[1], B_k, bstrides[0], bstrides[1], grid=(4,1,1),block=(4,3,1))
print(B_k[:4,:4,:3])

Here I have chosen to make the source and destination arrays different sizes, just to show that the code is general and will work for arrays of any size as long as the block size is sufficient. Note that there is no array bounds checking on the device code side; you will need to add that for non-trivial examples.
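As a host-side illustration (not part of the original answer), the following Python sketch models the device indexing expression `p[x*s0 + y*s1 + z]` together with the kind of bounds check suggested above; the `flat_index` helper is hypothetical and only mirrors what the device-side check would do:

```python
import numpy as np

# Hypothetical host-side model of the device expression p[x*s0 + y*s1 + z],
# plus the bounds check the answer recommends adding on the device side.
def flat_index(x, y, z, s0, s1, shape):
    if not (0 <= x < shape[0] and 0 <= y < shape[1] and 0 <= z < shape[2]):
        raise IndexError("index out of bounds")
    return x * s0 + y * s1 + z

A = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
s0, s1 = (np.array(A.strides) // A.itemsize)[:2]  # element strides (12, 3)

# The stride formula recovers the same element as direct numpy indexing.
assert A.ravel()[flat_index(2, 1, 0, s0, s1, A.shape)] == A[2, 1, 0]
```

This is the same arithmetic the `stride3D` operators perform, just checked in numpy before it runs on the GPU.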

Note also that this should work correctly for both Fortran and C ordered numpy arrays, because it uses the numpy stride values directly. However, performance will be affected on the CUDA side because of memory coalescing issues.

Note: this won't work for both Fortran and C ordering without extending the helper class to take strides for all dimensions and changing the kernel to accept strides for all dimensions of the input and output arrays. From a performance perspective it would be better to write separate helper classes for Fortran and C ordered arrays.
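To see why the two-stride kernel above is tied to C ordering, a quick numpy check (illustrative only, not from the original answer) shows that a Fortran-ordered array's last-axis element stride is not 1, so the kernel's implicit `+ z` term mis-indexes; passing strides for all three dimensions works for either layout:

```python
import numpy as np

AF = np.asfortranarray(np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3))
s = np.array(AF.strides) // AF.itemsize  # element strides (1, 4, 16) in Fortran order

# The kernel above hard-codes a last-axis stride of 1; Fortran order breaks that.
assert s[2] != 1

# Using explicit strides for all three dimensions indexes correctly for any layout.
x, y, z = 2, 1, 2
mem = AF.ravel(order='K')  # elements in memory order, no copy
assert mem[x * s[0] + y * s[1] + z * s[2]] == AF[x, y, z]
```

Extending `stride3D` to carry a third stride member and use `x*s0 + y*s1 + z*s2` is the corresponding device-side change.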
