How to allocate a shared memory array in CUDA Fortran?

I'm having trouble declaring a shared memory array within a kernel. Here's the code containing my kernel:

module my_kernels

  use cudafor
  implicit none

contains

  attributes(global) subroutine mykernel(N)

    ! Declare variables
    integer :: index
    integer, intent(in), value :: N
    real,shared,dimension(N) :: shared_array  

    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x

    ! Set array element equal to index
    shared_array(index) = index

  end subroutine mykernel

end module my_kernels

And here's how I call my kernel:

program cuda

  use my_kernels
  implicit none  

  ! Set number of threads
  integer :: N = 9

  ! Invoke kernel with 3 blocks of 3 threads
  call mykernel<<<N/3,3>>>(N)

end program cuda

All of this is in one file, test.cuf. When I try to compile test.cuf with pgf90, I get this error:

PGF90-S-0000-Internal compiler error. unexpected runtime function call       0 (test.cuf: 34)
PGF90-S-0000-Internal compiler error. unsupported procedure     349 (test.cuf: 34)
  0 inform,   0 warnings,   2 severes, 0 fatal for mykernel
/tmp/pgcudaforw5MgcaFALD9p.gpu(19): error: a value of type "int" cannot be assigned to an entity of type "float *"

/tmp/pgcudaforw5MgcaFALD9p.gpu(22): error: expected an expression

2 errors detected in the compilation of "/tmp/pgnvdl7MgHLY1VOV5.nv0".
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (test.cuf: 34)
PGF90/x86-64 Linux 10.8-0: compilation aborted

In this case, line 34 refers to end subroutine mykernel. The compiler error is not very helpful; it took me a while to find out that the problem was with the shared array (I'm using this code as a simple example).

When I replace 'N' with '9' in the declaration of the shared array, so that real,shared,dimension(N) :: shared_array becomes real,shared,dimension(9) :: shared_array, the error goes away.

My question is: why is this error occurring, and how do I set the dimension of a shared array with a variable (if indeed it's possible)?

You can have more than one shared memory array, but their sizes must be known at compile time. In general, shared memory arrays should be of fixed size; the case where you pass the size in bytes at runtime is something of an exception. I guess this is all due to the limited amount of shared memory in each SM (streaming multiprocessor). In my experience developing in both CUDA C and CUDA Fortran, it is better to keep all these parameters fixed and have the kernel repeat the work as many times as needed to cover all the input data. That way it is easier to control all the parameters that affect occupancy (how well you use the physical resources of the GPU).
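
As a rough illustration of the fixed-size approach described above, here is a minimal sketch, not taken from the original post. The names fixed_shared_kernels, scale_kernel, TILE and tile_buf are hypothetical, and the sketch assumes the kernel is launched with exactly TILE threads per block. The shared array size is a compile-time constant, and each block loops over the input in grid-sized steps until all n elements are covered.

```fortran
module fixed_shared_kernels

  use cudafor
  implicit none

  ! Hypothetical compile-time tile size; assumed equal to the
  ! number of threads per block used at launch
  integer, parameter :: TILE = 128

contains

  attributes(global) subroutine scale_kernel(a, n)

    integer, value :: n
    real :: a(n)
    real, shared :: tile_buf(TILE)   ! size fixed at compile time
    integer :: i, base

    ! base is the same for every thread in a block, so all threads
    ! of a block run the same number of loop iterations
    base = (blockIdx%x - 1) * blockDim%x
    do while (base < n)
      i = base + threadIdx%x
      if (i <= n) tile_buf(threadIdx%x) = a(i)
      call syncthreads()
      if (i <= n) a(i) = 2.0 * tile_buf(threadIdx%x)
      call syncthreads()
      ! Step forward by the total number of threads in the grid
      base = base + gridDim%x * blockDim%x
    end do

  end subroutine scale_kernel

end module fixed_shared_kernels

program demo

  use fixed_shared_kernels
  implicit none

  integer, parameter :: n = 1000
  real :: a_h(n)
  real, device :: a_d(n)

  a_h = 1.0
  a_d = a_h

  ! Two blocks of TILE threads cover all n elements by looping
  call scale_kernel<<<2, TILE>>>(a_d, n)

  a_h = a_d
  print *, all(a_h == 2.0)

end program demo
```

With this layout the launch configuration can stay the same for any n; only the number of loop iterations per block changes.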

Change "dimension(N)" to "dimension(*)" and then pass the size of the shared array (in bytes) as the third argument of your kernel launch.

Hope this helps,

Mat

% cat test.cuf 
module my_kernels

  use cudafor
  implicit none

  real, dimension(:), allocatable,device :: Ad
  real, dimension(:),allocatable :: Ah

contains

  attributes(global) subroutine mykernel(N)

    ! Declare variables
    integer :: index
    integer, intent(IN), value :: N
    real,shared,dimension(*) :: shared_array  

    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x

    ! Set array element equal to index
    shared_array(index) = index

    Ad(index) = index

  end subroutine mykernel

end module my_kernels


program cuda

  use my_kernels
  implicit none  

  ! Set number of threads
  integer :: N = 9

  allocate(Ad(N), Ah(N))

  ! Invoke kernel with 3 blocks of 3 threads
  call mykernel<<<N/3,3,N*4>>>(N)

  Ah=Ad
  print *, Ah

end program cuda

% pgf90 test.cuf -V10.9 ; a.out
    1.000000        2.000000        3.000000        4.000000     
    5.000000        6.000000        7.000000        8.000000     
    9.000000 
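
The example above indexes the shared array with the global thread index, which fits only because N*4 bytes are reserved for every block. A common variant, sketched here under assumptions that are not part of the original answer, indexes the shared array with threadIdx%x and reserves one element per thread. The names per_block_kernels, reverse_in_block, buf and threads are hypothetical, and the sketch assumes a 4-byte default real and an n that is a multiple of the block size.

```fortran
module per_block_kernels

  use cudafor
  implicit none

contains

  attributes(global) subroutine reverse_in_block(a, n)

    integer, value :: n
    real :: a(n)
    ! Assumed-size shared array; its size in bytes comes from the
    ! third launch argument
    real, shared, dimension(*) :: buf
    integer :: t, i

    t = threadIdx%x
    i = (blockIdx%x - 1) * blockDim%x + t

    ! Each thread stores one element in its block's shared buffer
    if (i <= n) buf(t) = a(i)
    call syncthreads()

    ! Read the slot written by the mirrored thread of the same block
    ! (assumes n is a multiple of blockDim%x, so every slot is filled)
    if (i <= n) a(i) = buf(blockDim%x - t + 1)

  end subroutine reverse_in_block

end module per_block_kernels

program demo

  use per_block_kernels
  implicit none

  integer, parameter :: n = 9, threads = 3
  real :: a_h(n)
  real, device :: a_d(n)
  integer :: k

  a_h = [(real(k), k = 1, n)]
  a_d = a_h

  ! Third launch argument: shared memory per block in bytes,
  ! assuming 4 bytes per default real
  call reverse_in_block<<<n/threads, threads, 4*threads>>>(a_d, n)

  a_h = a_d
  print *, a_h

end program demo
```

Here each block of three values should come back reversed, and the shared memory request grows with the block size rather than with the total problem size.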
