CUDA Fortran 4D array

Question

My code is being slowed down by accesses to a 4D array in global memory.

I am using the PGI compiler, version 2010.

The 4D array is only read on the device, and its size is known only at run time.

I wanted to place it in texture memory, but found that my PGI version does not support textures. Because the size is known only at run time, constant memory is not an option either.

Only one dimension is known at compile time, as in MyFourD(100, x,y,z), where x, y, z are user input.

My first idea was to use pointers, but I am not familiar with Fortran pointers.

If you have experience dealing with such a situation, I would appreciate your help, because this alone makes my code 5 times slower than expected.

The following is a sample of what I am trying to do:

    integer :: i, j, k

    ! Zero-based global thread indices in X and Y
    i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
    j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1

    do k = 0, 100
        regvalue1 = somevalue1
        regvalue2 = somevalue2
        regvalue3 = somevalue3

        ! The fixed dimension (10, 32, 45) is the first subscript here
        d_value(i,j,k) = d_value(i,j,k)          &
            + myFourdArray(10,i,j,k)*regvalue1   &
            + myFourdArray(32,i,j,k)*regvalue2   &
            + myFourdArray(45,i,j,k)*regvalue3
    end do

Best regards,

Answer 1

I believe the answer from @Alexander Vogt is on the right track: I would think about reordering the array storage. But I would try it like this:

    integer :: i, j, k

    ! Zero-based global thread indices in X and Y
    i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
    j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1

    do k = 0, 100
        regvalue1 = somevalue1
        regvalue2 = somevalue2
        regvalue3 = somevalue3

        ! The X thread index i is now the first subscript of myFourdArray
        d_value(i,j,k) = d_value(i,j,k)          &
            + myFourdArray(i,j,k,10)*regvalue1   &
            + myFourdArray(i,j,k,32)*regvalue2   &
            + myFourdArray(i,j,k,45)*regvalue3
    end do

Note that the only change is to myFourdArray; there is no need to change the data ordering in the d_value array.
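The subscript change in the kernel only helps if the data itself is stored in the new order. A minimal host-side sketch of that reordering follows, written in standard Fortran so it runs without a GPU. The names reorder_mod and thread_first and the small sizes nx, ny, nz are assumptions for illustration, and the demo uses 1-based subscripts rather than the 0-based indices of the sample kernel.

```fortran
module reorder_mod
  implicit none
contains
  ! Returns new(i,j,k,m) = old(m,i,j,k), i.e. moves the fixed dimension last.
  function thread_first(old) result(new)
    real, intent(in) :: old(:,:,:,:)
    real, allocatable :: new(:,:,:,:)
    integer :: n(4)
    n = shape(old)
    allocate(new(n(2), n(3), n(4), n(1)))
    ! order=[4,1,2,3]: while the result is filled from old in storage order,
    ! result dimension 4 varies fastest, then dimensions 1, 2 and 3.
    new = reshape(old, shape=[n(2), n(3), n(4), n(1)], order=[4, 1, 2, 3])
  end function thread_first
end module reorder_mod

program reorder_demo
  use reorder_mod
  implicit none
  integer, parameter :: nfixed = 100             ! compile-time extent from the question
  integer, parameter :: nx = 4, ny = 3, nz = 2   ! hypothetical stand-ins for the run-time sizes
  real, allocatable :: old(:,:,:,:), new(:,:,:,:)
  integer :: m, i, j, k

  allocate(old(nfixed, nx, ny, nz), new(nx, ny, nz, nfixed))

  ! Fill with values that encode the subscripts so the check below is meaningful.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        do m = 1, nfixed
          old(m,i,j,k) = real(m + 1000*(i + 10*(j + 10*k)))
        end do
      end do
    end do
  end do

  new = thread_first(old)

  ! Verify every element landed where the reordered kernel expects it.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        do m = 1, nfixed
          if (new(i,j,k,m) /= old(m,i,j,k)) error stop 'reorder mismatch'
        end do
      end do
    end do
  end do
  print *, 'reorder verified'
end program reorder_demo
```

The reordering is done once on the host, so its cost is paid outside the kernel; the reordered array is what would then be copied to the device.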

The crux of this change is that adjacent threads now access adjacent elements in myFourdArray, which allows for coalesced access. Your original formulation forced adjacent threads to access elements separated by the length of the first dimension, and so did not allow for useful coalescing.

Whether in CUDA C or CUDA Fortran, threads are grouped in the X dimension first, then Y, then Z, so the most rapidly varying thread index is X. Therefore, in array accesses, we want this rapidly varying thread index to appear in the array subscript that also varies most rapidly in memory.

In Fortran, this is the first subscript of a multiply subscripted array.

In C, this is the last subscript of a multiply subscripted array.

Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position, but broke it for myFourdArray by putting a constant in the first array subscript position. Thus your accesses to myFourdArray are noticeably slower.

When there is a loop in the code, we also don't want to place the loop variable (k in this case) in the first subscript position for Fortran (or the last for C), as Alexander Vogt did, because that also breaks coalescing. In each iteration of the loop, multiple threads execute in lockstep, and those threads should all access adjacent elements. This is achieved by putting the subscript indexed by the X thread index (e.g. i) first for Fortran, or last for C.
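To put numbers on the three layouts discussed here, the sketch below computes how far apart in memory the elements read by threads i and i+1 are, using the column-major address formula. It is plain host Fortran; the helper offset4 and the extents nx and ny are assumptions for illustration, and nk = 101 comes from the k = 0..100 loop in the sample.

```fortran
module layout_offsets
  implicit none
contains
  ! Zero-based column-major offset, in elements, of a(s1,s2,s3,s4) for an
  ! array declared a(n1,n2,n3,n4) with 1-based subscripts (n4 is not needed).
  pure function offset4(s1, s2, s3, s4, n1, n2, n3) result(off)
    integer, intent(in) :: s1, s2, s3, s4, n1, n2, n3
    integer :: off
    off = (s1-1) + n1*((s2-1) + n2*((s3-1) + n3*(s4-1)))
  end function offset4
end module layout_offsets

program stride_demo
  use layout_offsets
  implicit none
  integer, parameter :: nfixed = 100       ! compile-time extent from the question
  integer, parameter :: nx = 64, ny = 64   ! hypothetical run-time extents
  integer, parameter :: nk = 101           ! k runs over 0..100 in the sample loop
  integer :: s_fixed_first, s_loop_first, s_thread_first

  ! Each stride is the distance between the elements read by threads i and i+1.
  s_fixed_first  = offset4(10, 2, 1, 1, nfixed, nx, ny) - offset4(10, 1, 1, 1, nfixed, nx, ny)
  s_loop_first   = offset4(1, 2, 1, 10, nk, nx, ny) - offset4(1, 1, 1, 10, nk, nx, ny)
  s_thread_first = offset4(2, 1, 1, 10, nx, ny, nk) - offset4(1, 1, 1, 10, nx, ny, nk)

  print '(a,i0)', 'myFourdArray(10,i,j,k): stride ', s_fixed_first    ! 100 elements
  print '(a,i0)', 'myFourdArray(k,i,j,10): stride ', s_loop_first     ! 101 elements
  print '(a,i0)', 'myFourdArray(i,j,k,10): stride ', s_thread_first   ! 1 element
end program stride_demo
```

Only the layout with i in the first position gives a stride of one element, so only there do neighbouring threads touch neighbouring addresses. With 4-byte reals, the original layout places neighbouring threads 400 bytes apart.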

Answer 2

You could invert the indexing, i.e. let the first dimension change the fastest. Fortran is column-major!

do k = 0, 100 
    regvalue1 = somevalue1
    regvalue2 = somevalue2 
    regvalue3 =  somevalue3 

    d_value(k,i,j)=d_value(k,i,j) +         &
      myFourdArray(k,i,j,10)*regvalue1 +    &
      myFourdArray(k,i,j,32)*regvalue2 +    &
      myFourdArray(k,i,j,45)*regvalue3                   
end do

If the last dimension (the compile-time-fixed one, which comes first in the original layout) is always fixed and not too large, consider individual arrays instead.
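As a rough illustration of the individual-arrays idea, the sketch below pulls single planes out of the packed 4D array into separate 3D arrays on the host. The function name plane and the small sizes are assumptions; only the three planes used in the sample loop (10, 32, 45) are extracted here.

```fortran
module split_mod
  implicit none
contains
  ! Returns the 3D array p(i,j,k) = packed(m,i,j,k) for one fixed m.
  function plane(packed, m) result(p)
    real, intent(in) :: packed(:,:,:,:)
    integer, intent(in) :: m
    real, allocatable :: p(:,:,:)
    allocate(p(size(packed,2), size(packed,3), size(packed,4)))
    p = packed(m,:,:,:)
  end function plane
end module split_mod

program split_demo
  use split_mod
  implicit none
  integer, parameter :: nfixed = 100             ! compile-time extent from the question
  integer, parameter :: nx = 4, ny = 3, nz = 2   ! hypothetical stand-ins for the run-time sizes
  real, allocatable :: packed(:,:,:,:)
  real, allocatable :: plane10(:,:,:), plane32(:,:,:), plane45(:,:,:)

  allocate(packed(nfixed, nx, ny, nz))
  allocate(plane10(nx, ny, nz), plane32(nx, ny, nz), plane45(nx, ny, nz))
  call random_number(packed)

  ! One contiguous 3D array per plane; i is the fastest-varying subscript in each.
  plane10 = plane(packed, 10)
  plane32 = plane(packed, 32)
  plane45 = plane(packed, 45)

  if (any(plane10 /= packed(10,:,:,:))) error stop 'plane 10 mismatch'
  if (any(plane32 /= packed(32,:,:,:))) error stop 'plane 32 mismatch'
  if (any(plane45 /= packed(45,:,:,:))) error stop 'plane 45 mismatch'
  print *, 'planes verified'
end program split_demo
```

Each plane would then be passed to the kernel as its own 3D array and indexed as plane10(i,j,k), which gives the same unit stride between neighbouring threads as the reordered 4D layout.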

In my experience, pointers do not make much difference to speed when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compiler option that enables this with the PGI compiler.

Ah, ok, it is a simple directive:

!$acc do vector
do k=...
enddo
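For readers unfamiliar with the term, strip-mining splits one loop into an outer loop over fixed-size strips and an inner loop within each strip. The plain-Fortran sketch below applies the transformation by hand to a simple sum; the strip length of 32 and the name strip_sum are arbitrary assumptions, and this is only an illustration of the idea, not a statement of what the PGI directive generates.

```fortran
module strip_mod
  implicit none
contains
  ! Sums a(:) with the single loop "do k = 1, size(a)" split into an outer
  ! loop over strips and an inner loop over the elements of one strip.
  pure function strip_sum(a, strip) result(total)
    integer, intent(in) :: a(:)
    integer, intent(in) :: strip
    integer :: total, kk, k
    total = 0
    do kk = 1, size(a), strip
      ! min() handles the last strip, which may be shorter than the others.
      do k = kk, min(kk + strip - 1, size(a))
        total = total + a(k)
      end do
    end do
  end function strip_sum
end module strip_mod

program strip_demo
  use strip_mod
  implicit none
  integer :: a(101), k

  a = [(k, k = 0, 100)]   ! same trip count as the k = 0..100 loop in the sample

  ! The strip-mined loop must visit exactly the same elements as the plain loop.
  if (strip_sum(a, 32) /= sum(a)) error stop 'strip-mined loop missed elements'
  print '(a,i0)', 'strip-mined sum: ', strip_sum(a, 32)
end program strip_demo
```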

Answer 1
2 2013-09-23 13:44:07

Answer 2
1 2013-09-23 12:37:53