
Calculating distances between points using shared memory

I'm trying to calculate the distance (metric weighted) between all points. To get a speedup, I am doing this on the GPU through CUDA with Numba, since I think it's more readable and easier to use.

I have two 1D arrays of 1D points and want to calculate the distance between all points within the same array, as well as between all points across the two arrays. I've written two CUDA kernels. The first uses only global memory, and I have verified that it gives the correct answer against CPU code. Here it is:

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum
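
For reference, here is a minimal CPU sketch of the sum being computed (not my actual verification code, just an illustrative NumPy equivalent of the kernel above):

import numpy as np

def cpuSameSample(A, sigma):
    # for each point, sum exp(-d^2 / sigma^2) over all later points,
    # matching the global-memory kernel above
    arrSum = np.zeros_like(A)
    for tx in range(A.size):
        distance = (A[tx] - A[tx+1:])**2
        arrSum[tx] = np.exp(-distance / sigma**2).sum()
    return arrSum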

I am now trying to optimise this further by using shared memory. This is what I have so far.

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32                                                                                                                                                                           
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block                                                                                                                                                                            
    tempSum = 0.0
    #myPoint = A[tx]                                                                                                                                                                                        

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

    #load in a block to shared memory                                                                                                                                                                   
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count):
                sA[cuda.threadIdx.x] = A[copyIdx]
        #syncthreads to ensure copying finishes first                                                                                                                                                       
            cuda.syncthreads()


            if((tx < count)):
                for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                    if(copyIdx != tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

 #syncthreads here to avoid race conditions if a thread finishes earlier                                                                                                                             
            #arrSum[tx] += tempSum                                                                                                                                                                          
            cuda.syncthreads()
    arrSum[tx] += tempSum

I believe I have been careful about syncing threads, but this kernel gives an answer which is always too large (by about 5%). I'm guessing there must be some race condition, but as I understand it, each thread writes to a unique index and the tempSum variable is local to each thread, so there shouldn't be any race condition. I'm quite sure that my for loop conditions are correct. Any suggestions would be greatly appreciated. Thanks.

It's better if you provide a complete code. It should be straightforward to do this with trivial additions to what you have shown, just as I have done below. However, there are differences between your two realizations even with a restrictive set of assumptions.

I will assume that:

  1. Your overall data set size is a whole number multiple of the size of your threadblock.
  2. You are launching exactly as many threads in total as the size of your data set. (A minimal launch sketch satisfying both assumptions follows this list.)
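
For example, a launch configuration satisfying both assumptions looks like this (an illustrative sketch only, using the same names as the complete code further down):

tpb = 32                  # threads per block
size = 128                # total number of points
assert size % tpb == 0    # assumption 1: size is a whole number multiple of tpb
blocks = size // tpb      # assumption 2: blocks * tpb == size threads in total
# gpuSharedSameSample[blocks, tpb](my_in, my_out)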

I'm also not going to try to comment on whether your shared realization makes sense, i.e. whether it should be expected to perform better than the non-shared realization. That doesn't seem to be the crux of your question, which is why you are getting a numerical difference between the two realizations.

The primary issue is that the method for selecting which elements contribute to the pairwise "distance" sum does not match between the two cases. In the non-shared realization, for every element i in your input data set, you are computing a sum of distances between i and every element greater than i:

for i in range(tx+1,A.size):
               ^^^^^^^^^^^

This selection of items to sum does not match the shared realization:

            for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                if(copyIdx != tx):

There are several issues here, but it should be plainly evident that for each block copied in, a given element at position threadIdx.x only updates its sum when the target element within the block (of data) has a greater index within that block. That means as you go through the total data set block-wise, you will be skipping elements in each block. That could not possibly match the non-shared realization. If this is not evident, just select actual values for the range of the for loop. Suppose cuda.threadIdx.x is 5 and cuda.blockDim.x is 32. Then that particular element will only compute a sum for items 6-31 in each block of data, throughout the array.

The solution to this problem is to force the shared realization to line up with the non-shared one, in terms of how it selects elements to contribute to the running sum.
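
Concretely, the inner loop in the complete code below walks the entire shared tile and only accumulates elements whose global index is greater than tx, which reproduces the range(tx+1, A.size) selection of the non-shared kernel:

for i in range(cuda.blockDim.x):
    # the global index of sA[i] is the block's base index plus i,
    # i.e. copyIdx - cuda.threadIdx.x + i
    if(copyIdx-cuda.threadIdx.x+i > tx):
        distance = (myPoint - sA[i])**2
        tempSum += math.exp(-distance/sigma**2)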

In addition, in the non-shared realization you are updating the output point only once, and you are doing a direct assignment:

arrSum[tx] = tempSum

In the shared realization, you are still only updating the output point once; however, you are not doing a direct assignment (you have arrSum[tx] += tempSum). I changed this to match the non-shared:

arrSum[tx] = tempSum

Here is a complete code with those issues addressed:

$ cat t49.py
from numba import cuda
import numpy as np
import math
import time
from numba import float32

sigma = np.float32(1.0)
tpb = 32

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32                                                                                                                               
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block                                                                                                                                
    tempSum = 0.0
    #myPoint = A[tx]                                                                                                                                            

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

    #load in a block to shared memory                                                                                                                           
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count): #this should always be true
                sA[cuda.threadIdx.x] = A[copyIdx]
        #syncthreads to ensure copying finishes first                                                                                                           
            cuda.syncthreads()


            if((tx < count)):    #this should always be true
                for i in range(cuda.blockDim.x):
                    if(copyIdx-cuda.threadIdx.x+i > tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

 #syncthreads here to avoid race conditions if a thread finishes earlier                                                                                        
            #arrSum[tx] += tempSum                                                                                                                              
            cuda.syncthreads()
    arrSum[tx] = tempSum

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum

size = 128
threads_per_block = tpb
blocks = (size + (threads_per_block - 1)) // threads_per_block
my_in  = np.ones( size, dtype=np.float32)
my_out = np.zeros(size, dtype=np.float32)
gpuSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
gpuSharedSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
$ python t49.py
[ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
  115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
  103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
   91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
   79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
   67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
   55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
   43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
   31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
   19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
    7.    6.    5.    4.    3.    2.    1.    0.]
[ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
  115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
  103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
   91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
   79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
   67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
   55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
   43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
   31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
   19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
    7.    6.    5.    4.    3.    2.    1.    0.]
$
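
If you want a programmatic check rather than eyeballing the printout, a small addition (illustrative only, not part of the code above) is to run each kernel into its own output array and compare the results:

out_global = np.zeros(size, dtype=np.float32)
out_shared = np.zeros(size, dtype=np.float32)
gpuSameSample[blocks, threads_per_block](my_in, out_global)
gpuSharedSameSample[blocks, threads_per_block](my_in, out_shared)
np.testing.assert_allclose(out_global, out_shared, rtol=1e-5)  # raises if they disagree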

Note that if either of my two assumptions is violated, your code has other issues.

In the future, I encourage you to provide a short, complete code, as I have shown above. For a question like this, it should not be much additional work. If for no other reason (and there are other reasons), it's tedious to force others to write this code from scratch, when you already have it, in order to demonstrate the sensibility of the answer provided.
