Calculating distances between points using shared memory
I'm trying to calculate the (metric-weighted) distance between all points. To get a speedup, I am doing this on the GPU through CUDA and Numba, since I think Numba is more readable and easier to use.
I have two 1D arrays of 1D points, and I want to calculate the distance between all points within the same array and between all points across the two arrays. I've written two CUDA kernels. The first uses only global memory; I have verified that it gives the correct answer against CPU code. This is it:
@cuda.jit
def gpuSameSample(A, arrSum):
    tx = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx + 1, A.size):
        distance = (temp - A[i])**2
        tempSum += math.exp(-distance / sigma**2)
    arrSum[tx] = tempSum
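The CPU code used for verification is not shown in the question; a minimal NumPy equivalent (a sketch of that kind of check, not the original verification code) would look like:

```python
import numpy as np

sigma = 1.0

def cpuSameSample(A):
    # For each point tx, sum exp(-d^2 / sigma^2) over all points after tx,
    # matching the kernel's range(tx + 1, A.size) loop.
    out = np.zeros_like(A)
    for tx in range(A.size):
        d2 = (A[tx] - A[tx + 1:]) ** 2
        out[tx] = np.exp(-d2 / sigma ** 2).sum()
    return out

# With an all-ones input every distance is 0 and exp(0) = 1,
# so out[tx] = A.size - 1 - tx.
print(cpuSameSample(np.ones(8, dtype=np.float32)))
```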
I am now trying to optimise this further by using shared memory. This is what I have so far:
@cuda.jit
def gpuSharedSameSample(A, arrSum):
    # my block size is equal to 32
    sA = cuda.shared.array(shape=(tpb), dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    count = len(A)
    # loop through block by block
    tempSum = 0.0
    # myPoint = A[tx]
    if(tx < count):
        myPoint = A[tx]
    for currentBlock in range(bpg):
        # load in a block to shared memory
        copyIdx = (cuda.threadIdx.x + currentBlock * cuda.blockDim.x)
        if(copyIdx < count):
            sA[cuda.threadIdx.x] = A[copyIdx]
        # syncthreads to ensure copying finishes first
        cuda.syncthreads()
        if((tx < count)):
            for i in range(cuda.threadIdx.x, cuda.blockDim.x):
                if(copyIdx != tx):
                    distance = (myPoint - sA[i])**2
                    tempSum += math.exp(-distance / sigma**2)
        # syncthreads here to avoid race conditions if a thread finishes earlier
        # arrSum[tx] += tempSum
        cuda.syncthreads()
    arrSum[tx] += tempSum
I believe I have been careful about syncing threads, but this kernel gives an answer which is always too large (by about 5%). I'm guessing there must be some race condition, but as I understand it, each thread writes to a unique index and the tempSum variable is local to each thread, so there shouldn't be any race condition. I'm quite sure that my for-loop conditions are correct. Any suggestions would be greatly appreciated. Thanks.
It's better if you provide a complete code. It should be straightforward to do this with trivial additions to what you have shown, just as I have done below. However, there are differences between your two realizations even with a restrictive set of assumptions.
I will assume that your overall data set size is a whole-number multiple of your threadblock size, and that you launch exactly one thread per data element, so that the bounds checks marked "this should always be true" in the code below always pass.
I'm also not going to try to comment on whether your shared realization makes sense, i.e. whether it should be expected to perform better than the non-shared realization. That doesn't seem to be the crux of your question, which is why you are getting a numerical difference between the two realizations.
The primary issue is that your method for selecting which elements to include in each pairwise "distance" sum does not match between the two cases. In the non-shared realization, for every element i in your input data set, you are computing a sum of distances between i and every element greater than i:
for i in range(tx+1,A.size):
               ^^^^^^^^^^^
This selection of items to sum does not match the shared realization:

for i in range(cuda.threadIdx.x,cuda.blockDim.x):
    if(copyIdx != tx):
There are several issues here, but it should be plainly evident that for each block copied in, a given element at position threadIdx.x is only updating its sum if the target element within the block (of data) is greater than that index. That means as you go through the total data set block-wise, you will be skipping elements in each block. That could not possibly match the non-shared realization. If this is not evident, just select actual values for the range of the for loop. Suppose cuda.threadIdx.x is 5 and cuda.blockDim.x is 32. Then that particular element will only compute a sum for items 6-31 in each block of data, throughout the array.
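To make the skipped (and wrongly included) elements concrete, the two selection rules can be enumerated on the host with plain Python (a toy size of 8 elements and a block size of 4, chosen here just for illustration):

```python
size, tpb = 8, 4        # toy sizes for illustration
bpg = size // tpb

def nonshared_elements(tx):
    # Elements summed by the non-shared kernel: everything after tx.
    return set(range(tx + 1, size))

def shared_elements_buggy(tx):
    # Elements summed by the shared kernel as written in the question.
    tid = tx % tpb
    picked = set()
    for block in range(bpg):
        copyIdx = tid + block * tpb
        for i in range(tid, tpb):      # inner loop starts at threadIdx.x
            if copyIdx != tx:          # guard tests copyIdx, not the element
                picked.add(block * tpb + i)
    return picked

print(sorted(nonshared_elements(1)))      # [2, 3, 4, 5, 6, 7]
print(sorted(shared_elements_buggy(1)))   # [5, 6, 7]: block 0 skipped entirely
print(sorted(shared_elements_buggy(5)))   # [1, 2, 3]: wrong elements included
```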
The solution to this problem is to force the shared realization to line up with the non-shared one, in terms of how it selects elements to contribute to the running sum.
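Since copyIdx - cuda.threadIdx.x equals currentBlock * blockDim.x, a guard of the form copyIdx - cuda.threadIdx.x + i > tx compares the global index of sA[i] against tx, which reproduces the non-shared kernel's range(tx+1, A.size). A quick host-side sketch of that equivalence:

```python
size, tpb = 128, 32
bpg = size // tpb

for tx in range(size):
    tid = tx % tpb
    picked = set()
    for block in range(bpg):
        copyIdx = tid + block * tpb
        for i in range(tpb):                 # loop over the whole block
            if copyIdx - tid + i > tx:       # corrected guard
                picked.add(block * tpb + i)  # global index behind sA[i]
    # exactly the elements the non-shared kernel sums
    assert picked == set(range(tx + 1, size))

print("corrected guard matches range(tx + 1, size) for every tx")
```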
In addition, in the non-shared realization you are updating the output point only once, and you are doing a direct assignment:
arrSum[tx] = tempSum
In the shared realization, you are still only updating the output point once, but you are not doing a direct assignment. I changed this to match the non-shared version:

arrSum[tx] = tempSum
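The direct assignment matters here because the driver code below reuses the same my_out buffer for both launches: an accumulating update would fold the first kernel's results into the second's. A tiny host-side illustration:

```python
import numpy as np

my_out = np.array([3.0, 2.0, 1.0, 0.0])  # results left by a previous launch

tempSum = 3.0          # what thread 0 of the next launch computed
my_out[0] += tempSum   # accumulating update: stale data leaks in
print(my_out[0])       # 6.0, not 3.0

my_out[0] = tempSum    # direct assignment: correct
print(my_out[0])       # 3.0
```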
Here is a complete code with those issues addressed:
$ cat t49.py
from numba import cuda
import numpy as np
import math
import time
from numba import float32

sigma = np.float32(1.0)
tpb = 32

@cuda.jit
def gpuSharedSameSample(A, arrSum):
    # my block size is equal to 32
    sA = cuda.shared.array(shape=(tpb), dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    count = len(A)
    # loop through block by block
    tempSum = 0.0
    # myPoint = A[tx]
    if(tx < count):
        myPoint = A[tx]
    for currentBlock in range(bpg):
        # load in a block to shared memory
        copyIdx = (cuda.threadIdx.x + currentBlock * cuda.blockDim.x)
        if(copyIdx < count):  # this should always be true
            sA[cuda.threadIdx.x] = A[copyIdx]
        # syncthreads to ensure copying finishes first
        cuda.syncthreads()
        if((tx < count)):  # this should always be true
            for i in range(cuda.blockDim.x):
                if(copyIdx - cuda.threadIdx.x + i > tx):
                    distance = (myPoint - sA[i])**2
                    tempSum += math.exp(-distance / sigma**2)
        # syncthreads here to avoid race conditions if a thread finishes earlier
        # arrSum[tx] += tempSum
        cuda.syncthreads()
    arrSum[tx] = tempSum

@cuda.jit
def gpuSameSample(A, arrSum):
    tx = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx + 1, A.size):
        distance = (temp - A[i])**2
        tempSum += math.exp(-distance / sigma**2)
    arrSum[tx] = tempSum

size = 128
threads_per_block = tpb
blocks = (size + (threads_per_block - 1)) // threads_per_block
my_in = np.ones( size, dtype=np.float32)
my_out = np.zeros(size, dtype=np.float32)
gpuSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
gpuSharedSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
$ python t49.py
[ 127. 126. 125. 124. 123. 122. 121. 120. 119. 118. 117. 116.
115. 114. 113. 112. 111. 110. 109. 108. 107. 106. 105. 104.
103. 102. 101. 100. 99. 98. 97. 96. 95. 94. 93. 92.
91. 90. 89. 88. 87. 86. 85. 84. 83. 82. 81. 80.
79. 78. 77. 76. 75. 74. 73. 72. 71. 70. 69. 68.
67. 66. 65. 64. 63. 62. 61. 60. 59. 58. 57. 56.
55. 54. 53. 52. 51. 50. 49. 48. 47. 46. 45. 44.
43. 42. 41. 40. 39. 38. 37. 36. 35. 34. 33. 32.
31. 30. 29. 28. 27. 26. 25. 24. 23. 22. 21. 20.
19. 18. 17. 16. 15. 14. 13. 12. 11. 10. 9. 8.
7. 6. 5. 4. 3. 2. 1. 0.]
[ 127. 126. 125. 124. 123. 122. 121. 120. 119. 118. 117. 116.
115. 114. 113. 112. 111. 110. 109. 108. 107. 106. 105. 104.
103. 102. 101. 100. 99. 98. 97. 96. 95. 94. 93. 92.
91. 90. 89. 88. 87. 86. 85. 84. 83. 82. 81. 80.
79. 78. 77. 76. 75. 74. 73. 72. 71. 70. 69. 68.
67. 66. 65. 64. 63. 62. 61. 60. 59. 58. 57. 56.
55. 54. 53. 52. 51. 50. 49. 48. 47. 46. 45. 44.
43. 42. 41. 40. 39. 38. 37. 36. 35. 34. 33. 32.
31. 30. 29. 28. 27. 26. 25. 24. 23. 22. 21. 20.
19. 18. 17. 16. 15. 14. 13. 12. 11. 10. 9. 8.
7. 6. 5. 4. 3. 2. 1. 0.]
$
Note that if either of my two assumptions is violated, your code has other issues.
In the future, I encourage you to provide a short, complete code, as I have shown above. For a question like this, it should not be much additional work. If for no other reason (and there are other reasons), it's tedious to force others to write this code from scratch, when you already have it, in order to demonstrate the sensibility of the answer provided.