Distribute the threads between blocks in CUDA
I'm working on a project in CUDA. At first I used only one block with
dim 8*8
as my matrix. I then calculated the index as follows:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
And it gave me the correct answer. After that, I wanted to distribute the threads between blocks to measure the performance, so I made the grid dim (2,1) and the block dim (4,8).
When I trace the code by hand, the formula above still seems to give the correct indices. But when I run the program, the screen hangs and the results are all zero.
What did I do wrong, and how can I fix it?
This is the kernel function:
__global__ void cover_fault(int *a, int *b, int *c, int *d, int *mulFV1, int *mulFV2, int *checkDalU1, int *checkDalU2, int N)
{
    // Fig. 2
    __shared__ int f[9][9];
    __shared__ int compV1[9], compV2[9];
    int dalU1[9], dalU2[9];
    int Ra = 2, Ca = 2;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            f[i][j] = 0;
    f[3][0] = 1;
    f[0][2] = 1;
    f[0][6] = 1;
    f[3][7] = 1;
    f[2][4] = 1;
    f[6][4] = 1;
    f[7][1] = 1;
    int t = 0, A = 1, B = 1, UTP = 5, LTP = -5, U_max = 40, U_min = -160;
    bool flag = true;
    int sumV1, sumV2;
    int checkZero1, checkZero2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    while (flag == true)
    {
        if (c[idy] == 0)
            compV1[idy] = 1;
        else if (c[idy] == 1)
            compV1[idy] = 0;
        if (d[idy] == 0)
            compV2[idy] = 1;
        else if (d[idy] == 1)
            compV2[idy] = 0;
        sumV1 = reduce(c, N);
        sumV2 = reduce(d, N);
        if (idx < N && idy < N)
        {
            if (idx == 0)
                mulFV1[idy] = 0;
            if (idy == 0)
                mulFV2[idx] = 0;
            __syncthreads();
            atomicAdd(&(mulFV1[idy]), f[idy][idx] * compV2[idx]);
            atomicAdd(&(mulFV2[idx]), f[idy][idx] * compV1[idy]);
        }
        dalU1[idy] = (-1 * A * (sumV1 - Ra)) + (B * mulFV1[idy] * compV1[idy]);
        dalU2[idy] = (-1 * A * (sumV2 - Ca)) + (B * mulFV2[idy] * compV2[idy]);
        a[idy] = a[idy] + dalU1[idy];
        b[idy] = b[idy] + dalU2[idy];
        if (a[idy] > U_max)
            a[idy] = U_max;
        else if (a[idy] < U_min)
            a[idy] = U_min;
        if (b[idy] > U_max)
            b[idy] = U_max;
        else if (b[idy] < U_min)
            b[idy] = U_min;
        if (dalU1[idy] == 0)
            checkDalU1[idy] = 0;
        else
            checkDalU1[idy] = 1;
        if (dalU2[idy] == 0)
            checkDalU2[idy] = 0;
        else
            checkDalU2[idy] = 1;
        __syncthreads();
        checkZero1 = reduce(checkDalU1, N);
        checkZero2 = reduce(checkDalU2, N);
        if (checkZero1 == 0 && checkZero2 == 0)
            flag = false;
        else
        {
            if (a[idy] > UTP)
                c[idy] = 1;
            else if (a[idy] < LTP)
                c[idy] = 0;
            if (b[idy] > UTP)
                d[idy] = 1;
            else if (b[idy] < LTP)
                d[idy] = 0;
            t++;
        } // end else
        sumV1 = 0;
        sumV2 = 0;
        mulFV1[idy] = 0;
        mulFV2[idy] = 0;
    } // end while
} // end function
In your index computation, idx
gives you the column index and idy
the row index. Are you accessing your matrix as M[idy][idx]
?
CUDA threads are organized on an orthogonal system: X is horizontal (columns) and Y is vertical (rows). So the element M[0][1] of the actual matrix is reached by the thread whose coordinates are x=1, y=0; if you index with the coordinates swapped, you access the transposed position M[1][0] instead.