Matmul in OpenACC Fortran loop
While accelerating a Fortran code with OpenACC using the PGI compiler, I ran into a problem with a matmul call inside an accelerated loop.
In the simplified example below, I apply the identity matrix to two vectors, so the input and output values should be the same:
program test
  implicit none

  integer :: a(3, 3)
  integer :: v1(3, 2), v2(3, 2)
  integer :: i

  a = reshape([1, 0, 0, 0, 1, 0, 0, 0, 1], [3, 3])
  v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])

  print *, v1
  !$acc kernels copyin(a, v1) copyout(v2)
  !$acc loop independent
  do i = 1, 2
    v2(:, i) = matmul(a, v1(:, i))
  enddo
  !$acc end kernels
  print *, v2
endprogram
When compiling with the PGI compiler version 20.9, I get the following messages:
test:
12, Generating copyin(a(:,:),v1(:,:)) [if not already present]
Generating implicit copyout(z_a_0(:)) [if not already present]
Generating copyout(v2(:,:)) [if not already present]
14, Loop is parallelizable
Generating Tesla code
14, !$acc loop gang ! blockidx%x
15, !$acc loop vector(32) ! threadidx%x
15, Loop is parallelizable
Running the code gives the following values:
1 2 3 4 5 6
4 5 6 4 5 6
The second line should match the first, which is the case with sequential execution. What is wrong in the code?
This looks like a compiler issue. The problem line is:
Generating implicit copyout(z_a_0(:))
"z_a_0" is a compiler temporary array created to hold the intermediate result of the call to matmul. Its declaration is being hoisted out of the loop, and the array is then copied back in as a shared array. Because it is shared between the loop iterations, it causes a race condition.
I've submitted a problem report (TPR #29482) and sent it to our engineers for further investigation.
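To make the race easier to picture, here is a small host-only sketch that emulates one possible interleaving: both iterations write their matmul result into a single shared temporary before either reads it back. The array `tmp` is a hypothetical stand-in for the compiler temporary `z_a_0`, and the two-phase loop is an illustration of the effect, not the code the compiler actually generates.

```fortran
program shared_temp_sketch
  implicit none

  integer :: a(3, 3)
  integer :: v1(3, 2), v2(3, 2)
  integer :: tmp(3)   ! hypothetical stand-in for the shared temporary z_a_0
  integer :: i

  a = reshape([1, 0, 0, 0, 1, 0, 0, 0, 1], [3, 3])
  v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])

  ! Phase 1: every iteration stores its result in the same temporary,
  ! so the last writer (i = 2) wins.
  do i = 1, 2
    tmp = matmul(a, v1(:, i))
  enddo

  ! Phase 2: every iteration copies the temporary into its own column.
  do i = 1, 2
    v2(:, i) = tmp
  enddo

  print *, v1
  print *, v2
end program
```

Run sequentially on the host, this prints the values 1 2 3 4 5 6 followed by 4 5 6 4 5 6, which matches the incorrect output reported in the question. On the device the actual ordering is not guaranteed, so other wrong results would be possible as well.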
@Mat Colgrove explained the reason for the incorrect behavior. The workaround I found was to write the matrix-vector multiplication explicitly:
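program test
  implicit none

  integer :: a(3, 3)
  integer :: v1(3, 2), v2(3, 2)
  integer :: i, j, k

  a = reshape([1, 0, 0, 0, 1, 0, 0, 0, 1], [3, 3])
  v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])

  print *, v1
  !$acc kernels copyin(a, v1) copyout(v2)
  !$acc loop independent
  do i = 1, 2              ! one independent iteration per vector
    !$acc loop seq
    do k = 1, 3            ! row of the result
      v2(k, i) = 0
      !$acc loop seq
      do j = 1, 3          ! accumulate over the columns of a
        ! a(k, j), not a(j, k): the latter multiplies by transpose(a),
        ! which only happens to agree for symmetric matrices like the identity
        v2(k, i) = v2(k, i) + a(k, j) * v1(j, i)
      enddo
    enddo
  enddo
  !$acc end kernels
  print *, v2
endprogram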
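Since the identity matrix hides index mistakes, it may be worth checking the explicit loop against matmul with a non-symmetric matrix. The following is a minimal sketch of such a check; the test matrix and the `ref` array are assumptions added for illustration and are not part of the original code. The reference is computed on the host, outside the accelerated region, so it is not affected by the temporary-array issue.

```fortran
program check_explicit_matvec
  implicit none

  integer :: a(3, 3)
  integer :: v1(3, 2), v2(3, 2), ref(3, 2)
  integer :: i, j, k

  ! Non-symmetric test matrix (illustrative values, not from the post)
  a = reshape([1, 2, 3, 4, 5, 6, 7, 8, 10], [3, 3])
  v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])

  ! Host reference using the intrinsic, outside any accelerated region
  do i = 1, 2
    ref(:, i) = matmul(a, v1(:, i))
  enddo

  !$acc kernels copyin(a, v1) copyout(v2)
  !$acc loop independent
  do i = 1, 2
    !$acc loop seq
    do k = 1, 3
      v2(k, i) = 0
      !$acc loop seq
      do j = 1, 3
        ! Row index k first: v2(k) = sum over j of a(k, j) * v1(j)
        v2(k, i) = v2(k, i) + a(k, j) * v1(j, i)
      enddo
    enddo
  enddo
  !$acc end kernels

  if (all(v2 == ref)) then
    print *, 'explicit loop matches matmul'
  else
    print *, 'MISMATCH between explicit loop and matmul'
  endif
end program
```

Without OpenACC enabled the directives are treated as comments, so the same file can also be built as a plain sequential program to compare host and device behavior.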