
ArrayFire Vectorization

I'm trying to speed up the following calculations but have not been able to reach the desired speed. I'm sure the issue is with my code and not the physical limitations of the GPU.

I have a matrix V that is 10,000 x 6 x 6, and another matrix P that is 6 x 1,000.

Both are complex.

I need to compute V * P (which should result in a 10,000 x 6 x 1,000 array), take its magnitude (or magnitude squared), and then sum over the 6 dimension, resulting in a 10,000 x 1,000 array of real values.

I have tried the following:

af::array V{ 10000, 6, 6, c32 };
af::array P{ 6, 1000, c32 };
af::array VP = af::matmul(V, P); // results in 10,000 x 1,000 x 6 - ok, as long as I still sum over the 6 dim
af::array res = af::sum(af::abs(VP), 2);

This was not nearly fast enough. Then I tried splitting V into a C array of matrices, so I had:

af::array V[6] = { af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 },
                   af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 },
                   af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 } };
af::array VP[6];
af::array res;
for (int i = 0; i < 6; i++)
{
    VP[i] = af::matmul(V[i], P);
}
res = af::abs(VP[0]);

for (int i = 1; i < 6; i++)
{
    res += af::abs(VP[i]);
}

This gave about a 2x speedup. I came up with another solution, but the af::matmul overload that takes three arrays doesn't support options (like Hermitian transposes) and doesn't support gfor, so I couldn't try that route.

Currently, the matrix multiply (in both approaches) takes about 2.2 ms, and it looks like ArrayFire can combine the abs and sum into one JIT kernel that takes about 2 ms.
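For reference, a minimal sketch of how such per-stage timings can be measured (random stand-in data; the variable names are illustrative, not the original code). Because ArrayFire evaluates lazily, each stage needs an eval() followed by af::sync() before reading the timer, otherwise the work spills into the next measurement.

#include <arrayfire.h>
#include <cstdio>

int main() {
    af::array V = af::randu(10000, 6, 6, c32);   // stand-in for the real V
    af::array P = af::randu(6, 1000, c32);       // stand-in for the real P

    // Warm-up run so JIT compilation and allocation don't skew the timings.
    af::array warm = af::sum(af::abs(af::matmul(V, P)), 2);
    warm.eval();
    af::sync();

    af::timer t0 = af::timer::start();
    af::array VP = af::matmul(V, P);             // 10,000 x 1,000 x 6
    VP.eval();
    af::sync();                                  // wait before reading the timer
    double tMatmul = af::timer::stop(t0);

    af::timer t1 = af::timer::start();
    af::array res = af::sum(af::abs(VP), 2);     // abs + sum fused by the JIT
    res.eval();
    af::sync();
    double tAbsSum = af::timer::stop(t1);

    printf("matmul: %.3f ms, abs+sum: %.3f ms\n", tMatmul * 1e3, tAbsSum * 1e3);
    return 0;
}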

My knowledge of ArrayFire is limited, so I'm guessing there is something I'm not thinking of. Does anyone have an idea of how I can increase the speed of this algorithm?

Thank you!

I can confirm your finding that the looped version is about twice as fast as the batched matmul. The matmul on its own is not what takes most of the runtime in your code snippet; it is the other operation, summing along the third dimension after abs, that is costly. This is due to the following reasons.

1) sum(abs(result)) - abs is again not the issue here. Sum is a reduction algorithm, and reductions are usually quite fast along the fast-moving (first) dimension. However, for a reduction along a higher dimension, the stride between successive elements is the size of an entire matrix slice. That is expensive compared to a reduction over contiguous locations.

2) looped abs additions - This version, however, accesses elements that are contiguous in memory, because we are basically adding the respective elements of 6 matrices. On top of that, the entire loop (along with the abs operation) is converted into a single JIT kernel, which is very efficient and does roughly the following:

res = res + ptr0[i] + ptr1[i] + ptr2[i] + ptr3[i] + ptr4[i] + ptr5[i]

The above line is just for illustration; it is not the exact JIT kernel.
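As a small sketch of this fusion in practice (random stand-in arrays in place of the actual matmul results; an illustration under those assumptions, not the OP's code): the loop only builds a lazy expression tree, and a single element-wise kernel runs when the result is evaluated.

af::array VP[6];
for (int i = 0; i < 6; ++i)
    VP[i] = af::randu(10000, 1000, c32);   // stand-ins for the six matmul results

af::array res = af::abs(VP[0]);
for (int i = 1; i < 6; ++i)
    res = res + af::abs(VP[i]);            // no kernel launched yet, the JIT tree grows

res.eval();                                // one fused element-wise kernel over contiguous memory
af::sync();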

Hence, the batched version is slower than the looped version in this specific case, because of the reduction operation that is performed on the result of the matmul.

My test GPU: GTX 1060

The matmul itself for a single [10k x 6] * [6 x 1k] is about half a millisecond on a GTX 1060. Six such matmuls can't be done in under a millisecond on my GTX 1060, at least I would think. What is your target runtime?
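As a hedged sketch of how one could reproduce that half-millisecond number on their own GPU, af::timeit can average a single [10k x 6] * [6 x 1k] matmul over several runs. The globals gA/gB and the helper singleMatmul() are hypothetical names, used only because af::timeit expects a plain void(*)() function pointer.

#include <arrayfire.h>
#include <cstdio>

static af::array gA, gB;   // stand-in operands, visible to the timed function

static void singleMatmul() {
    af::array out = af::matmul(gA, gB);
    out.eval();            // force the asynchronous matmul to be launched
}

int main() {
    gA = af::randu(10000, 6, c32);
    gB = af::randu(6, 1000, c32);
    double sec = af::timeit(singleMatmul);   // averaged over multiple runs
    printf("single [10k x 6] * [6 x 1k] matmul: %.3f ms\n", sec * 1e3);
    return 0;
}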

EDITED (Jan 10, 2020): Actually, this won't work because of the abs operation on the result of each matmul.

You can try looking into our latest addition to the gemm category in the master branch of ArrayFire. However, you will have to build ArrayFire from source until our next feature release, 3.7. You can look at the documentation on the following page.

https://github.com/arrayfire/arrayfire/blob/master/include/af/blas.h#L230

It follows the principle of the C array from the cuBLAS gemm API (the output matrix C that can be accumulated into, as in C = alpha*A*B + beta*C).
