
Dense Matrix-vector multiplication in VexCL

VexCL seems to be a very attractive library for GPU programming.

Unfortunately, it is a very young library, and there is not much information about it out there. I have been searching for how to perform a matrix-vector multiplication, but the only matrix representation I have found is vex::SpMat, which holds a sparse matrix.

If a matrix is dense, then a sparse representation is usually less efficient for computation.

All my matrices are dense, and I want to know how to perform this operation efficiently in VexCL.

I am the developer of the VexCL library.

I have to admit that dense linear algebra operations are not on my priority list. I believe it would be very hard to implement them in a way that is performance-portable across the various devices supported by VexCL (that is, by OpenCL/CUDA). This task is probably best left to the vendor BLAS implementations (but patches are welcome!).

You may also want to look at the open-source ViennaCL library, which does provide dense matrix operations and supports OpenCL, CUDA, and OpenMP backends. Its autotuning framework allows it to achieve portable performance close to that of vendor-tuned libraries.
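As a rough sketch of what the ViennaCL route could look like (the dense_prod helper below is only illustrative; viennacl::linalg::prod is the library's dense matrix-vector product, but consult its documentation for the details):

#include <vector>

#include <viennacl/matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>

// Illustrative helper: y = A * x for a dense n x m matrix.
std::vector<double> dense_prod(
        std::vector<std::vector<double>> const &host_A, // n rows of m values
        std::vector<double>              const &host_x)
{
    size_t n = host_A.size(), m = host_x.size();

    viennacl::matrix<double> A(n, m);
    viennacl::vector<double> x(m), y(n);

    // Transfer the host data to the compute device.
    viennacl::copy(host_A, A);
    viennacl::copy(host_x, x);

    // Dense matrix-vector product.
    y = viennacl::linalg::prod(A, x);

    // Transfer the result back to the host.
    std::vector<double> host_y(n);
    viennacl::copy(y, host_y);
    return host_y;
}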

Having said that, you have a couple of options (aside from providing a custom kernel) for the dense matrix-vector product in VexCL. First, you may use a direct implementation based on the definition of the matrix-vector product:

#include <vexcl/vexcl.hpp>

using namespace vex;

// n and m are the matrix dimensions, assumed to be defined elsewhere.
Context ctx(Filter::Env && Filter::Count(1));

// The n x m matrix stored row-wise.
vector<double> A(ctx, n * m);
// The LHS and RHS vectors.
vector<double> x(ctx, m);
vector<double> y(ctx, n);

// Make an n x m matrix from vector x by replicating it along the first
// dimension (reshape), multiply it elementwise by A, and reduce the result
// along the second dimension.
// In other words, y_i = sum_j (A_ij * x_j)
y = reduce<SUM>(
        extents[n][m],  // Shape of the expression to reduce,
        A * reshape(
                x,
                extents[n][m], // (We need an n x m matrix...
                extents[1]     // ... but we only have vector of size m).
            ),          // the expression,
        1               // and the dimension to reduce along.
        );

With C++14 this could easily be hidden away inside a function call:

template <class M, class V>
auto prod(size_t n, size_t m, M &&A, V &&x) {
    using namespace vex;
    auto NxM = extents[n][m];
    return reduce<SUM>(NxM, A * reshape(x, NxM, extents[1]), 1);
}
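The call site then collapses to a single assignment. Since prod() returns a lazy vector expression, the actual computation happens when the expression is assigned to y:

y = prod(n, m, A, x);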

Second, you may just use a vendor-specific library. For example, if you use the CUDA backend of VexCL, you can get raw pointers to the VexCL-allocated memory regions and call cuBLAS gemv:

double one  = 1;
double zero = 0;

// cuBLAS assumes column-major storage, so the row-wise n x m matrix A is
// passed as a column-major m x n matrix and transposed back by gemv,
// computing y = 1 * A x + 0 * y.
cublasDgemv(
        cublas_handle, CUBLAS_OP_T, m, n,
        &one,
        A(0).raw_ptr(), m,
        x(0).raw_ptr(), 1,
        &zero,
        y(0).raw_ptr(), 1
        );
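The cublas_handle above is assumed to have been initialized elsewhere; a minimal setup sketch looks like this:

#include <cublas_v2.h>

cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);

// ... cublasDgemv() calls as above ...

cublasDestroy(cublas_handle);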

The first approach should be less efficient than a call to cuBLAS. Its advantage is that the result of the reduce() call is a vector expression, and you could in principle combine several of those into a single fused compute kernel. For example, you could compute Ax + By, sin(Ax) + cos(By), (A + B)(x - y), or any other vector expression in a single kernel call:

z = prod(n, m, A, x) + prod(n, m, B, y);
z = sin(prod(n, m, A, x)) + cos(prod(n, m, B, y));
z = prod(n, m, A + B, x - y);

This could be more efficient than several chained cuBLAS calls. I have examples where VexCL outperforms cuBLAS by a factor of 1.5 due to this.
