Optimized CUDA matrix Hamming distance
Is anyone aware of an optimized CUDA kernel for computing a GEMM-style Hamming distance between two matrices of dimensions A x N and N x B? The problem is nearly identical to GEMM, but instead computes sum( a_n != b_n ) for each pair of vectors {1 ... N}, rather than multiplying and summing the vector elements.

I wanted to verify before writing my own, since this problem is relatively common, but I haven't had success in finding code for it yet. Suggestions for code to modify would be excellent as well.
EDIT:

In addition to kangshiyin's suggestions below, I found this walk-through of an optimized SGEMM implementation extraordinarily helpful for understanding the steps beyond the basic shared-memory matrix multiplication example in the CUDA C Programming Guide.
You are right that you could write your kernel by modifying gemm() code. The CUDA samples have a simple implementation of gemm(), but it is too simple. Its performance is bounded by shared memory access, giving only ~250 Gflops on Kepler devices. For higher performance, you may want to check the gemm() code in MAGMA.

http://icl.cs.utk.edu/magma/index.html
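Concretely, turning the basic tiled shared-memory gemm() pattern into a Hamming-distance kernel might look like the sketch below; the kernel name, the int element type, and the 16x16 tile size are assumptions. The only substantive change from GEMM is the compare-and-add in the inner loop:

```cuda
#define TILE 16

// Sketch of a tiled Hamming-distance kernel, adapted from the
// shared-memory matrix-multiply example in the CUDA C Programming Guide.
// A is hA x n (row-major), B is n x wB (row-major), C is hA x wB.
__global__ void hammingKernel(const int* A, const int* B, int* C,
                              int hA, int n, int wB) {
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int acc = 0;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with 0 on both sides,
        // so they contribute 0 != 0, i.e. nothing, to the count.
        As[threadIdx.y][threadIdx.x] =
            (row < hA && aCol < n) ? A[row * n + aCol] : 0;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < n && col < wB) ? B[bRow * wB + col] : 0;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += (As[threadIdx.y][k] != Bs[k][threadIdx.x]);  // != instead of *
        __syncthreads();
    }
    if (row < hA && col < wB)
        C[row * wB + col] = acc;
}
```

This keeps the same memory-access pattern as the tiled GEMM example, so the same tuning steps (wider tiles, register blocking) discussed in the resources above should carry over.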
These two papers also show how to implement and tune gemm().

http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Papers/Autotuning/Autotuning%20GEMM%20Kernels%20for%20the%20Fermi%20GPU.pdf

http://www.netlib.org/lapack/lawnspdf/lawn267.pdf
Unlike gemm(), which has hardware support in the form of the FMA instruction for fast multiply-and-add, your desired compare-and-add operation may need more instructions, so the performance should be lower. Considering that the peak performance of gemm() is ~3 Tflops on Kepler, you may be able to get 0.5~2 Tflops for the Hamming distance matrix calculation.