
How to leverage blocks/grid and threads/block?

I'm trying to accelerate a database search application with CUDA by running its core algorithm in parallel on the GPU.

In one test, I ran the algorithm in parallel over a digital sequence of size 5000, with 500 blocks per grid and 100 threads per block, and got a run time of roughly 500 ms.

Then I increased the size of the digital sequence to 8192, with 128 blocks per grid and 64 threads per block, and somehow the algorithm ran in 350 ms.

This indicates that the number of blocks and threads used, and how they relate to each other, does impact performance.

My question is: how do I decide the number of blocks per grid and threads per block?

Below are my GPU specs from a standard device-query program: [image: device query output]

You should test it, because it depends on your particular kernel. One thing you must aim for is making the number of threads per block a multiple of the warp size. After that, you can aim for high occupancy on each SM, but that is not always synonymous with higher performance: it has been shown that lower occupancy can sometimes give better performance. Memory-bound kernels usually benefit more from higher occupancy, which helps hide memory latency; compute-bound kernels, not so much. Testing the various configurations is your best bet.
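One way to reduce the amount of hand-tuning is the occupancy API that the CUDA runtime provides (`cudaOccupancyMaxPotentialBlockSize`, available since CUDA 6.5), which suggests a block size that maximizes theoretical occupancy for a given kernel. A sketch, assuming a simple element-wise kernel named `searchKernel` that stands in for the poster's actual algorithm:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for the poster's search kernel; the bounds check guards
// any extra threads when the grid is rounded up.
__global__ void searchKernel(const int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // ... per-element work ...
    }
}

int main() {
    int n = 8192;
    int minGridSize = 0, blockSize = 0;

    // Ask the runtime for a block size that maximizes occupancy
    // for this specific kernel on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       searchKernel, 0, 0);

    // Round the grid up so every element gets a thread.
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);

    // searchKernel<<<gridSize, blockSize>>>(d_data, n);
    return 0;
}
```

The suggested value is only a starting point based on theoretical occupancy; as the answer above notes, you should still benchmark a few configurations around it, since maximum occupancy does not always mean minimum run time.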
