
How to optimize Conway's game of life for CUDA?

I've written this CUDA kernel for Conway's game of life:

texture<float, 2, cudaReadModeElementType> inputTex;  // 2D texture, bound to a CUDA array (see below)

__global__ void gameOfLife(float* returnBuffer, int width, int height) {
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    float p = tex2D(inputTex, x, y);
    float neighbors = 0;
    neighbors += tex2D(inputTex, x+1, y);
    neighbors += tex2D(inputTex, x-1, y);
    neighbors += tex2D(inputTex, x, y+1);
    neighbors += tex2D(inputTex, x, y-1);
    neighbors += tex2D(inputTex, x+1, y+1);
    neighbors += tex2D(inputTex, x-1, y-1);
    neighbors += tex2D(inputTex, x-1, y+1);
    neighbors += tex2D(inputTex, x+1, y-1);
    __syncthreads();
    float final = 0;
    if(neighbors < 2) final = 0;
    else if(neighbors > 3) final = 0;
    else if(p != 0) final = 1;
    else if(neighbors == 3) final = 1;
    __syncthreads();
    returnBuffer[x + y*width] = final;
}

I am looking for errors/optimizations. Parallel programming is quite new to me and I am not sure I'm going about it the right way.

The rest is a memcpy from an input array into the 2D texture inputTex, which is bound to a CUDA array. The output is memcpy-ed from global memory back to the host and then processed.

As you can see, each thread deals with a single pixel. I am unsure whether that is the fastest way, as some sources suggest doing a row or more per thread. If I understand correctly, NVIDIA themselves say that the more threads, the better. I would love advice on this from someone with practical experience.

My two cents.

The whole thing looks likely to be bound by the latency of communication between the multiprocessors and GPU memory. You have code that should take something like 30-50 clock ticks to execute on its own, yet it generates at least 3 memory accesses that take 200+ clock ticks each if the requisite data is not in the cache.

Using texture memory is a good way to address that, but it is not necessarily the optimal way.

At the very least, try to do 4 pixels at a time (horizontally) per thread. Global memory can be accessed 128 bytes at a time, so as long as you have a warp trying to access any byte in a 128-byte interval, you might as well pull in the whole cache line at almost no additional cost. Since a warp is 32 threads, having each thread work on 4 pixels should be efficient.
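
A minimal sketch of that idea, assuming byte-sized cells in plain global memory and a width divisible by 4. The kernel and buffer names here are made up for illustration, and the neighbor reads still go through global memory on the assumption that they hit the cache:

// Sketch: each thread owns 4 horizontal cells, read and written as one
// uchar4, so a 32-thread warp covers a contiguous 128-byte segment per row.
__global__ void gameOfLife4(const unsigned char* in, unsigned char* out,
                            int width, int height)
{
    int x4 = blockIdx.x * blockDim.x + threadIdx.x;  // index in 4-cell units
    int y  = blockIdx.y * blockDim.y + threadIdx.y;
    if (x4 * 4 >= width || y >= height) return;

    // One coalesced 4-byte read for this thread's own cells.
    uchar4 mid = reinterpret_cast<const uchar4*>(in)[y * (width / 4) + x4];
    unsigned char self[4] = { mid.x, mid.y, mid.z, mid.w };

    unsigned char next[4];
    for (int i = 0; i < 4; ++i) {
        int x = x4 * 4 + i;
        int n = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int nx = x + dx, ny = y + dy;
                if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                    n += in[ny * width + nx];  // neighbor reads should hit cache
            }
        next[i] = (n == 3 || (n == 2 && self[i])) ? 1 : 0;
    }
    // One coalesced 4-byte write per thread.
    reinterpret_cast<uchar4*>(out)[y * (width / 4) + x4] =
        make_uchar4(next[0], next[1], next[2], next[3]);
}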

Furthermore, you want vertically-adjacent pixels to be worked on by the same multiprocessor, because adjacent rows share the same input data. If the pixel (x=0, y=0) is worked on by one MP and the pixel (x=0, y=1) by a different MP, both MPs have to issue three global memory requests each. If they are both worked on by the same MP and the results are properly cached (implicitly or explicitly), you only need a total of four. This can be done by having each thread work on several vertical pixels, or by having blockDim.y > 1.
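
For instance, a 32x8 block combines both points; this hypothetical launch assumes the gameOfLife4 sketch above and device buffers d_in/d_out:

dim3 block(32, 8);  // 32 threads x 4 cells = one 128-byte row segment per warp
dim3 grid((width / 4 + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
gameOfLife4<<<grid, block>>>(d_in, d_out, width, height);

With blockDim.y = 8, eight vertically-adjacent rows land on the same MP and share cached input instead of re-fetching it.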

More generally, you'd probably want each 32-thread warp to load as much memory as you have available on the MP (16-48 KB, or at least a 128x128 block), and then process all pixels within that window.

On devices of compute capability below 2.0, you'll want to use shared memory for this. On devices of compute capability 2.0 and 2.1, caching capabilities are much improved, so global memory may be fine.
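
A sketch of the shared-memory approach for such devices, again with made-up names and byte cells, launched with a 16x16 block: each block stages its tile plus a one-cell halo on-chip, then computes every cell from shared memory only.

#define BLOCK 16

__global__ void gameOfLifeShared(const unsigned char* in, unsigned char* out,
                                 int width, int height)
{
    // (BLOCK+2)^2 tile: the block's cells plus a one-cell halo on each side.
    __shared__ unsigned char tile[BLOCK + 2][BLOCK + 2];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // Cooperatively load the tile (halo included); some threads load twice.
    for (int ty = threadIdx.y; ty < BLOCK + 2; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < BLOCK + 2; tx += blockDim.x) {
            int gx = blockIdx.x * BLOCK + tx - 1;
            int gy = blockIdx.y * BLOCK + ty - 1;
            tile[ty][tx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                         ? in[gy * width + gx] : 0;
        }
    __syncthreads();  // here the barrier actually guards shared memory

    if (x >= width || y >= height) return;

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;  // skip the halo
    int n = tile[ty-1][tx-1] + tile[ty-1][tx] + tile[ty-1][tx+1]
          + tile[ty  ][tx-1]                  + tile[ty  ][tx+1]
          + tile[ty+1][tx-1] + tile[ty+1][tx] + tile[ty+1][tx+1];
    out[y * width + x] = (n == 3 || (n == 2 && tile[ty][tx])) ? 1 : 0;
}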

Some nontrivial savings could be had by making sure that each warp only accesses two cache lines in each horizontal row of input pixels, instead of the three that a naive implementation working on 4 pixels per thread, 32 threads per warp, would touch.

There's no good reason to use float as the buffer type. Not only do you end up with four times the memory bandwidth, the code also becomes unreliable and bug-prone. (For example, are you sure that if(neighbors == 3) works correctly, given that you're comparing a float and an integer?) Use unsigned char. Better yet, use uint8_t and typedef it to unsigned char if it's not defined.
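
A sketch of the original kernel reworked with byte cells; it assumes the texture is re-declared and the CUDA array rebound with an unsigned char channel format. The __syncthreads() calls are also dropped, since the kernel uses no shared memory and they synchronize nothing:

texture<unsigned char, 2, cudaReadModeElementType> inputTexU8;

__global__ void gameOfLifeU8(unsigned char* returnBuffer, int width, int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (unsigned)width || y >= (unsigned)height) return;

    int p = tex2D(inputTexU8, x, y);
    // Like the original, edge handling relies on the texture's clamp addressing.
    int n = tex2D(inputTexU8, x + 1, y)     + tex2D(inputTexU8, x - 1, y)
          + tex2D(inputTexU8, x,     y + 1) + tex2D(inputTexU8, x,     y - 1)
          + tex2D(inputTexU8, x + 1, y + 1) + tex2D(inputTexU8, x - 1, y - 1)
          + tex2D(inputTexU8, x - 1, y + 1) + tex2D(inputTexU8, x + 1, y - 1);

    // Integer comparisons are exact: no float equality worries, and the
    // buffers are a quarter of the size.
    returnBuffer[x + y * width] = (n == 3 || (n == 2 && p)) ? 1 : 0;
}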

Finally, don't underestimate the value of experimenting. Quite often the performance of CUDA code can't be easily explained by logic, and you have to resort to tweaking parameters and seeing what happens.

TL;DR: see http://golly.sourceforge.net

The problem is that most CUDA implementations follow the brain-dead idea of manually counting the neighbors. This is so dead slow that any smart serial CPU implementation will outperform it.

The only sensible way to do GoL calculations is using a lookup table.
The currently fastest implementations on a CPU look up a square 4x4 = 16-bit block to get the future 2x2 cells inside.

In this setup the cells are laid out like so:

 01234567
0xxxxxxxx //byte0
1xxxxxxxx //byte1 
2  etc
3
4
5
6
7

Some bit-shifting is employed to get a 4x4 block to fit into a word, and that word is looked up in a lookup table. The lookup table holds words as well; this way four differently-shifted versions of the outcome can be stored in the table, so you can minimize the amount of bit-shifting that needs to be done on the input and/or the output.
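
A host-side sketch of how such a table can be built and what it stores; for simplicity it keeps just the raw 4-bit 2x2 result per 16-bit block, rather than the four pre-shifted word variants described above:

#include <stdint.h>

static uint8_t lut[1 << 16];  // index: 16-bit 4x4 block; value: 4-bit inner 2x2 result

// Bit (x, y) of a 4x4 block stored row-major in a 16-bit word.
static int bit(unsigned block, int x, int y) { return (block >> (y * 4 + x)) & 1; }

static void buildLut(void)
{
    for (unsigned b = 0; b <= 0xFFFF; ++b) {
        uint8_t result = 0;
        // Only the inner 2x2 cells, (1,1),(2,1),(1,2),(2,2), have all eight
        // neighbors inside the block, so only they can be advanced exactly.
        for (int y = 1; y <= 2; ++y)
            for (int x = 1; x <= 2; ++x) {
                int n = 0;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        if (dx || dy) n += bit(b, x + dx, y + dy);
                int next = (n == 3 || (n == 2 && bit(b, x, y)));
                result |= (uint8_t)(next << ((y - 1) * 2 + (x - 1)));
            }
        lut[b] = result;
    }
}

Advancing the grid then reduces to assembling each 16-bit block with shifts and masks and writing lut[block], with no per-cell neighbor counting at all.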

In addition, the different generations are staggered, so that you only have to look at 4 neighboring slabs instead of 9. Like so:

AAAAAAAA 
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
           BBBBBBBB
//odd generations (A) are 1 pixel above and to the right of B,
//even generations (B) are 1 pixel below and to the left of A.

This alone results in a 1000x+ speed-up compared to silly counting implementations.

Then there is the optimization of not calculating slabs that are static or have a periodicity of 2.

And then there is HashLife, but that's a different beast altogether.
HashLife can generate Life patterns in O(log n) time, instead of the O(n) time normal implementations need. This allows you to calculate generation 6,366,548,773,467,669,985,195,496,000 (6 octillion) in mere seconds.
Unfortunately, HashLife requires recursion, and is thus difficult on CUDA.

Have a look at this thread; we did a lot of improvements over there...

http://forums.nvidia.com/index.php?showtopic=152757&st=60
