Per-thread hashtable-like data structure implementation in CUDA

Original post: 2011-07-24 19:53:27 | Tags: c, cuda, parallel-processing

Short version of my question: I have a CUDA program where each thread needs to store numbers in different "bins", and I identify each bin by an integer. In a typical run, each CUDA thread might store numbers in only about 100 out of millions of bins, so I'd like to know whether there is a data structure other than an array that would let me hold this data. Each thread would have its own copy of the structure. If I were programming in Python, I would just use a dictionary keyed by bin number, for example mydict[0] = 1.0, mydict[2327632] = 3.0, and at the end of the run I would iterate over the keys and process them (bins that never received a number are simply absent from the dictionary, so they are skipped). I tried implementing a hash table for every thread in my CUDA program and it killed performance.

Long version: I have a CUDA Monte Carlo simulation that transports particles through a voxelized geometry (a grid of simple volume elements). The particles deposit energy during transport, and this energy is tallied voxel by voxel. The voxels are represented as a linearized 3D grid, which is quite large: around 180^3 elements. Each CUDA thread transports 1-100 particles, and I usually try to maximize the number of threads I launch my kernel with (currently 384*512 threads). The energy deposited in a given voxel is added with atomicAdd to the linearized 3D grid, which resides in global memory.
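To make the layout concrete, here is a minimal sketch of the indexing and deposit step described above. It is not the asker's code: `GridDims`, `voxel_index`, `deposit`, and the x-fastest ordering are assumptions made for illustration. The `HD` macro lets the same source build with a plain C compiler (for testing on the host) or with nvcc; the `atomicAdd` branch only exists in device code and assumes a GPU that supports floating-point atomic adds.

```c
#include <stddef.h>

#ifdef __CUDACC__
#define HD __host__ __device__   /* built by nvcc: callable from kernels */
#else
#define HD                       /* plain C compiler: host-only build */
#endif

/* Hypothetical grid description; 180 per side matches the question. */
typedef struct {
    int nx, ny, nz;
} GridDims;

/* Map (x, y, z) to an offset in the linearized grid, x varying fastest. */
HD static size_t voxel_index(GridDims d, int x, int y, int z)
{
    return ((size_t)z * d.ny + y) * d.nx + x;
}

/* Add `energy` to one voxel of the global dose grid. */
HD static void deposit(float *dose, size_t voxel, float energy)
{
#ifdef __CUDA_ARCH__
    atomicAdd(&dose[voxel], energy);  /* many threads may hit this voxel */
#else
    dose[voxel] += energy;            /* single-threaded host fallback */
#endif
}
```

Inside a kernel, a thread would call `deposit(dose, voxel_index(dims, x, y, z), e)` every time its particle loses energy `e` in voxel `(x, y, z)`.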

I'm running into problems with the part of my simulation that calculates uncertainties. For a given particle, I have to keep track of where it deposits energy (which voxel indices) and how much energy it deposits in each of those voxels, so that I can square that per-voxel total at the end of the particle's transport before moving on to a new particle. Since I assign each thread one (or a few) particles, this information has to be stored at per-thread scope. The reason I only hit this problem in the uncertainty calculation is that energy deposition can simply be done as an atomic operation on a global variable every time a thread deposits energy, whereas the uncertainty calculation has to be done at the end of a particle's transport, so each thread somehow has to keep track of the "history" of its assigned particles.
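A small worked example shows why the per-particle history is needed. If one particle deposits energy twice in the same voxel, squaring the total at the end of the history is not the same as squaring each deposit as it happens. The function names below are hypothetical and the snippet is only an illustration of the arithmetic.

```c
#include <stddef.h>

/* Square of one particle's total deposit in a single voxel. */
static float history_square(const float *deposits, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += deposits[i];
    return total * total;
}

/* What squaring each deposit immediately would give instead. */
static float sum_of_squares(const float *deposits, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += deposits[i] * deposits[i];
    return s;
}
```

For deposits of 1.0 and 2.0 in the same voxel, `history_square` gives 9.0 while `sum_of_squares` gives 5.0, so the squared term cannot be accumulated with an atomic add at each deposit.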

My first idea was to implement a hash table whose key would be the linearized voxel index and whose value would be the energy deposited. After a particle is done transporting, I would square every value in that hash table and add it to a global uncertainty grid. I tried to use uthash for this, but it destroyed the performance of my code. I'm guessing it caused a huge amount of thread divergence.
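For comparison, below is a minimal sketch of a fixed-capacity, open-addressing table that avoids dynamic allocation and pointer chasing. It is one possible shape for the idea above, not the asker's code and not the uthash API. The names (`Tally`, `tally_add`, `tally_flush`), the capacity of 256 slots, and the multiplicative hash are all assumptions; the capacity is chosen only because the question mentions roughly 100 bins per thread.

```c
#include <stdint.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

#define TALLY_BITS     8u
#define TALLY_CAPACITY (1u << TALLY_BITS)  /* 256 slots, assumed sizing */
#define TALLY_EMPTY    0xFFFFFFFFu         /* sentinel: not a valid voxel */

typedef struct {
    uint32_t key[TALLY_CAPACITY];  /* linearized voxel index */
    float    val[TALLY_CAPACITY];  /* energy deposited by this history */
} Tally;

HD static void tally_clear(Tally *t)
{
    for (uint32_t i = 0; i < TALLY_CAPACITY; ++i) {
        t->key[i] = TALLY_EMPTY;
        t->val[i] = 0.0f;
    }
}

/* Accumulate energy for a voxel; returns 0 if the table is full. */
HD static int tally_add(Tally *t, uint32_t voxel, float energy)
{
    /* Multiplicative hash: keep the top TALLY_BITS bits of the product. */
    uint32_t slot = (voxel * 2654435761u) >> (32u - TALLY_BITS);
    for (uint32_t probe = 0; probe < TALLY_CAPACITY; ++probe) {
        if (t->key[slot] == TALLY_EMPTY)
            t->key[slot] = voxel;          /* claim an empty slot */
        if (t->key[slot] == voxel) {
            t->val[slot] += energy;
            return 1;
        }
        slot = (slot + 1u) & (TALLY_CAPACITY - 1u);  /* linear probing */
    }
    return 0;
}

/* End of a history: add each squared total to the grid, then reset. */
HD static void tally_flush(Tally *t, float *sq_grid)
{
    for (uint32_t i = 0; i < TALLY_CAPACITY; ++i) {
        if (t->key[i] == TALLY_EMPTY)
            continue;
        float e2 = t->val[i] * t->val[i];
#ifdef __CUDA_ARCH__
        atomicAdd(&sq_grid[t->key[i]], e2);
#else
        sq_grid[t->key[i]] += e2;          /* host fallback */
#endif
        t->key[i] = TALLY_EMPTY;
        t->val[i] = 0.0f;
    }
}
```

A thread would call `tally_clear` once, `tally_add` on every deposit, and `tally_flush` when a particle finishes. Whether this performs acceptably on a GPU depends on where the table ends up living (a 2 KB per-thread struct will not fit in registers), so treat it as a starting point for measurement rather than a recommendation.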

I could simply use two dynamic arrays, one storing the voxel index and the other storing the energy deposited in that voxel, but I suspect that would also be very bad for performance. I'm hoping there is a data structure I don't know about that lends itself well to being used in a CUDA program. I have also included many details in case my whole approach to the problem is wrong.
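The two-array idea can be sketched with fixed-size arrays instead of dynamic ones, which sidesteps per-thread allocation. This is only an illustration: `TouchList`, `touch_add`, and the bound of 128 entries are assumptions, and the backwards search is a guess that a particle often deposits again in a voxel it touched recently.

```c
#include <stdint.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

#define MAX_TOUCHED 128  /* assumed bound on voxels hit by one history */

typedef struct {
    int      count;                /* number of entries in use */
    uint32_t voxel[MAX_TOUCHED];   /* linearized voxel indices */
    float    energy[MAX_TOUCHED];  /* energy deposited in voxel[i] */
} TouchList;

/* Accumulate energy for a voxel; returns 0 if the list is full. */
HD static int touch_add(TouchList *l, uint32_t v, float e)
{
    /* Search newest entries first. */
    for (int i = l->count - 1; i >= 0; --i) {
        if (l->voxel[i] == v) {
            l->energy[i] += e;
            return 1;
        }
    }
    if (l->count == MAX_TOUCHED)
        return 0;
    l->voxel[l->count] = v;
    l->energy[l->count] = e;
    l->count++;
    return 1;
}
```

With about 100 entries a linear search is at most about 100 comparisons per deposit; resetting the list between particles is just `l->count = 0`.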

Thank you

1 Answer

Your question is a bit jargon-heavy. If you can distill out the science and leave just the computer science, you might get more answers.

CUDA hash tables have been implemented. The work at that link will be included in the 2.0 release of the CUDPP library. It is already working in the SVN trunk of CUDPP, if you would like to try it.

That said, if you really only need per-thread storage, and not shared storage, you might be able to do something much simpler, such as some per-thread scratch space (in shared or global memory) or a local array.
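As one way to picture per-thread scratch space in global memory, the sketch below computes where a thread's i-th entry would live in a single preallocated buffer. The interleaved layout (entry `i` of all threads stored contiguously) is an assumption chosen so that neighbouring threads touch neighbouring addresses; the function names and the 8-byte entry (a 32-bit voxel index plus a float) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* Offset of entry `i` owned by thread `tid`, interleaved by thread. */
HD static size_t scratch_offset(size_t i, size_t tid, size_t num_threads)
{
    return i * num_threads + tid;
}

/* Total bytes for one voxel-index array plus one energy array. */
HD static size_t scratch_bytes(size_t num_threads, size_t entries_per_thread)
{
    return num_threads * entries_per_thread
           * (sizeof(uint32_t) + sizeof(float));
}
```

In a kernel, `tid` would typically be `blockIdx.x * blockDim.x + threadIdx.x`. With the 384*512 threads mentioned in the question and an assumed 128 entries per thread, `scratch_bytes` comes to 201,326,592 bytes (192 MiB), so the per-thread capacity has to be weighed against available device memory.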