
Per-thread hashtable-like data structure implementation in CUDA

Short version of my question: I have a CUDA program where each thread needs to store numbers in different "bins", and I identify each bin by an integer. In a typical run, each CUDA thread might store numbers in only about 100 out of millions of bins, so I'd like to know whether there is a data structure other than an array that would let me hold this data. Each thread would have its own copy of this structure. If I were programming in Python, I would just use a dictionary keyed by bin number, for example mydict[0] = 1.0, mydict[2327632] = 3.0, and at the end of the run I would look at the keys and do something with them (ignoring the bins with nothing stored in them, since they aren't in the dictionary). I tried implementing a hash table for every thread in my CUDA program and it killed performance.

Long version: I have a CUDA Monte Carlo simulation that transports particles through a voxelized geometry (voxels are simple volume elements). The particles deposit energy during their transport, and this energy is tallied voxel by voxel. The voxels are represented as a linearized 3D grid which is quite large, around 180^3 elements. Each CUDA thread transports 1-100 particles, and I usually try to maximize the number of threads I launch my kernel with (currently 384*512 threads). The energy deposited in a given voxel is added with atomicAdd to the linearized 3D grid, which resides in global memory.
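For readers who want something concrete, the tally described above could look roughly like the sketch below. The names linear_index, deposit and tally_kernel are illustrative and not taken from my real code, the x-fastest ordering is an assumption, and the kernel body is a placeholder that deposits one fixed amount per thread instead of running a transport loop. The HD macro is only there so that the index helper also compiles as plain C++.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Map a 3D voxel coordinate to its offset in the linearized grid.
// Assumed layout: x varies fastest, then y, then z.
HD inline std::size_t linear_index(int ix, int iy, int iz, int nx, int ny)
{
    return (static_cast<std::size_t>(iz) * ny + iy) * nx + ix;
}

#ifdef __CUDACC__
// Add `energy` to one voxel of the global dose grid. atomicAdd is needed
// because many threads may deposit in the same voxel at the same time.
__device__ void deposit(float* dose_grid, int ix, int iy, int iz,
                        int nx, int ny, float energy)
{
    atomicAdd(&dose_grid[linear_index(ix, iy, iz, nx, ny)], energy);
}

// Placeholder kernel (hypothetical): every thread deposits 1.0f in voxel
// (0, 0, 0). A real kernel would call deposit() from its transport loop.
__global__ void tally_kernel(float* dose_grid, int nx, int ny)
{
    deposit(dose_grid, 0, 0, 0, nx, ny, 1.0f);
}
#endif
```

For a 180 x 180 x 180 grid, linear_index(179, 179, 179, 180, 180) is the last element, 180^3 - 1.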

I'm running into problems with the part of my simulation that calculates uncertainties. For a given particle, I have to keep track of where (which voxel indices) it deposits energy, and how much energy it deposits in each voxel, so that I can square each voxel's total at the end of that particle's transport before moving on to a new particle. Since I assign each thread one (or a few) particles, this information has to be stored per thread. I only run into this problem with the uncertainty calculation because energy deposition can be done as an atomic operation on global memory every time a thread deposits energy, whereas the uncertainty calculation has to be done at the end of a particle's transport, so each thread somehow has to keep track of the "history" of its assigned particles.

My first idea was to implement a hash table whose key is the linearized voxel index and whose value is the energy deposited; after a particle is done transporting, I would square every element in that hash table and add it to a global uncertainty grid. I tried to use uthash for this, but it destroyed the performance of my code. I'm guessing it caused a huge amount of thread divergence.
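To make the hash-table idea concrete, here is a minimal sketch of a fixed-capacity, open-addressing table with linear probing that a thread could keep in a local variable. This is not uthash and not what I actually ran; HistoryTable, flush_squares, the capacity of 256 and the multiplicative hash are all assumptions chosen for illustration. It avoids dynamic allocation and pointer chasing, but at 2 KB per thread it will most likely be placed in local (off-chip) memory.

```cpp
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed sizing: 256 slots, comfortably above ~100 voxels per history.
constexpr int HT_LOG2_CAPACITY = 8;
constexpr int HT_CAPACITY = 1 << HT_LOG2_CAPACITY;
constexpr std::uint32_t HT_EMPTY = 0xFFFFFFFFu;  // assumed to be no valid voxel index

// Per-thread map: linearized voxel index -> energy from the current particle.
struct HistoryTable {
    std::uint32_t keys[HT_CAPACITY];
    float         vals[HT_CAPACITY];

    HD static std::uint32_t slot_of(std::uint32_t voxel)
    {
        // Multiplicative hash; keep the top HT_LOG2_CAPACITY bits.
        return (voxel * 2654435761u) >> (32 - HT_LOG2_CAPACITY);
    }

    HD void clear()
    {
        for (int i = 0; i < HT_CAPACITY; ++i) { keys[i] = HT_EMPTY; vals[i] = 0.0f; }
    }

    // Accumulate `energy` for `voxel`. Returns false if the table is full.
    HD bool add(std::uint32_t voxel, float energy)
    {
        std::uint32_t slot = slot_of(voxel);
        for (int probe = 0; probe < HT_CAPACITY; ++probe) {
            if (keys[slot] == HT_EMPTY) keys[slot] = voxel;       // claim free slot
            if (keys[slot] == voxel) { vals[slot] += energy; return true; }
            slot = (slot + 1) & (HT_CAPACITY - 1);                // linear probing
        }
        return false;
    }

    // Energy stored for `voxel`, or 0 if the voxel was never touched.
    HD float get(std::uint32_t voxel) const
    {
        std::uint32_t slot = slot_of(voxel);
        for (int probe = 0; probe < HT_CAPACITY; ++probe) {
            if (keys[slot] == voxel) return vals[slot];
            if (keys[slot] == HT_EMPTY) return 0.0f;
            slot = (slot + 1) & (HT_CAPACITY - 1);
        }
        return 0.0f;
    }
};

#ifdef __CUDACC__
// End of a history: add the square of each voxel's total to the global
// grid of squared deposits, then reset the table for the next particle.
__device__ void flush_squares(HistoryTable& table, float* squared_grid)
{
    for (int i = 0; i < HT_CAPACITY; ++i) {
        if (table.keys[i] != HT_EMPTY)
            atomicAdd(&squared_grid[table.keys[i]], table.vals[i] * table.vals[i]);
    }
    table.clear();
}
#endif
```

Usage would mirror the Python dictionary from the short version: call clear() once, add(voxel, energy) on every deposit, and flush_squares() when the particle is done. Clearing and flushing both scan all 256 slots, so the capacity is a trade-off between probe length and per-history overhead.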

I could simply use two dynamic arrays, one storing the voxel index and the other storing the energy deposited in that voxel, but I suspect that would also be very bad for performance. I'm hoping there is a data structure I don't know about that lends itself well to a CUDA program. I have also included many details in case my whole approach to the problem is wrong.
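For comparison, the two-array variant could be sketched as below, with fixed-size arrays instead of dynamic ones. HistoryList, flush_squares and the bound of 128 entries are hypothetical, and searching newest-first rests on the untested assumption that a particle tends to deposit repeatedly in voxels it visited recently.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int MAX_ENTRIES = 128;  // assumed upper bound on voxels per history

// Per-thread list of (voxel, energy) pairs for the current particle.
struct HistoryList {
    int   count;
    int   voxel[MAX_ENTRIES];
    float energy[MAX_ENTRIES];

    HD void clear() { count = 0; }

    // Accumulate `e` for voxel `v`. Returns false if the list is full.
    HD bool add(int v, float e)
    {
        // Linear search, newest entry first.
        for (int i = count - 1; i >= 0; --i) {
            if (voxel[i] == v) { energy[i] += e; return true; }
        }
        if (count == MAX_ENTRIES) return false;
        voxel[count]  = v;
        energy[count] = e;
        ++count;
        return true;
    }
};

#ifdef __CUDACC__
// End of a history: add each squared total to the global grid, then reset.
__device__ void flush_squares(HistoryList& list, float* squared_grid)
{
    for (int i = 0; i < list.count; ++i)
        atomicAdd(&squared_grid[list.voxel[i]], list.energy[i] * list.energy[i]);
    list.clear();
}
#endif
```

Unlike a hash table, clearing is a single assignment and flushing touches only the entries actually used; the price is a search that grows with the number of distinct voxels in the history.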

Thank you

Your question is a bit jargon-heavy. If you can distill out the science and leave just the computer science, you might get more answers.

CUDA hash tables have been implemented. The work at that link will be included in the 2.0 release of the CUDPP library. It is already working in the SVN trunk of CUDPP, if you would like to try it.

That said, if you really only need per-thread storage, and not shared storage, you might be able to do something much simpler, such as per-thread scratch space (in shared or global memory) or a local array.
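As a rough illustration of per-thread scratch space in global memory, the sketch below interleaves the slots so that slot k of neighbouring threads sits at neighbouring addresses. The names scratch_offset and clear_scratch are made up for this example, and the layout is one possible choice rather than a recommendation measured on your code.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Offset of slot `k` belonging to thread `tid` in a scratch buffer holding
// n_threads * slots_per_thread elements. Slot-major (interleaved) layout:
// when the threads of a warp all access their slot k, the addresses are
// contiguous.
HD inline std::size_t scratch_offset(std::size_t tid, std::size_t k,
                                     std::size_t n_threads)
{
    return k * n_threads + tid;
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread zeroes its own scratch slots.
__global__ void clear_scratch(float* scratch, int slots_per_thread)
{
    std::size_t tid = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    std::size_t n_threads = static_cast<std::size_t>(gridDim.x) * blockDim.x;
    for (int k = 0; k < slots_per_thread; ++k)
        scratch[scratch_offset(tid, k, n_threads)] = 0.0f;
}
#endif
```

Keep the total size in mind: with 384*512 threads and, say, 128 (index, energy) pairs of 8 bytes per thread, the buffer is about 200 MB.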
