
Transferring large variable amount of memory from Cuda

Cuda is awesome and I'm using it like crazy, but I'm not using its full potential because I'm having an issue transferring memory, and I was wondering if there is a better way to get a variable amount of memory out. Basically, I send a 65,535-item array into Cuda, and Cuda analyzes each data item in around 20,000 different ways; if there's a match in my program's logic, it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total, and if the total is equal to a number I'm looking for, then it saves the results (which is a 30-int list for each analyzed item).

The problem is 65535 (blocks/items in the data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to handle the chance that all the data will be a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy for memory: at 4 bytes per int that's roughly 157 GB). I've been forced to make it smaller and send fewer blocks to process, because I don't know if Cuda can write efficiently to a linked list or a dynamically sized list (with this approach it writes the output to host memory indexed by block * number_of_different_way_tests).

Is there a better way to do this? Can Cuda somehow write to free memory that is not derived from the block id? When I test this process on the CPU, less than 10% of the item array has a positive match, so it's extremely unlikely I'll use that much memory each time I send work to the kernel.

p.s. I'm looking above, and although it's exactly what I'm doing, if it's confusing, then another way of thinking about it (not exactly what I'm doing, but good enough to understand the problem) is that I am sending 20,000 arrays (each containing 65,535 items), adding each item with its peer in the other arrays, and if the total equals a number (say 200-210), then I want to know the numbers it added to get that matching result. If the numbers range very widely, then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way Cuda organizes and runs the blocks). Are there any Cuda or C tricks I can use for this, or am I stuck with mallocing the maximum possible results (and buying a lot more memory)?

As per Roger Dahl's great answer: The functionality you're looking for is called stream compaction.

You probably do need to provide an array that contains room for 4 solutions per thread, because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained by copying less data back to the host is lost to a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
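A minimal kernel sketch of that atomicAdd() pattern might look like the following. The names and the testItem() matching function are hypothetical stand-ins for the question's logic; the point is only how the atomically incremented counter hands each matching thread its own slot:

```cuda
// Hypothetical sketch: each thread tests one item and, on a match, claims a
// slot in a compact output buffer by atomically incrementing a shared counter.
__device__ bool testItem(int item, int *candidate);  // your matching logic

__global__ void findMatches(const int *items, int n,
                            int *results,      // worst case: room for n rows of 30 ints
                            int *resultCount)  // zeroed on the host before launch
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    int candidate[30];
    if (testItem(items[idx], candidate)) {
        // atomicAdd() returns the old value, which becomes this thread's slot
        int slot = atomicAdd(resultCount, 1);
        for (int i = 0; i < 30; ++i)
            results[slot * 30 + i] = candidate[i];
    }
}
```

After the kernel finishes, the host copies back *resultCount and then transfers only resultCount * 30 ints, instead of the full worst-case buffer.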

However, given the more common situation, where there's a fair number of results, the best solution will be to perform the compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
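As a rough sketch of that separate compaction step, assuming each thread sets a per-slot flag to 1 when its row of the result buffer is valid, thrust::copy_if with a stencil can gather the indices of the valid rows entirely on the device:

```cuda
// Sketch (assumed setup): 'flags' holds 1 where the corresponding row of the
// sparse result buffer contains a valid 30-int result, 0 otherwise.
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

struct is_set
{
    __host__ __device__ bool operator()(int f) const { return f != 0; }
};

// Returns the indices of the valid rows; each index then identifies one
// 30-int record worth copying back to the host.
thrust::device_vector<int> compactIndices(const thrust::device_vector<int> &flags)
{
    thrust::device_vector<int> out(flags.size());
    thrust::counting_iterator<int> first(0);
    // copy_if keeps index i only where flags[i] passes the predicate
    auto end = thrust::copy_if(first, first + flags.size(),
                               flags.begin(),   // stencil
                               out.begin(), is_set());
    out.resize(end - out.begin());
    return out;
}
```

This keeps the kernel free of cross-thread dependencies: threads write to their own fixed slots, and the compaction pass afterwards decides what is worth transferring.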
