
CUDA Thrust memory allocation issue

I have a Thrust code which loads a big array of data (2.4G) into memory, performs calculations whose results are stored on the host (~1.5G), then frees the initial data, loads the results into the device, performs other calculations on them, and finally reloads the initial data. The Thrust code looks like this:

thrust::host_vector<float> hostData;
// here is a code which loads ~2.4G of data into hostData
thrust::device_vector<float> deviceData = hostData;
thrust::host_vector<float> hostResult;
// here is a code which perform calculations on deviceData and copies the result to hostResult (~1.5G)
free<thrust::device_vector<float> >(deviceData);
thrust::device_vector<float> deviceResult = hostResult;
// here is code which performs calculations on deviceResult and store some results also on the device
free<thrust::device_vector<float> >(deviceResult);
deviceData = hostData;

With my free function defined as:

template<class T> void free(T &V) {
    V.clear();
    V.shrink_to_fit();
    size_t mem_tot;
    size_t mem_free;
    cudaMemGetInfo(&mem_free, &mem_tot);
    std::cout << "Free memory : " << mem_free << std::endl;
}

template void free<thrust::device_vector<int> >(thrust::device_vector<int>& V);
template void free<thrust::device_vector<float> >(
    thrust::device_vector<float>& V);

However, I get a "thrust::system::detail::bad_alloc' what(): std::bad_alloc: out of memory" error when trying to copy hostData back to deviceData, even though cudaMemGetInfo reports that at this point I have ~6G of free memory on my device. Here is the complete output from the free method:

Free memory : 6295650304
Free memory : 6063775744
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what():  std::bad_alloc: out of memory

It seems to indicate that the device is out of memory although there is plenty free. Is this the right way to free memory for Thrust vectors? I should also note that the code works well for smaller data sizes (up to 1.5G).

It would be useful to see a complete, compilable reproducer. However, you're probably running into memory fragmentation.

Even though a large amount of memory may be reported as being free, it's possible that it can't be allocated in a single large contiguous chunk. This fragmentation will then limit the maximum size of a single allocation that you can request.
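One way to see this effect directly is to probe for the largest single allocation that still succeeds. The sketch below is my own illustration, not code from the question; it assumes only the CUDA runtime API (cudaMemGetInfo / cudaMalloc), and the 64 MB back-off step is an arbitrary choice.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t mem_free = 0, mem_tot = 0;
    cudaMemGetInfo(&mem_free, &mem_tot);

    const size_t step = 64ull * 1024 * 1024; // back off 64 MB per attempt (arbitrary)
    size_t request = mem_free;
    void *p = NULL;
    // Try progressively smaller single allocations until one succeeds.
    while (request > step && cudaMalloc(&p, request) != cudaSuccess) {
        cudaGetLastError(); // clear the error left by the failed allocation
        p = NULL;
        request -= step;
    }
    if (p) {
        cudaFree(p);
    } else {
        request = 0;
    }
    printf("Reported free: %zu bytes, largest single allocation found: ~%zu bytes\n",
           mem_free, request);
    return 0;
}

On a fragmented device the "largest single allocation found" can be noticeably smaller than the reported free memory.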

It's probably not really a question of how you are freeing memory, but more a function of what overhead allocations remain after you free the memory. The fact that you are checking the mem info and getting a large number back says to me that you are freeing your allocations correctly.

To try to work around this, one approach would be to manage and re-use your allocations carefully. For instance, if you need a large 2.4G working device vector of float, allocate it once and re-use it for successive operations. Also, if you have any remaining allocations on the device immediately before you try to re-allocate the 2.4G vector, then try freeing those (i.e. free all allocations you have made on the device) before re-allocating it.
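As an illustration of that re-use pattern, here is a sketch only: the pipeline function, the placeholder calculation steps, and the assumption that the result size is known up front are all hypothetical stand-ins for the question's actual code. A single device buffer sized for the largest stage is allocated once and overwritten in place, so the 2.4G allocation is never repeated.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <algorithm>

void pipeline(const thrust::host_vector<float> &hostData,
              thrust::host_vector<float>       &hostResult)
{
    // One working buffer big enough for either the ~2.4G input
    // or the ~1.5G intermediate result; allocated exactly once.
    thrust::device_vector<float> work(std::max(hostData.size(), hostResult.size()));

    // Stage 1: load the input and run the first calculations on it.
    thrust::copy(hostData.begin(), hostData.end(), work.begin());
    // ... first set of calculations on work[0, hostData.size()) ...
    thrust::copy(work.begin(), work.begin() + hostResult.size(), hostResult.begin());

    // Stage 2: overwrite the same buffer with the result instead of
    // freeing the first vector and allocating a second one.
    thrust::copy(hostResult.begin(), hostResult.end(), work.begin());
    // ... second set of calculations on work[0, hostResult.size()) ...

    // Stage 3: reload the initial data into the same buffer.
    thrust::copy(hostData.begin(), hostData.end(), work.begin());
    // ... remaining calculations ...
}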

I'm providing this answer as I came across this question when searching for answers to the same error message / problem.

Robert Crovella's excellent answer is certainly correct. However, it may be useful for others to know that when creating/requesting a device_vector, the capacity of the device_vector actually allocated can be far greater than the size of the device_vector requested.

This answer, Understanding Thrust (CUDA) memory usage, explains in much better detail why Thrust behaves this way.

In my case, on Ubuntu 16.04 with a Quadro K1200 and CUDA toolkit 8.0, requesting a device_vector of size 67108864 (doubles) resulted in a device_vector with a capacity 8x larger (536870912) being allocated.

Requested (R) | Capacity (C)  | Total Mem  | Free Mem   | C/Free   | R/C
67108864      | 536870912     | 4238540800 | 3137077248 | 0.171137 | 0.125

The output above was from modifying some very helpful code in the answer I linked to.
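For reference, a simplified sketch of that kind of measurement is below. This is my own reconstruction, not the linked code; it assumes thrust::device_vector exposes size() and capacity() like std::vector (which it does in the Thrust releases I have used), and it reports the raw numbers rather than the derived ratio columns of the table above.

#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    size_t free_before = 0, free_after = 0, total = 0;
    cudaMemGetInfo(&free_before, &total);

    const size_t requested = 67108864;          // number of doubles requested
    thrust::device_vector<double> v(requested); // allocation under test

    cudaMemGetInfo(&free_after, &total);

    std::cout << "Requested (R): " << v.size()
              << " | Capacity (C): " << v.capacity()
              << " | Total Mem: " << total
              << " | Free Mem: " << free_after
              << " | device bytes consumed: " << (free_before - free_after)
              << std::endl;
    return 0;
}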
