How to release the GPU memory used by Numba cuda?

x_cpu, y_cpu, and z_cpu are large NumPy arrays of the same length. Result is the grid result that reduces the x, y, z resolution, keeping only one point per grid cell. They cannot all fit into GPU memory together, so I divided x, y, z into several segments while still keeping the whole Result in GPU memory:

from numba import cuda
from math import ceil

SegmentSize = 1000000
Loops = ceil(len(x_cpu) / SegmentSize)
Result = cuda.device_array((maxX-minX,maxY-minY))
for lopIdx in range(Loops):
    x = cuda.to_device(x_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    y = cuda.to_device(y_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    z = cuda.to_device(z_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    CudaProc[blocks, 1024](x,y,z, Result)
    cuda.synchronize()
Result_CPU = Result.copy_to_host()

But when I did this, an unknown CUDA error was raised. I noticed that the occupied GPU memory kept rising. I think this is because the loop keeps writing new x, y, z arrays into GPU memory without releasing the previous ones. I couldn't find much information about how to release GPU memory. Can anyone help?
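
For reference, a commonly suggested workaround (not part of the original post) is to drop the Python references at the end of each loop trip and then flush Numba's pending-deallocation queue. Note that deallocations.clear() reaches into Numba internals rather than a documented public API, so treat this as a hedged sketch that may break across Numba versions; Loops, SegmentSize, blocks, CudaProc, and Result are taken from the question's snippet:

from numba import cuda

for lopIdx in range(Loops):
    x = cuda.to_device(x_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    y = cuda.to_device(y_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    z = cuda.to_device(z_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    CudaProc[blocks, 1024](x, y, z, Result)
    cuda.synchronize()
    del x, y, z  # drop the only references so the device buffers become garbage
    # Ask Numba to free pending garbage now instead of "eventually".
    # Context.deallocations is an internal attribute of the CUDA context.
    cuda.current_context().deallocations.clear()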

You are pretty much at the mercy of standard Python object lifetime semantics and Numba internals (which are terribly documented) when it comes to GPU memory management in Numba. The best solution is probably to manage everything as explicitly as possible, which means not performing GPU object creation in things like loops unless you understand it will be trivial in terms of performance and resource consumption.

I would suggest moving GPU array creation out of the loop:

from numba import cuda
from math import ceil

SegmentSize = 1000000
Loops = ceil(len(x_cpu) / SegmentSize)
Result = cuda.device_array((maxX-minX, maxY-minY))  # you should explicitly type these
x = cuda.device_array(SegmentSize, dtype=dtype)     # dtype should match the host arrays
y = cuda.device_array(SegmentSize, dtype=dtype)
z = cuda.device_array(SegmentSize, dtype=dtype)

for lopIdx in range(Loops):
    x.copy_to_device(x_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    y.copy_to_device(y_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    z.copy_to_device(z_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    CudaProc[blocks, 1024](x,y,z, Result)
    cuda.synchronize()
Result_CPU = Result.copy_to_host()

[ Code written in browser, never tested, use at own risk ]

That way you ensure that the memory is only allocated once, and you reuse the same memory through all the loop trips. You still don't have explicit control over when the intermediate arrays will be destroyed, but this way it prevents running out of memory within the loop.
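
One caveat not covered in the original answer: if len(x_cpu) is not an exact multiple of SegmentSize, the last slice is shorter than the preallocated device arrays, and copy_to_device will reject the shape mismatch. A minimal sketch of one way around that, assuming CudaProc can operate on device-array views (Numba device arrays support basic slicing):

for lopIdx in range(Loops):
    start = lopIdx * SegmentSize
    stop = min(start + SegmentSize, len(x_cpu))
    n = stop - start                           # shorter on the final trip
    x[:n].copy_to_device(x_cpu[start:stop])    # copy into a matching-size view
    y[:n].copy_to_device(y_cpu[start:stop])
    z[:n].copy_to_device(z_cpu[start:stop])
    CudaProc[blocks, 1024](x[:n], y[:n], z[:n], Result)
    cuda.synchronize()

Alternatively, pass n to the kernel and have threads with indices at or beyond n return early.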
