
Can I allocate more memory than necessary with cudaMalloc to avoid reallocating?

I am writing code that performs calculations with thousands of sparse matrices on the GPU using cuSparse. Because memory on the GPU is limited, I need to process them one at a time, as the rest of the memory is taken up by other GPU variables and dense matrices.

My workflow (in pseudo-code) is the following:

for (i = 0; i < 1000; i++) {
    // allocate sparse matrix using cudaMalloc
    // copy sparse matrix from host using cudaMemcpy
    // do calculation by calling cuSparse
    // deallocate sparse matrix with cudaFree
}

In the above, I allocate and free the memory for each sparse matrix at every iteration because their sparsities vary, and therefore so does the amount of memory each one needs.

Can I instead do something like:

// allocate buffer once at the beginning using cudaMalloc, with enough extra
// space that even the sparse matrix with the highest density would fit
for (i = 0; i < 1000; i++) {
    // copy sparse matrix from host using cudaMemcpy into the same buffer
    // do calculation by calling cuSparse
}
// free the buffer once at the end using cudaFree

The above avoids having to malloc and free the buffer in each iteration. Would the above work? Would it improve performance? Is it good practice or is there a better way to do this?
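The one-time sizing step can be sketched as follows. Assuming the matrices are stored in CSR format (double values, int indices), the buffer only needs to be as large as the worst-case matrix. The arithmetic below is plain host C++ so it runs anywhere; the function names and the example sizes are illustrative, and the resulting byte count is what a single up-front cudaMalloc would be called with.

```cpp
#include <cstddef>
#include <algorithm>
#include <vector>

// Bytes needed to hold one CSR matrix with double values and int indices:
// values (nnz doubles) + column indices (nnz ints) + row pointers (numRows+1 ints).
std::size_t csrBytes(std::size_t numRows, std::size_t nnz) {
    return nnz * sizeof(double)          // values
         + nnz * sizeof(int)             // column indices
         + (numRows + 1) * sizeof(int);  // row pointers
}

// Size the buffer once for the worst-case (densest) matrix; this replaces
// the per-iteration allocate/free pair with one allocation outside the loop.
std::size_t worstCaseBytes(const std::vector<std::size_t>& rows,
                           const std::vector<std::size_t>& nnz) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < nnz.size(); ++i)
        bytes = std::max(bytes, csrBytes(rows[i], nnz[i]));
    return bytes;
}
```

Each iteration then cudaMemcpy's the current matrix's three arrays into disjoint slices of that one buffer instead of allocating fresh storage.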

The above avoids having to malloc and free the buffer in each iteration. Would the above work?

In principle, yes.

Would it improve performance?

Probably. Memory allocation and deallocation aren't without latency.

Is it good practice or is there a better way to do this?

Generally speaking, yes. Many widely used GPU-accelerated frameworks (TensorFlow, for example) use this strategy to reduce the cost of memory management on the GPU. Whether it benefits your use case is something you will have to test for yourself.

tl;dr: Yes, pre-allocate

I'll be slightly more blunt than @talonmies:

cudaMalloc() and cudaFree() are very slow. They are also unnecessary when you have no other potential contender for GPU memory: just "take it all" by allocating as much as you expect to possibly use. Then use a sub-allocator, or an allocator initialized with a given slab, to sub-allocate within that. If the framework you work with provides this, use it; otherwise, write it yourself or look for a library that does it for you.
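A minimal bump (arena) sub-allocator over a pre-allocated slab might look like the sketch below. The pointer arithmetic is identical whether the slab comes from cudaMalloc or the host heap, so it is written in plain C++ here so it runs anywhere; the `Arena` class and the 256-byte default alignment (matching the alignment cudaMalloc is documented to provide) are illustrative choices, not a real CUDA API.

```cpp
#include <cstddef>
#include <cstdint>

// A trivial bump sub-allocator: hand out aligned slices of one big slab,
// then reset the whole thing between iterations instead of calling cudaFree.
class Arena {
public:
    Arena(void* slab, std::size_t bytes)
        : base_(static_cast<std::uint8_t*>(slab)), size_(bytes), offset_(0) {}

    // Returns nullptr when the slab is exhausted. The 256-byte default
    // alignment mirrors what cudaMalloc guarantees for its returned pointers.
    void* alloc(std::size_t bytes, std::size_t align = 256) {
        std::size_t aligned = (offset_ + align - 1) / align * align;
        if (aligned + bytes > size_) return nullptr;
        offset_ = aligned + bytes;
        return base_ + aligned;
    }

    // O(1) "free everything" -- this is what replaces per-matrix cudaFree.
    void reset() { offset_ = 0; }

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::size_t offset_;
};
```

In the GPU version, the slab pointer would come from a single cudaMalloc at startup; each iteration would carve the current matrix's values, column-index, and row-pointer arrays out of the arena, and reset() would run once the cuSparse call for that matrix has finished with them.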
