
Cannot allocate memory fast enough?

Assume you are tasked with addressing a performance bottleneck in an application. Via profiling, we discover the bottleneck is related to memory allocation. We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory. Why would we be seeing this behavior, and how might we increase the rate at which the application can allocate memory? (Assume that we cannot change the size of the memory blocks that we are allocating. Assume that we cannot reduce the use of dynamically allocated memory.)

Okay, a few solutions exist - however, almost all of them seem to be excluded by some constraint or another.

1. Have more threads allocate memory

We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory.

From this, we can cross off any ideas of adding more threads (since "no matter how many threads"...).

2. Allocate more memory at a time

Assume that we cannot change the size of the memory blocks that we are allocating.

Fairly obviously, we have to allocate the same block size.

3. Use (some) static memory

Assume that we cannot reduce the use of dynamically allocated memory.

This one I found most interesting. It reminded me of a story I heard about a FORTRAN programmer (before Fortran had dynamic memory allocation) who just used a HUGE statically allocated array as a private heap. Unfortunately, this constraint prevents us from using such a trick. However, it does give a glimpse into one aspect of a (the) solution.


My Solution

At the start of execution (either of the program, or on a per-thread basis), make several* memory allocation system calls. Then use the memory from these later in the program (along with the existing dynamic memory allocations).

* Note: the 'several' would probably be an exact number, determined from the profiling which the question mentions at the beginning.
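
To make this concrete, here is a minimal sketch in C++. The names (preallocate, fastAlloc, fastFree) and the constants are mine, purely illustrative; the real counts would come from that profiling. A per-thread free list is filled once up front, so the hot path touches no shared allocator at all.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative numbers; in practice both would come from the profiling.
constexpr std::size_t BLOCK_SIZE = 64;       // the fixed block size we must keep
constexpr std::size_t PREALLOC_COUNT = 1024; // the "several" from above

// One free list per thread, filled once at start-up (or thread start).
thread_local std::vector<void *> freeList;

void preallocate() {
    freeList.reserve(PREALLOC_COUNT);
    for (std::size_t i = 0; i < PREALLOC_COUNT; ++i)
        freeList.push_back(std::malloc(BLOCK_SIZE));
}

void *fastAlloc() {
    if (!freeList.empty()) {         // hot path: no lock, no shared pool
        void *p = freeList.back();
        freeList.pop_back();
        return p;
    }
    return std::malloc(BLOCK_SIZE);  // ran dry: fall back to the normal path
}

void fastFree(void *p) {
    freeList.push_back(p);           // recycle into the private list
}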

TL;DR

The trick is to modify the timing of the memory allocations.

... the application can only perform N memory allocations per second, no matter how many threads we have allocating memory. Why would we be seeing this behavior and how might we increase the rate at which the application can allocate memory?

IMHO, the most likely cause is that the allocations are coming from a common system pool.

Because they share a pool, each thread has to gain access through some critical-section blocking mechanism (perhaps a semaphore).

The more threads there are competing for dynamic memory (i.e., using new), the more critical-section blocking occurs.

The context switches between tasks are where the time is wasted.
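
As a toy model of that presumed cause (this is an assumption about the allocator's internals, not something the question states), imagine every allocation funneling through one lock. Adding threads then adds contention, not throughput:

#include <cstddef>
#include <cstdlib>
#include <mutex>

// Toy model of the presumed bottleneck: one lock guards the shared pool,
// so allocating threads serialize here no matter how many of them exist.
std::mutex poolLock;

void *pooledAlloc(std::size_t size) {
    std::lock_guard<std::mutex> guard(poolLock); // every thread queues up here
    return std::malloc(size);                    // stand-in for the pool logic
}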


How to increase the rate?

option 1 - serialize the usage ... and this means, of course, that you cannot simply add a semaphore at another level. For one system I worked on, high dynamic memory utilization happened during system start-up. In that case, it was easiest to change the start-up so that thread n+1 (of this collection) only started after thread n had completed its initialization and fallen into its wait-for-input loop. With only one thread doing its start-up work at a time (and very few other dynamic memory users yet running), no critical-section blocking occurred. Four simultaneous start-ups would take 30 seconds; four serialized start-ups finished in 5 seconds.
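
A minimal sketch of that serialization, assuming C++11 threads (initThread and runThread are illustrative stand-ins, not from the original system): each thread signals when its initialization is done, and the next thread is only launched after that signal.

#include <future>
#include <thread>
#include <vector>

// Illustrative stand-ins for the real start-up and main-loop work.
void initThread() { /* allocation-heavy initialization */ }
void runThread()  { /* wait-for-input loop */ }

int main() {
    const int N = 4;
    std::vector<std::thread> threads;
    for (int i = 0; i < N; ++i) {
        std::promise<void> initDone;
        std::future<void> ready = initDone.get_future();
        threads.emplace_back([&initDone] {
            initThread();           // only this thread is allocating right now
            initDone.set_value();   // signal: initialization finished
            runThread();
        });
        ready.wait();               // thread n+1 starts only after this
    }
    for (auto &t : threads) t.join();
}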

option 2 - provide a pool of RAM and a private new/delete for each particular thread. If only one thread accesses a pool at a time, no critical section or semaphore is needed. In an embedded system, the challenge is to allocate a reasonable amount of private pool for each thread without too much waste. On a desktop with multiple gigabytes of RAM, this is probably less of a problem.
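
A sketch of such a private pool, assuming a fixed block size and one thread_local pool per thread (all names and sizes here are illustrative). Since only the owning thread ever touches it, no lock is needed on the hot path:

#include <cstddef>
#include <cstdlib>

// A fixed-block pool owned by one thread, so it needs no lock at all.
// BLOCK and COUNT are illustrative; real values would come from profiling.
class PrivatePool {
    static constexpr std::size_t BLOCK = 64;
    static constexpr std::size_t COUNT = 4096;
    alignas(std::max_align_t) char storage[BLOCK * COUNT];
    void *freeList = nullptr;

public:
    PrivatePool() {
        // Thread every block into an intrusive free list.
        for (std::size_t i = 0; i < COUNT; ++i) {
            void *p = storage + i * BLOCK;
            *static_cast<void **>(p) = freeList;
            freeList = p;
        }
    }
    void *alloc() {
        if (!freeList)
            return std::malloc(BLOCK);   // pool exhausted: fall back (rare)
        void *p = freeList;
        freeList = *static_cast<void **>(p);
        return p;
    }
    void release(void *p) {
        // Simplification: fallback blocks are recycled into the pool too.
        *static_cast<void **>(p) = freeList;
        freeList = p;
    }
};

// One private pool per thread: no critical section on the hot path.
thread_local PrivatePool myPool;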

This looks like a challenging problem, though without details you can only make some guesses. (Which is most likely the idea of this question.)

The limitation here is the number of allocations, not the size of each allocation. If we can assume that you are in control of where the allocations occur, you can allocate the memory for multiple instances at once. Please consider the code below as pseudo-code; it is only for illustration purposes.

#include <cstdlib>   // std::malloc, std::free
#include <new>       // placement new

const static size_t NR_COMBINED_ALLOCATIONS = 16;
// One allocation provides backing storage for many objects.
auto *memoryBuffer = static_cast<char *>(std::malloc(sizeof(MyClass) * NR_COMBINED_ALLOCATIONS));
size_t nextIndex = 0;
// Some looping code
    // Placement new constructs in the pre-allocated buffer: no new allocation.
    auto *myNewClass = new (memoryBuffer + (nextIndex++) * sizeof(MyClass)) MyClass;
    // Some code
    myNewClass->~MyClass();  // destroy explicitly; the memory is freed below
std::free(memoryBuffer);

Your code will most likely become a lot more complex, but you will most likely remove this bottleneck. If you have to return these newly created objects, you will need even more code just to do the memory management.

Given this information, you can write your own allocator implementation for the STL, override the 'new' and 'delete' operators ...
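
For the STL route, a minimal C++11 allocator shell could look like the following (a sketch, not a full implementation: std::malloc/std::free mark the spots where your pooled allocation calls would go):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal C++11 allocator shell; std::allocator_traits fills in the rest.
template <typename T>
struct PoolAllocator {
    using value_type = T;

    PoolAllocator() = default;
    template <typename U>
    PoolAllocator(const PoolAllocator<U> &) {}

    T *allocate(std::size_t n) {
        return static_cast<T *>(std::malloc(n * sizeof(T))); // pooled alloc here
    }
    void deallocate(T *p, std::size_t) {
        std::free(p);                                        // pooled free here
    }
};

template <typename T, typename U>
bool operator==(const PoolAllocator<T> &, const PoolAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T> &, const PoolAllocator<U> &) { return false; }

// Usage: std::vector<int, PoolAllocator<int>> v;
// Every allocation the vector makes now goes through allocate()/deallocate().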

If that is not enough, try challenging the limitations. Why can you only do a fixed number of allocations; is this because of unique locking? If so, can we improve this? Why do you need that many allocations; would changing the algorithm being used fix the issue ...

I believe you could use a separate thread that is responsible for memory allocation. This thread would own a queue of allocation requests, each pairing a thread identifier with the needed allocation size. Threads would not allocate memory directly, but rather send an allocation request to the queue and go into a wait state. The allocator thread, in turn, would try to process each requested memory allocation from the queue and wake the corresponding sleeping thread up. When the thread responsible for memory handling cannot process an allocation due to the limitation, it should wait until memory can be allocated again.
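
A condensed sketch of that design, assuming C++11 primitives; the request shape and the names are illustrative, and a promise/future pair stands in for the "wait state" and wake-up described above:

#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// A request pairs the needed size with a promise the worker fulfills.
struct AllocRequest {
    std::size_t size;
    std::promise<void *> result;
};

std::queue<AllocRequest> requests;
std::mutex              queueLock;
std::condition_variable queueCv;

// The dedicated allocator thread: drains requests and wakes the sleepers.
void allocatorThread() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queueLock);
        queueCv.wait(lock, [] { return !requests.empty(); });
        AllocRequest req = std::move(requests.front());
        requests.pop();
        lock.unlock();
        req.result.set_value(std::malloc(req.size)); // fulfil; waiter wakes up
    }
}

// Called by worker threads: enqueue a request, then block until it is served.
void *requestMemory(std::size_t size) {
    std::promise<void *> p;
    auto f = p.get_future();
    {
        std::lock_guard<std::mutex> lock(queueLock);
        requests.push(AllocRequest{size, std::move(p)});
    }
    queueCv.notify_one();
    return f.get();   // the "wait state" until the allocation is done
}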

One could build another layer into the solution, as @Tersosauros's answer suggests, to slightly optimize speed, but it should be based on something like the idea above nonetheless.
