
Heap optimized for (but not limited to) single-threaded usage

I use a custom heap implementation in one of my projects. It consists of two major parts:

  1. Fixed-size block heap, i.e. a heap that allocates blocks of one specific size only. It allocates larger memory blocks (either virtual-memory pages or blocks from another heap), and then divides them into atomic allocation units.

    It performs allocation/freeing fast (in O(1)), and there's no per-allocation memory overhead, not counting whatever the external heap imposes (see the sketch after this list).

  2. Global general-purpose heap. It consists of buckets of the above fixed-size heaps. Given the requested allocation size, it chooses the appropriate bucket and performs the allocation through it.

    Since the whole application is (heavily) multi-threaded, the global heap locks the appropriate bucket during its operation.

    Note: in contrast to traditional heaps, this heap requires the allocation size not only for allocation, but also for freeing. This makes it possible to identify the appropriate bucket without searches or extra memory overhead (such as storing the block size in a header preceding the allocated block). Though somewhat less convenient, this is OK in my case. Moreover, since the "bucket configuration" is known at compile time (implemented via C++ template voodoo), the appropriate bucket is determined at compile time.
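Roughly, the scheme looks like this (a simplified sketch, not my actual code; FixedBlockHeap, GlobalHeap and the 16/32/64-byte bucket set are illustrative):

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>
    #include <new>

    // Part 1: fixed-size block heap. Blocks are carved out of larger chunks and
    // chained into an intrusive free list: O(1) alloc/free, no per-block header.
    template <std::size_t BlockSize>
    class FixedBlockHeap {
        static_assert(BlockSize >= sizeof(void*), "block must hold a free-list link");
        struct FreeNode { FreeNode* next; };
        FreeNode* free_list_ = nullptr;

        void grow() {
            // Grab a larger block (malloc stands in for virtual-memory pages or
            // an external heap) and divide it into atomic allocation units.
            constexpr std::size_t kChunk = 64 * 1024;
            char* chunk = static_cast<char*>(std::malloc(kChunk));
            if (!chunk) throw std::bad_alloc();
            for (std::size_t off = 0; off + BlockSize <= kChunk; off += BlockSize) {
                auto* node = reinterpret_cast<FreeNode*>(chunk + off);
                node->next = free_list_;
                free_list_ = node;
            }
        }

    public:
        void* allocate() {              // O(1): pop the free-list head
            if (!free_list_) grow();
            FreeNode* node = free_list_;
            free_list_ = node->next;
            return node;
        }
        void free(void* p) {            // O(1): push back onto the free list
            auto* node = static_cast<FreeNode*>(p);
            node->next = free_list_;
            free_list_ = node;
        }
    };

    // Part 2: the global heap. The size is a template argument, so the bucket
    // is resolved at compile time; each bucket carries its own lock because
    // the application is multi-threaded.
    class GlobalHeap {
        template <std::size_t N>
        struct Bucket { std::mutex lock; FixedBlockHeap<N> heap; };
        Bucket<16> b16_; Bucket<32> b32_; Bucket<64> b64_;  // example bucket set

        template <std::size_t Size> auto& bucket() {
            static_assert(Size <= 64, "sketch only covers sizes up to 64 bytes");
            if constexpr (Size <= 16)      return b16_;
            else if constexpr (Size <= 32) return b32_;
            else                           return b64_;
        }

    public:
        template <std::size_t Size> void* allocate() {
            auto& b = bucket<Size>();
            std::lock_guard<std::mutex> g(b.lock);
            return b.heap.allocate();
        }
        template <std::size_t Size> void free(void* p) {  // size needed to free, too
            auto& b = bucket<Size>();
            std::lock_guard<std::mutex> g(b.lock);
            b.heap.free(p);
        }
    };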

So far everything looks (and works) good.

Recently I worked on an algorithm that performs heap operations heavily, and is naturally affected significantly by heap performance. Profiling revealed that its performance is considerably impacted by the locking. That is, the heap itself works very fast (a typical allocation involves just a few memory-dereferencing instructions), but since the whole application is multi-threaded, the appropriate bucket is protected by a critical section, which relies on interlocked instructions, and those are much heavier.

For now I've fixed this by giving the algorithm its own dedicated heap, which is not protected by a critical section. But this imposes several problems and restrictions at the code level, such as the need to pass the context information deep down the stack wherever the heap may be needed. One may also use TLS to avoid this, but that may cause re-entrancy problems in my specific case.
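For illustration, a minimal sketch of the TLS alternative (PrivateHeap, HeapScope and algo_alloc are hypothetical names; the stub forwards to the global allocator just to keep it self-contained). The re-entrancy caveat applies: code that re-enters allocation while a HeapScope is active silently inherits the installed heap.

    #include <cstddef>
    #include <new>

    // Hypothetical unsynchronized heap; in reality this would be the
    // dedicated, lock-free heap handed to the algorithm.
    struct PrivateHeap {
        void* allocate(std::size_t n) { return ::operator new(n); }
        void  free(void* p)           { ::operator delete(p); }
    };

    thread_local PrivateHeap* tls_heap = nullptr;   // the TLS context

    struct HeapScope {              // RAII: install a heap for the current scope
        PrivateHeap* saved_;
        explicit HeapScope(PrivateHeap* h) : saved_(tls_heap) { tls_heap = h; }
        ~HeapScope() { tls_heap = saved_; }
    };

    // Deep inside the algorithm, no context parameter needs to be passed:
    void* algo_alloc(std::size_t n) {
        return tls_heap ? tls_heap->allocate(n) : ::operator new(n);
    }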

This makes me wonder: is there a known technique to optimize a heap for (but not limited to) single-threaded usage?

EDIT:

Special thanks to @Voo for suggesting I check out Google's tcmalloc.

It seems to work more or less like what I did (at least for small objects). But in addition it solves the exact issue I have, by maintaining per-thread caching.

I too thought in this direction, but about maintaining per-thread heaps. Then freeing a memory block allocated from a heap belonging to another thread is somewhat tricky: one would have to insert it into a sort of locked queue, and the owning thread would have to be notified so it can free the pending allocations asynchronously. Asynchronous deallocation can cause problems: if that thread is busy for some reason (for instance, performing aggressive calculations), no memory actually gets deallocated. Plus, in a multi-threaded scenario the cost of deallocation is significantly higher.

OTOH the idea of caching seems much simpler and more efficient. I'll try to work it out, along the lines of the sketch below.
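A minimal sketch of the caching direction, for a single block size (CentralHeap, ThreadCache and the batch sizes are illustrative, not tcmalloc's actual design). Each thread keeps a small free list of blocks; only an empty or overfull cache touches the lock-protected central heap, so the interlocked cost is amortized over many operations.

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>
    #include <new>

    struct FreeNode { FreeNode* next; };

    class CentralHeap {                 // locked fallback, one block size
        std::mutex lock_;
        FreeNode* free_list_ = nullptr;
    public:
        static constexpr std::size_t kBlockSize = 64;

        // Pop up to `want` blocks under the lock; mallocs fresh blocks when
        // the central list runs dry (a real central heap would carve chunks).
        FreeNode* take_batch(int want) {
            std::lock_guard<std::mutex> g(lock_);
            FreeNode* head = nullptr;
            for (int i = 0; i < want; ++i) {
                FreeNode* n = free_list_;
                if (n) free_list_ = n->next;
                else   n = static_cast<FreeNode*>(std::malloc(kBlockSize));
                if (!n) break;
                n->next = head;
                head = n;
            }
            return head;
        }

        void give_batch(FreeNode* head) {   // accept a chain back under the lock
            std::lock_guard<std::mutex> g(lock_);
            while (head) {
                FreeNode* next = head->next;
                head->next = free_list_;
                free_list_ = head;
                head = next;
            }
        }
    };

    CentralHeap g_central;

    class ThreadCache {                 // per-thread: no locking on the hot path
        FreeNode* cache_ = nullptr;
        int count_ = 0;
        static constexpr int kBatch = 32;   // blocks fetched per refill
        static constexpr int kMax   = 64;   // flush threshold
    public:
        void* allocate() {
            if (!cache_) {              // slow path: refill from the center
                cache_ = g_central.take_batch(kBatch);
                count_ = 0;
                for (FreeNode* n = cache_; n; n = n->next) ++count_;
            }
            FreeNode* n = cache_;
            if (!n) throw std::bad_alloc();
            cache_ = n->next;
            --count_;
            return n;
        }
        void free(void* p) {
            auto* n = static_cast<FreeNode*>(p);
            n->next = cache_;
            cache_ = n;
            if (++count_ > kMax) {      // overfull: return half to the center
                FreeNode* keep = cache_;
                for (int i = 1; i < kMax / 2; ++i) keep = keep->next;
                FreeNode* excess = keep->next;
                keep->next = nullptr;
                count_ = kMax / 2;
                g_central.give_batch(excess);
            }
        }
    };

    thread_local ThreadCache tls_cache;     // one cache per thread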

Thanks a lot.

PS:

Indeed, Google's tcmalloc is great. I believe it's implemented pretty much like what I did (at least the fixed-size part).

But, to be pedantic, there's one respect in which my heap is superior. According to the docs, tcmalloc imposes an overhead of roughly 1% (asymptotically), whereas my overhead is 0.0061%. It's 4/64K, to be exact.

:)

One thought is to maintain a memory allocator per thread. Pre-assign fairly chunky blocks of memory to each allocator from a global memory pool. Design your algorithm to assign the chunky blocks from adjacent memory addresses (more on that later).

When the allocator for a given thread is low on memory, it requests more memory from the global memory pool. This operation requires a lock, but should occur far less frequently than in your current case. When the allocator for a given thread frees its last byte, return all memory for that allocator to the global memory pool (assuming the thread has terminated).

This approach will tend to exhaust memory earlier than your current approach (memory can be reserved for one thread that never needs it). The extent to which that is an issue depends on the thread creation/lifetime/destruction profile of your app(s). You can mitigate that at the expense of additional complexity, e.g. by introducing a signal, raised when a thread's allocator and the global pool are both out of memory, that other memory allocators can respond to by freeing some memory.

An advantage of this scheme is that it will tend to eliminate false sharing, as memory for a given thread will tend to be allocated in contiguous address spaces.
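A minimal sketch of this scheme (GlobalPool and ThreadArena are illustrative names; freeing and returning chunks to the pool is omitted for brevity, and malloc stands in for the pool's adjacent-address reservation):

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>
    #include <new>

    class GlobalPool {                  // hands out big chunks under a lock
        std::mutex lock_;
    public:
        static constexpr std::size_t kChunkSize = 1 << 20;  // 1 MiB "chunky" block
        void* take_chunk() {
            std::lock_guard<std::mutex> g(lock_);
            // A real pool would carve adjacent regions from one reservation,
            // keeping each thread's chunks contiguous (the false-sharing win).
            return std::malloc(kChunkSize);
        }
    };

    GlobalPool g_pool;

    class ThreadArena {                 // per-thread: no locking on the hot path
        char* cur_ = nullptr;
        char* end_ = nullptr;
    public:
        void* allocate(std::size_t n) {             // assumes n <= kChunkSize
            n = (n + 15) & ~std::size_t{15};        // keep 16-byte alignment
            if (static_cast<std::size_t>(end_ - cur_) < n) {
                // Infrequent slow path: take the lock and grab another chunk.
                cur_ = static_cast<char*>(g_pool.take_chunk());
                if (!cur_) throw std::bad_alloc();
                end_ = cur_ + GlobalPool::kChunkSize;
            }
            void* p = cur_;
            cur_ += n;
            return p;
        }
    };

    thread_local ThreadArena tls_arena;     // one arena per thread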

On a side note, if you have not already read it, I suggest IBM's Inside Memory Management article for anyone implementing their own memory management.

UPDATE

If the goal is to have very fast memory allocation optimized for a multi-threaded environment (as opposed to learning how to do it yourself), have a look at alternative memory allocators. If the goal is learning, perhaps check out their source code.

It might be a good idea to read Jeff Bonwick's classic papers on the slab allocator and vmem. The original slab allocator sounds somewhat like what you're doing. Although it's not very multithread-friendly, it might give you some ideas.

The Slab Allocator: An Object-Caching Kernel Memory Allocator

He then extended the concept with vmem, which will definitely give you some ideas, since it had very nice behavior in a multi-CPU environment.

Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources
