
Thread safe memory pool

My application is highly performance critical and requests 3-5 million objects per frame. Initially, to get the ball rolling, I new'd everything just to get the application working and to test my algorithms. The application is multi-threaded.

Once I was happy with the performance, I started to create a memory manager for my objects. The obvious reason is memory fragmentation and wastage. The application could not continue for more than a few frames before crashing due to memory fragmentation. I have checked for memory leaks and know the application is leak free.

So I started creating a simple memory manager using TBB's concurrent_queue. The queue stores the maximum set of elements the application is allowed to use. The class requiring new elements pops elements from the queue. The try_pop method is, according to Intel's documentation, lock-free. This worked quite well as far as memory consumption goes (although there is still memory fragmentation, not nearly as much as before). The problem I am facing now is that the application's performance has slowed down approximately 4 times according to my own simple profiler (I do not have access to commercial profilers, nor do I know of any that will work on a real-time application... any recommendation would be appreciated).

My question is: is there a thread-safe memory pool that is scalable? A must-have feature of the pool is fast recycling of elements and making them available again. If there is none, any performance tips/tricks?

EDIT: I thought I would explain the problem a bit more. I could easily initialize n arrays, where n is the number of threads, and start using the objects from the arrays per thread. This will work perfectly for some cases. In my case, I am recycling the elements as well (potentially every frame) and they could be recycled at any point in the array; i.e. it may be elementArray[0] or elementArray[10] or elementArray[1000]. Now I will have a fragmented array of elements, consisting of elements that are ready to be used and elements that are in-use :(

As said in the comments, don't get a thread-safe memory allocator; allocate memory per-thread.

As you implied in your update, you need to manage free/in-use effectively. That is a pretty straightforward problem, given a constant type and no concurrency.

For example (off the top of my head, untested):

template<typename T>
class ThreadStorage
{
    std::vector<T> m_objs;
    std::vector<size_t> m_avail;

public:
    explicit ThreadStorage(size_t count) : m_objs(count, T()) {
        m_avail.reserve(count);
        for (size_t i = 0; i < count; ++i) m_avail.push_back(i);
    }

    T* alloc() {
        if (m_avail.empty()) return nullptr; // pool exhausted
        T* retval = &m_objs[0] + m_avail.back();
        m_avail.pop_back();
        return retval;
    }

    void free(T* p) {
        *p = T(); // Assuming this is enough destruction.
        m_avail.push_back(p - &m_objs[0]);
    }
};

Then, for each thread, have a ThreadStorage instance, and call alloc() and free() as required.

You can add smart pointers to manage calling free() for you, and you can optimise constructor/destructor calling if that's expensive.

You can also look at boost::pool.

Update:

The new requirement for keeping track of things that have been used, so that they can be processed in a second pass, seems a bit unclear to me. I think you mean that when the primary processing is finished on an object, you need to not release it, but keep a reference to it for second-stage processing. Some objects will just be released back to the pool and not used for second-stage processing.

I assume you want to do this in the same thread.

As a first pass, you could add a method like this to ThreadStorage, and call it when you want to do processing on all unreleased instances of T. No extra bookkeeping required.

void do_processing(boost::function<void (T* p)> const& f) {
    std::sort(m_avail.begin(), m_avail.end());

    size_t o = 0;
    for (size_t i = 0; i != m_avail.size(); ++i) {
        if (o < m_avail[i]) {
            do {
                f(&m_objs[o]);
            } while (++o < m_avail[i]);
            ++o;
        } else if (o == m_avail[i])
            ++o;
    }

    for (; o < m_objs.size(); ++o) f(&m_objs[o]);
}

Assumes no other thread is using the ThreadStorage instance, which is reasonable because it is thread-local by design. Again, off the top of my head, untested.

Google's TCMalloc:

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.

Performance:

TCMalloc is faster than the glibc 2.3 malloc... ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair...

You might want to look at jemalloc.
