
What the heque is going on with the memory overhead of std::deque?

I am working on an external sorting algorithm that uses std::queue and must carefully constrain its memory usage. I have noticed that during the merge phase (which uses several std::queues of fixed length), my memory usage increases to about 2.5X what I expected. Since std::queue by default uses std::deque as its underlying container, I ran some tests on std::deque to determine its memory overhead. Here are the results, running on VC++ 9, in release mode, with a 64-bit process:

When adding 100,000,000 chars to a std::deque, the memory usage grows to 252,216K. Note that 100M chars (1 byte each) should occupy 97,656K, so this is an overhead of 154,560K.

I repeated the test with doubles (8 bytes) and saw memory grow to 1,976,676K, while 100M doubles should occupy 781,250K, for an overhead of 1,195,426K!!

Now I understand that std::deque is normally implemented as a linked list of "chunks." If this is true, then why is the overhead proportional to the element size (since the pointer size should of course be fixed at 8 bytes)? And why is it so danged huge?

Can anybody shed some light on why std::deque uses so much danged memory? I'm thinking I should switch my std::queue underlying containers to std::vector, as there is no overhead (given that the appropriate size is reserved). I'm thinking the benefits of std::deque are largely negated by the fact that it has such a huge overhead (resulting in cache misses, page faults, etc.), and that the cost of copying std::vector elements may be less, given that the overall memory usage is so much lower. Is this just a bad implementation of std::deque by Microsoft?

Look at the code for _DEQUESIZ (number of elements per block):

#define _DEQUESIZ   (sizeof (_Ty) <= 1 ? 16 \
    : sizeof (_Ty) <= 2 ? 8 \
    : sizeof (_Ty) <= 4 ? 4 \
    : sizeof (_Ty) <= 8 ? 2 : 1)    /* elements per block (a power of 2) */

The block gets smaller as the element gets larger. Only for elements larger than 8 bytes will you get the expected behavior (a percentage decrease in overhead as the element size increases).
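To put rough numbers on that for the double case, here is a back-of-the-envelope sketch. The 8-byte map pointer per block and the roughly 16 bytes of per-allocation heap bookkeeping are assumptions for illustration; the exact figures depend on the heap and on the deque implementation:

#include <iostream>

int main()
{
    // For double, _DEQUESIZ is 2, so every block holds just 16 bytes of payload.
    const long long elements     = 100000000LL;         // 100M doubles
    const long long perBlock     = 2;                   // _DEQUESIZ for 8-byte elements
    const long long blocks       = elements / perBlock; // 50M separate heap allocations
    const long long blockData    = perBlock * 8;        // 16 bytes of doubles per block
    const long long mapPointer   = 8;                   // one pointer per block in the deque's map
    const long long heapOverhead = 16;                  // rough guess at per-allocation bookkeeping

    const long long total = blocks * (blockData + mapPointer + heapOverhead);
    std::cout << "estimated footprint: " << total / 1024 << " K\n";  // ~1,953,125K
    // ...which is in the same ballpark as the observed 1,976,676K.
}

In other words, halving the element count per block doubles the number of heap allocations and map pointers, so the fixed per-block cost ends up scaling with the element size rather than shrinking relative to it.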

Is it possible that you are running Debug binaries? 252MB for 100M chars does seem like a lot...

You can check the attribution of this using umdh to snapshot before and after, and then compare the two - that might shed some light on why it's larger than you expected.

EDIT: FYI - When I run this outside the debugger on VS2010, I get 181MB with chars:

#include <cstddef>
#include <deque>

int main()
{
    std::deque<char> mydequeue;
    for (std::size_t i = 0; i < 100 * 1024 * 1024; ++i)
        mydequeue.push_back(char(i));
}

EDIT: Supporting the other answer from @Dialecticus, this gives me the same footprint as double:

struct twoInt64s
{
public:
    twoInt64s(__int64 _a, __int64 _b) : a(_a), b(_b) {}

    __int64 a;
    __int64 b;
};
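(The loop that produced this measurement isn't shown; presumably it mirrored the char test above. A hypothetical driver, assuming 50M elements so that the payload matches 100M doubles:)

#include <cstddef>
#include <deque>

// Uses the twoInt64s struct defined above. Since sizeof(twoInt64s) == 16 > 8,
// _DEQUESIZ is 1, i.e. one element per block - so there are just as many
// 16-byte blocks as with 100M doubles, hence the identical footprint.
int main()
{
    std::deque<twoInt64s> mydequeue;
    for (std::size_t i = 0; i < 50 * 1024 * 1024; ++i)
        mydequeue.push_back(twoInt64s(i, i));
}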

EDIT: With _DEQUESIZ modified as shown below (128 chars per block), 100M chars now takes 113MB of memory.

My conclusion is that the remaining overhead you saw is due to the management structures for the deque blocks, which hold only 16 chars of data each, plus control info for the deque, plus more control info for the heap manager.

#define _DEQUESIZ   (sizeof (value_type) <= 1 ? 128 \
    : sizeof (value_type) <= 2 ? 8 \
    : sizeof (value_type) <= 4 ? 4 \
    : sizeof (value_type) <= 8 ? 2 \
    : 1)    /* elements per block (a power of 2) */

Moral - if you really want to optimize this for your special purpose, be prepared to play with <deque>. Its behaviour depends critically on the size of your elements, and beyond that on the expected usage pattern.

EDIT: Depending on your knowledge of queue sizes, you might be able to drop in boost::circular_buffer as a replacement for the std::queue container. I bet this would perform more like you want (and expected).
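For illustration, a minimal sketch of using boost::circular_buffer directly as a fixed-capacity FIFO (the capacity here is arbitrary; with a known bound on the queue length you would size it accordingly):

#include <boost/circular_buffer.hpp>
#include <iostream>

int main()
{
    // One contiguous allocation of fixed capacity - no per-block overhead,
    // and no reallocation as long as the capacity is never exceeded.
    boost::circular_buffer<double> q(1024);

    for (int i = 0; i < 100; ++i)
        q.push_back(i);            // enqueue at the back

    while (!q.empty())
    {
        double value = q.front();  // peek at the oldest element
        q.pop_front();             // dequeue from the front
        (void)value;
    }

    std::cout << "capacity stays fixed at " << q.capacity() << "\n";
}

Note that once the buffer is full, push_back overwrites the oldest element, so the bound on the queue length has to be respected by the caller (or checked with full()).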

Without looking at the actual implementation of std::queue you're using, my guess is that its memory allocation looks something like this:

if (new element won't fit) {
    double the size of the backing storage
    realloc the buffer (which will probably copy all elements)
}

The reason for doubling rather than being more conservative is that you want the queue.push_back operation to have O(1) average time. Since the reallocation may copy the existing elements, a version that only grew the array as needed (1 element at a time) would be O(n^2) as you initially push all of your values into the queue. I'll leave it as an exercise for the reader how the doubling version gives constant average time.

Since you are quoting the size of the entire process, your estimate of about 2x overhead when you push slightly more than a power of 2 (2^26 < 100MM < 2^27) worth of elements seems reasonable. Try stopping at 2^(n-1), measuring, then pushing a few elements and measuring again.
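A sketch of that measurement on Windows, assuming the Win32 process-memory API (the counter watched and the exact stopping points are just illustrative):

#include <cstddef>
#include <deque>
#include <iostream>

#include <windows.h>
#include <psapi.h>   // link with psapi.lib

// Current working set of the process, in KB.
static std::size_t workingSetKB()
{
    PROCESS_MEMORY_COUNTERS pmc = {};
    GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));
    return pmc.WorkingSetSize / 1024;
}

int main()
{
    std::deque<char> q;

    // Stop just below a power of two, measure, then push a little past it.
    for (std::size_t i = 0; i < (1u << 26); ++i)   // 2^26 elements
        q.push_back(char(i));
    std::cout << "at 2^26 elements: " << workingSetKB() << " KB\n";

    for (std::size_t i = 0; i < 1000; ++i)         // a few more pushes
        q.push_back(char(i));
    std::cout << "just past 2^26:   " << workingSetKB() << " KB\n";
}

If memory really were doubling vector-style, the second figure would jump dramatically; with the chunked deque it should only creep up by a few blocks.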
