
Collections and memory

I have an application that reads 3-4 GB of data, builds an entity out of each line, and then stores the entities in Lists.

The problem I had is that memory usage grows insanely, to something like 13-15 GB. Why on earth does storing these entities take so much memory?

So I built a tree and did something similar to Huffman encoding, and the overall memory size came down to around 200-300 MB.

I understand that I compacted the data. But I wasn't expecting that storing the objects in a list would increase memory usage so much. Why did that happen?

What about other data structures like dictionary, stack, queue, array, etc.?

Where can I find more information about the internals and memory allocation of these data structures?

Or am I doing something wrong?

In .NET, large objects go on the large object heap (LOH), which is not compacted. "Large" is everything above 85,000 bytes. When you grow your lists, their backing arrays will probably become larger than that and have to be reallocated once you cross the current capacity. Reallocation means they are very likely placed at the end of the heap, so you end up with a very fragmented LOH and lots of memory usage.

Update: If you initialize your lists with the required capacity (which I guess you can determine from the DB), your memory consumption should go down a bit.
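A minimal sketch of that suggestion (the element type and `rowCount` are made up for illustration; in practice the count might come from something like a `SELECT COUNT(*)`):

```csharp
using System;
using System.Collections.Generic;

class PresizeDemo
{
    static void Main()
    {
        int rowCount = 1_000_000; // assumed known up front, e.g. from the DB

        // Pre-sizing allocates the backing array once: no repeated
        // reallocation, no copying, no abandoned old arrays on the LOH.
        var entities = new List<int>(rowCount);
        for (int i = 0; i < rowCount; i++)
            entities.Add(i);

        Console.WriteLine(entities.Capacity == rowCount); // prints True
    }
}
```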

Regardless of the data structure you use, your memory consumption is never going to drop below the memory required to store all your data.

Have you calculated how much memory is required to store one instance of your class?

Your Huffman encoding is a space-saving optimization: you are eliminating a lot of duplicated data within your class objects yourself. That has nothing to do with the data structure you use to hold your data. It depends on how your data itself is structured, so that you can take advantage of different space-saving strategies (of which Huffman encoding is one out of many possibilities, suitable for eliminating common prefixes; the data structure used to store it is a tree).

Now, back to your question. Without optimizing your data (i.e. your objects), there are still things you can watch out for to improve memory usage efficiency.

Are all your objects of similar size?

Did you simply run a loop, allocate memory on the fly, and insert the objects into a list, like this:

foreach (var obj in collection) { myList.Add(new myObject(obj)); }

In that case, your list object is constantly being expanded. If there is not enough free memory at the end to expand the list in place, .NET allocates a new, larger piece of memory and copies the original array into it. Essentially you end up with two pieces of memory: the original one, and the new, expanded one (now holding the list). Do this many, many times (as you obviously need to for GBs of data), and you are looking at a LOT of fragmented memory space.

You'll be better off allocating enough memory for the entire list in one go.

As an afternote, I can't help but wonder: how in the world are you going to search this HUGE list to find something you need? Shouldn't you be using something like a binary tree or a hash table to aid your searching? Or maybe you are just reading in all the data, performing some processing on all of it, then writing it back out...
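For illustration, a hash-table index alongside (or instead of) the list turns each lookup from an O(n) scan into an average O(1) probe. `Entity`, `Id`, and `Payload` here are made-up names standing in for whatever the real per-line entity looks like:

```csharp
using System;
using System.Collections.Generic;

class IndexDemo
{
    // Hypothetical entity built from one input line.
    record Entity(int Id, string Payload);

    static void Main()
    {
        var byId = new Dictionary<int, Entity>();
        for (int i = 0; i < 1_000; i++)
            byId[i] = new Entity(i, "line " + i);

        // O(1) average lookup instead of scanning a huge List.
        Console.WriteLine(byId[42].Payload); // prints "line 42"
    }
}
```

Note that a Dictionary costs extra memory per entry (hash buckets and entry structs), so it only pays off if you actually need the lookups.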

If you are using classes, read the responses to this: Understanding CLR object size between 32 bit vs 64 bit

On 64 bits (you are using 64 bits, right?) the object overhead is 16 bytes, PLUS the reference to the object (someone is referencing it, right?), so another 8 bytes. An empty object will therefore "eat" at least 24 bytes.
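A crude way to observe that overhead empirically (a sketch only; the exact figure wobbles with runtime version and allocator noise, so treat the result as approximate):

```csharp
using System;

class SizeProbe
{
    class Empty { } // no fields: just the object header and type pointer

    static void Main()
    {
        const int N = 1_000_000;
        var keep = new object[N]; // references: 8 bytes each on 64-bit,
                                  // allocated before the measurement below

        long before = GC.GetTotalMemory(forceFullCollection: true);
        for (int i = 0; i < N; i++) keep[i] = new Empty();
        long after = GC.GetTotalMemory(forceFullCollection: true);

        // Roughly 24 bytes per empty object on 64-bit .NET.
        Console.WriteLine((after - before) / N);
        GC.KeepAlive(keep);
    }
}
```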

If you are using Lists, remember that Lists grow by doubling, so you could be wasting a lot of space. Other .NET collections grow in the same way.
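You can watch the doubling directly via `List<T>.Capacity` (the exact numbers are an implementation detail; in current .NET the first allocation is 4 elements and each subsequent growth doubles):

```csharp
using System;
using System.Collections.Generic;

class GrowthDemo
{
    static void Main()
    {
        var list = new List<int>();
        int lastCapacity = -1;

        for (int i = 0; i < 100; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                // Each capacity change means a new backing array was
                // allocated and every existing element copied into it.
                Console.WriteLine(list.Capacity);
                lastCapacity = list.Capacity;
            }
        }
    }
}
```

After 100 adds the capacity is 128, so 28 slots are pure slack, and five discarded intermediate arrays have been handed to the GC along the way.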

I'll add that the "pure" overhead of millions of Lists could bring memory to its knees. Besides the 16 + 8 bytes of space "eaten" by the List object itself, it is composed (in the .NET implementation) of 2 ints (8 bytes), a SyncLock reference (8 bytes, normally null), and a reference to the internal array (so another 8 bytes, plus the array's own 16 bytes of overhead, plus the array contents).
