
Size of different objects in memory

I have about 100,000 sentences in a List<string>.

I'm trying to split each of these sentences into words and add everything into a List<List<string>>, where each inner List contains the words of one sentence. I'm doing that because I have to do different work on each individual word. What would be the size difference in memory between just a List<string> of sentences and a List<List<string>> of words?

One of these will eventually be stored in memory, so I'm looking at the memory impact of splitting each sentence versus keeping it as a single string.

So, first off we'll compare the difference in memory between a single string and two strings which, if concatenated together, would result in the first:

string first = "ab";

string second = "a";
string third = "b";

How much memory does first use compared to second and third together? Well, the actual characters that they need to reference are the same, but every single string object has a small overhead (14 bytes on a 32-bit system, 26 bytes on a 64-bit system).

So for each string that you break up into a List<string> of smaller strings, there is a 14 * (wordsPerSentence - 1) byte overhead.

Then there is the overhead for the list itself. The list will consume one word of memory (32 bits on a 32-bit system, 64 bits on a 64-bit system, etc.) for each item added to the list, plus the overhead of a List<string> itself (which is 24 bytes on a 32-bit system).

So for that you need to add (on a 32-bit system) (24 + (8 * averageWordsPerSentence)) * numberOfSentences bytes of memory.
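The two formulas above can be plugged together in a quick sketch. The sentence and word counts below are made-up example inputs (the question only says "about 100,000 sentences"), and the per-item and per-object byte figures are this answer's 32-bit estimates, not measured values:

```csharp
using System;

// Hypothetical inputs: 100,000 sentences, averaging 10 words each.
const long numberOfSentences = 100_000;
const long averageWordsPerSentence = 10;

// Extra string-object overhead: 14 bytes per additional string object
// created by splitting (one word becomes wordsPerSentence strings).
long extraStringOverhead = 14 * (averageWordsPerSentence - 1) * numberOfSentences;

// Per-sentence List<string> cost, using this answer's figures:
// 24 bytes for the list object plus 8 bytes per stored word reference.
long listOverhead = (24 + 8 * averageWordsPerSentence) * numberOfSentences;

long totalExtraBytes = extraStringOverhead + listOverhead;
Console.WriteLine(totalExtraBytes); // 23000000, i.e. roughly 22 MB of extra overhead
```

So under these assumptions, splitting costs on the order of 22 MB beyond the original sentence strings, before counting the character data itself (which is the same either way).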

We'll start with your List<string>. I'm going to assume the 64-bit runtime. Numbers for the 32-bit runtime are slightly smaller.

The List itself requires about 32 bytes (allocation overhead, plus internal variables), plus the backing array of strings. The array overhead is 50 bytes, and you need 8 bytes per string for the references. So if you have 100,000 sentences, you'll need at minimum 800,000 bytes for the array.

The strings themselves require something like 26 bytes each, plus two bytes per character. So if your average sentence is 80 characters, you need 186 bytes per string. Multiplied by 100K strings, that's about 18.5 megabytes. Altogether, your list of sentences will take around 20 MB (round number).

If you split the sentences into words, you now have 100,000 List<string> instances. That's about 5 megabytes just for the List<List<string>>. If we assume 10 words per sentence, then each sentence's list will require about 80 bytes for the backing array, plus 26 bytes per string (a total of about 260 bytes), plus the string data itself (8 chars per word, or 160 bytes total). So each sentence costs you (again, round numbers) 80 + 260 + 160, or 500 bytes. Multiplied by 100,000 sentences, that's 50 MB.

So, very rough numbers: splitting your sentences into a List<List<string>> will occupy 55 or 60 megabytes.
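The 64-bit arithmetic above can be written out explicitly. Everything here follows this answer's assumed figures (100,000 sentences, 80 characters each, 10 words of 8 characters, and the stated per-object overheads); none of it is measured:

```csharp
using System;

const long sentences = 100_000;

// Plain List<string>: per sentence, an 8-byte reference in the backing
// array, plus a string of ~26 bytes overhead and 2 bytes per character.
long plainListBytes = sentences * (8 + 26 + 2 * 80);

// Split version: ~5 MB for the outer List<List<string>> and its 100,000
// inner List objects, then per sentence roughly 80 bytes (inner backing
// array) + 260 bytes (10 string objects) + 160 bytes (character data).
long splitListBytes = 5_000_000 + sentences * (80 + 260 + 160);

Console.WriteLine($"{plainListBytes} vs {splitListBytes}");
// prints "19400000 vs 55000000" -- roughly 20 MB vs 55 MB
```

That ratio (roughly 3x) is the real takeaway; the absolute numbers shift with average sentence and word length.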

Unfortunately, this isn't a question that can be answered very easily -- it depends on the particular strings, and what lengths you're willing to go to in order to optimize.

For example, take a look at the String.Intern() method. If you intern all the words, it's possible that the collection of words will require less memory than the collection of sentences. It would depend on the contents. There are other implications to interning, though, so that might not be the best idea. Again, it would depend on the particulars of the situation -- check the "Performance Considerations" section of the doc page I linked.
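A minimal sketch of why interning can help here: two separately constructed but equal word strings collapse to a single shared instance after string.Intern, so repeated words cost only a reference each.

```csharp
using System;

// Build "cat" twice from char arrays so the compiler can't pre-intern
// them as one literal; these start out as two distinct string objects.
string a = string.Intern(new string(new[] { 'c', 'a', 't' }));
string b = string.Intern(new string(new[] { 'c', 'a', 't' }));

// After interning, both variables point at the same object on the
// intern pool, so the duplicate's character data can be collected.
Console.WriteLine(ReferenceEquals(a, b)); // True
```

With 100,000 sentences of natural language, the distinct-word vocabulary is far smaller than the total word count, which is exactly the situation where this pays off.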

I think the best thing to do is to use GC.GetTotalMemory(true) before and after your operation to get a rough idea of how much memory is actually being used.
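A sketch of that measurement, using a small hypothetical workload in place of your real sentence list (the repeated sentence and the 10,000 count are stand-ins). The result is a rough figure, since the GC and runtime can allocate on their own:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

long before = GC.GetTotalMemory(forceFullCollection: true);

// Stand-in workload: split a batch of sentences into word lists.
var sentences = Enumerable.Repeat("the quick brown fox jumps", 10_000).ToList();
List<List<string>> words = sentences
    .Select(s => s.Split(' ').ToList())
    .ToList();

long after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"Approximate bytes used: {after - before}");

// Keep the structure reachable so the second collection can't reclaim it.
GC.KeepAlive(words);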
