
Size of different objects in memory

I have about 100,000 sentences in a List<string>.

I'm trying to split each of these sentences into words and put everything into a List<List<string>>, where each inner List holds the words of one sentence. I'm doing that because I have to do different work on each individual word. What would be the difference in memory between just the List<string> of sentences and the List<List<string>> of words?

One of these will be stored in memory eventually, so I'm looking for the memory impact of splitting each sentence vs. keeping it as a single string.
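To illustrate, something along these lines (a sketch; LoadSentences() is just a placeholder for wherever the sentences come from, and the real splitting logic is more involved):

using System.Collections.Generic;
using System.Linq;

List<string> sentences = LoadSentences();   // ~100,000 sentences

List<List<string>> words = sentences
    .Select(s => s.Split(' ').ToList())     // one inner list of words per sentence
    .ToList();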

So, first off, we'll compare the difference in memory between a single string and two strings which, if concatenated together, would result in the first:

string first = "ab";

string second = "a";
string third = "b";

How much memory does first use compared to second and third together? Well, the actual characters they need to reference are the same, but every string object has a small overhead (14 bytes on a 32-bit system, 26 bytes on a 64-bit system).

So for each sentence that you break up into a List<string> of smaller strings there is a 14 * (wordsPerSentence - 1) byte overhead.

Then there is the overhead of the list itself. The list will consume one word of memory (32 bits on a 32-bit system, 64 bits on a 64-bit system, etc.) for each item added to the list, plus the overhead of a List<string> object itself (which is 24 bytes on a 32-bit system).

So for that you need to add (on a 32-bit system, where each reference is 4 bytes) (24 + (4 * averageWordsPerSentence)) * numberOfSentences bytes of memory.
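Putting both formulas together as a quick back-of-the-envelope calculation (32-bit figures; the sentence and word counts below are assumptions for illustration only):

int numberOfSentences = 100000;
int averageWordsPerSentence = 10;

// Extra string-object overhead: 14 bytes per additional string object.
long stringOverhead = 14L * (averageWordsPerSentence - 1) * numberOfSentences;

// Per-sentence List<string> overhead: 24 bytes per list plus one 4-byte
// reference (one machine word on 32-bit) per word.
long listOverhead = (24L + 4L * averageWordsPerSentence) * numberOfSentences;

Console.WriteLine((stringOverhead + listOverhead) / (1024.0 * 1024.0) + " MB extra");

With those assumed numbers that works out to roughly 19 MB of additional overhead, before counting the character data itself (which is the same either way).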

We'll start with your List<string> . I'm going to assume the 64-bit runtime. Numbers for the 32-bit runtime are slightly smaller.

The List itself requires about 32 bytes (allocation overhead, plus internal variables), plus the backing array of strings. The array overhead is 50 bytes, and you need 8 bytes per string for the references. So if you have 100,000 sentences, you'll need at minimum 800,000 bytes for the array.

The strings themselves require something like 26 bytes each, plus two bytes per character. So if your average sentence is 80 characters, you need 186 bytes per string. Multiplied by 100K strings, that's about 18.5 megabytes. Altogether, your list of sentences will take around 20 MB (round number).

If you split the sentences into words, you now have 100,000 List<string> instances. That's about 5 megabytes just for the List<List<string>>. If we assume 10 words per sentence, then each sentence's list will require about 80 bytes for the backing array, plus 26 bytes per string (a total of about 260 bytes), plus the string data itself (8 chars per word, or about 160 bytes total). So each sentence costs you (again, round numbers) 80 + 260 + 160, or 500 bytes. Multiplied by 100,000 sentences, that's 50 MB.

So, very rough numbers, splitting your sentences into a List<List<string>> will occupy 55 or 60 megabytes.
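If you want to redo that arithmetic with your own numbers, the estimate above translates into something like this (all the per-object sizes are the 64-bit approximations used above, and the averages are assumptions):

const int sentences = 100000;
const int charsPerSentence = 80;   // assumed average sentence length
const int wordsPerSentence = 10;   // assumed average word count
const int charsPerWord = 8;        // assumed average word length

// List<string> of whole sentences.
long flatList    = 32 + 50 + 8L * sentences;                   // list + array + references
long flatStrings = (26L + 2L * charsPerSentence) * sentences;  // string objects + char data
Console.WriteLine("Sentences only: ~" + (flatList + flatStrings) / 1e6 + " MB");

// List<List<string>> of words.
long perSentence = 8L * wordsPerSentence                       // inner backing array references
                 + 26L * wordsPerSentence                      // per-word string overhead
                 + 2L * charsPerWord * wordsPerSentence;       // per-word char data
long nested = 5000000 + perSentence * sentences;               // ~5 MB for the lists themselves
Console.WriteLine("Split into words: ~" + nested / 1e6 + " MB");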

Unfortunately, this isn't a question that can be answered very easily -- it depends on the particular strings, and what lengths you're willing to go to in order to optimize.

For example, take a look at the String.Intern() method. If you intern all the words, it's possible that the collection of words will require less memory than the collection of sentences. It would depend on the contents. There are other implications to interning, though, so that might not be the best idea. Again, it would depend on the particulars of the situation -- check the "Performance Considerations" section of the doc page I linked.
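A sketch of what interning the words might look like (whether it actually saves anything depends entirely on how much repetition there is in your text):

// Intern each word so repeated words share a single string instance.
List<List<string>> words = sentences
    .Select(s => s.Split(' ')
                  .Select(w => string.Intern(w))
                  .ToList())
    .ToList();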

I think the best thing to do is to use GC.GetTotalMemory(true) before and after your operation to get a rough idea of how much memory is actually being used.
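Something along these lines (a rough sketch; forcing a full collection before each reading makes the numbers a little more stable):

long before = GC.GetTotalMemory(true);   // force a collection, then read the heap size

List<List<string>> words = sentences
    .Select(s => s.Split(' ').ToList())
    .ToList();

long after = GC.GetTotalMemory(true);
Console.WriteLine("Approximate additional memory: " + (after - before) / (1024.0 * 1024.0) + " MB");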
