
How to reduce memory size for data in C++?

I am working in C++ and using a multimap for storing data.

 struct data
 {
      char* value1;
      char* value2;

      data(char* _value1, char* _value2)
      {
           int len1 = strlen(_value1);
           value1 = new char[len1+1];
           strcpy(value1,_value1);

           int len2 = strlen(_value2);
           value2 = new char[len2+1];
           strcpy(value2,_value2);
      }
      ~data()
      {
           delete[] value1;
           delete[] value2;
      }
 };

 struct ltstr
 {
     bool operator()(const char* s1, const char* s2) const
     {
          return strcmp(s1, s2) < 0;
     }
 };


 multimap <char*, data*, ltstr> m;

Sample Input:

  Key               Value
  ABCD123456        Data_Mining Indent Test Fast Might Must Favor List Myself Janki Jyoti Sepal Petal Catel Katlina Katrina Tesing Must Motor blah blah.
  ABCD123456        Datfassaa_Minifasfngf Indesfsant Tfdasest Fast Might Must Favor List My\\fsad\\\self Jfasfsa Katrifasdna Tesinfasfg Must Motor blah blah.
  tretD152456       fasdfa fasfsaDfasdfsafata_Mafsfining Infdsdent Tdfsest Fast Might Must Favor List Myself Janki

There are 27 million entries in the input. Input size = 14GB.

But I noticed that memory consumption reaches 56GB. How can I reduce the memory usage?

If you can't reduce the amount of data you're actually storing, you might want to try to use a different container with less overhead (map and multimap have quite a bit) or find a way to keep only part of the data in memory.
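
One concrete possibility (a sketch only, using std::string for the key and value to keep it self-contained; this particular container is not named above): keep the records in a sorted std::vector and binary-search it with std::equal_range, which avoids the per-node allocation and pointer overhead of a tree-based multimap.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Illustrative record type; in the real program the second member would be
// the data payload from the question.
typedef std::pair<std::string, std::string> Record;   // key -> value

bool by_key(const Record& a, const Record& b) { return a.first < b.first; }

std::vector<Record> records;

// After filling `records`, sort it once:
//     std::sort(records.begin(), records.end(), by_key);
// All values for one key can then be found without any per-node overhead:
std::pair<std::vector<Record>::iterator, std::vector<Record>::iterator>
find_all(const std::string& key)
{
    return std::equal_range(records.begin(), records.end(),
                            Record(key, std::string()), by_key);
}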


One possibility would be to use a std::map<char *, std::vector<data> > instead of a multimap. In a multimap, you're storing the key string in each entry. With a map you'd have only one copy of the key string, with multiple data items attached to it.
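
A minimal sketch of that layout (using std::string members instead of the raw char* from the question, purely for illustration):

#include <map>
#include <string>
#include <vector>

// Simplified value type for the sketch; not the exact struct from the question.
struct data_v
{
    std::string value1;
    std::string value2;
};

std::map<std::string, std::vector<data_v> > grouped;

void add_entry(const std::string& key, const data_v& d)
{
    // Each distinct key string is stored only once; a repeated key simply
    // appends another value to that key's vector.
    grouped[key].push_back(d);
}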

The first optimization would be to store data objects instead of pointers:

std::multimap<char*, data, ltstr> m;

because using data* adds additional memory overhead for each allocation.
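
One caveat worth spelling out (my addition, not part of the answer): to store data by value, the struct needs proper copy semantics, because it owns raw pointers and has a destructor. A sketch of what the rule-of-three version might look like (the copy_str helper is illustrative):

#include <cstring>

struct data
{
    char* value1;
    char* value2;

    data(const char* v1, const char* v2)
        : value1(copy_str(v1)), value2(copy_str(v2)) {}

    // Deep copy, so container-made copies don't share (and double-delete)
    // the same buffers.
    data(const data& other)
        : value1(copy_str(other.value1)), value2(copy_str(other.value2)) {}

    data& operator=(const data& other)
    {
        if (this != &other)
        {
            char* v1 = copy_str(other.value1);
            char* v2 = copy_str(other.value2);
            delete[] value1;
            delete[] value2;
            value1 = v1;
            value2 = v2;
        }
        return *this;
    }

    ~data()
    {
        delete[] value1;
        delete[] value2;
    }

private:
    static char* copy_str(const char* s)
    {
        char* p = new char[std::strlen(s) + 1];
        std::strcpy(p, s);
        return p;
    }
};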

Another is to use a pool allocator / memory pool to reduce the footprint of dynamic memory allocation.
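
For example (a sketch only; the answer does not name a specific library, and this assumes Boost.Pool is available and that data can be stored by value as above):

#include <map>
#include <boost/pool/pool_alloc.hpp>   // Boost.Pool, assumed available

// The multimap draws its internal nodes from a pool instead of making one
// general-purpose heap allocation per node.
typedef std::multimap<
    char*, data, ltstr,
    boost::fast_pool_allocator<std::pair<char* const, data> >
> pooled_multimap;

pooled_multimap pm;   // used the same way as the original multimap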

If you have many identical key strings, you can improve that too, if you can reuse the keys.

Without seeing some of your data, there are several things that could improve the memory usage of your project.

First, as Olaf suggested, store the data object in the multimap instead of a pointer to it. I don't suggest using a pool for your data structure, though; it just complicates things without saving memory compared to storing the objects directly in the map.

What you could do, though, is use a specialized allocator for your map that allocates std::pair<char*, data> objects. This could save some overhead and reduce heap fragmentation.

Next, the main thing you should focus on is to try to get rid of the two char* pointers in your data. With 14 gigs of data, there has to be some overlap. Depending on what data it is, you could store it a bit differently.

For example, if the data consists of names or keywords, then it would make sense to store them in a central hash. Yes, there are more sophisticated solutions, like a DAWG as suggested above, but I think one should try the simple solutions first.

By simply storing the strings in a std::set<std::string> and keeping iterators to them, you would collapse all duplicates, which would save a lot of memory. This assumes, though, that you don't remove strings. Removing strings would require reference counting, so you would use something like std::map<std::string, unsigned long>. I suggest you write a class that inherits from / contains this hash rather than putting the reference-counting logic into your data class.

If the data you are storing does not have much overlap, however, e.g. because it is binary data, then I suggest you store it in a std::string or std::vector<char> instead. The reason is that you can then get rid of the logic in your data structure entirely and even replace it with a std::pair.

I'm also assuming that your key is not one of your pointers you are storing in your data structure. If it is, definitely get rid of it and use the first attribute of the std::pair in your multimap.

Further improvements might be possible depending on what type of data you are storing.

So, with a lot of assumptions that probably don't apply to your data you could have as little as this:

typedef std::set<std::string> StringMap;
typedef StringMap::const_iterator StringRef;
typedef std::multimap<StringRef, std::pair<StringRef, StringRef>> DataMap;
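
One detail worth adding (not in the answer itself): set iterators are not ordered by operator<, so the DataMap typedef also needs a comparator that compares the pooled strings. A sketch of that, plus how inserting a record might look (names are illustrative):

#include <map>
#include <set>
#include <string>
#include <utility>

typedef std::set<std::string> StringMap;
typedef StringMap::const_iterator StringRef;

// Order entries by the pooled string, not by the iterator itself.
struct ltref
{
    bool operator()(StringRef a, StringRef b) const { return *a < *b; }
};

typedef std::multimap<StringRef, std::pair<StringRef, StringRef>, ltref> DataMap;

StringMap strings;   // one copy of every distinct string
DataMap entries;

void add_record(const std::string& key,
                const std::string& v1,
                const std::string& v2)
{
    // set::insert returns an iterator to the (possibly pre-existing) element,
    // so duplicate strings are stored only once.
    StringRef k  = strings.insert(key).first;
    StringRef r1 = strings.insert(v1).first;
    StringRef r2 = strings.insert(v2).first;
    entries.insert(std::make_pair(k, std::make_pair(r1, r2)));
}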

I would suspect that you're leaking or unnecessarily duplicating memory in the keys. Where do the key char * strings come from and how do you manage their memory?

If they are the same string(s) as those in the data object, consider using a multiset<data*, ltdata> instead of a multimap.
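
ltdata is not defined in the question; a sketch of what it could look like, assuming value1 doubles as the key string:

#include <cstring>
#include <set>

// Hypothetical comparator: order data objects by the string they contain.
struct ltdata
{
    bool operator()(const data* a, const data* b) const
    {
        return std::strcmp(a->value1, b->value1) < 0;
    }
};

std::multiset<data*, ltdata> m2;   // the key now lives inside data, not beside it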

If there are many duplicate strings, consider pooling strings in a set<char*, ltstr> to eliminate duplicates.
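
A rough sketch of that pooling idea, reusing the ltstr comparator from the question (the intern function name is illustrative):

#include <cstring>
#include <set>

std::set<const char*, ltstr> string_pool;

// Returns a pooled copy of s; identical strings end up sharing one allocation.
const char* intern(const char* s)
{
    std::set<const char*, ltstr>::iterator it = string_pool.find(s);
    if (it != string_pool.end())
        return *it;
    char* copy = new char[std::strlen(s) + 1];
    std::strcpy(copy, s);
    string_pool.insert(copy);
    return copy;
}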

I'm still not entirely sure what is going on here, but it seems that memory overhead is at least some portion of the problem. However, the overall memory consumption is about 4x that which is needed for the data structure. There are approximately 500 bytes per record if there are 27M records taking up 14GB, yet the space taken up is 56GB. To me, this indicates that there is either more data stored than we're shown here, or at least some of the data is stored more than once.

And the "extra data for heap storage" isn't really doing it for me. In linux, a memory allocation takes somewhere around 32 bytes of data minimum. 16 bytes of overhead, and the memory allocated itself takes up a multiple of 16 bytes.

So for one data * record stored in the multimap, we need:

 16 bytes of header for the memory allocation
 8 bytes for pointer of `value1`
 8 bytes for pointer of `value2`
 16 bytes of header for the string in value1
 16 bytes of header for the string in value2
 8 bytes (on average) "size rounding" for string in value1
 8 bytes (on average) "size rounding" for string in value2

 ?? bytes from the file. (X)

 80 + X bytes total. 

We then have char * in the multimap:

 16 bytes of header for the memory allocation. 
 8 bytes of rounding on average. 

 ?? bytes from the file. (Y)

 24 + Y bytes total. 

Each node of the multimap will have two pointers (I'm assuming it's some sort of binary tree):

 16 bytes of header for the memory allocation of the node. 
 8 bytes of pointer to "left"
 8 bytes of pointer to "right"

 32 bytes total. 

So, that makes 136 bytes of "overhead" per entry in the file. For 27M records, that is roughly 4GB.

The file, as I said, contains about 500 bytes per entry, which makes 14GB.

That's a total of 18GB.

So, somewhere, something is either leaking, or the math is wrong. I may be off in my calculations here, but even if everything above takes double the space I've calculated, there's STILL 20GB unaccounted for.

There are certainly some things we could do to save memory:

1) Don't allocate TWO strings in data. Calculate both lengths first, allocate one lump of memory, and store the strings immediately after each other:

  data(char* _value1, char* _value2)
  {
       int len1 = strlen(_value1);
       int len2 = strlen(_value2);
       // One allocation holds both strings, back to back, each NUL-terminated.
       value1 = new char[len1 + len2 + 2];
       strcpy(value1, _value1);

       value2 = value1 + len1 + 1;   // points into the same block
       strcpy(value2, _value2);
       // Note: the destructor must now delete[] only value1.
  }

That would save on average 24 bytes per entry. We could possibly save even more by being clever and allocating the memory for data, value1 and value2 all at once. But that could be a little "too clever".
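
For completeness, a sketch of that "too clever" variant (purely illustrative and not part of the answer's code: a separate struct with a factory function, so the header and both strings share one allocation):

#include <cstring>
#include <new>

// Everything lives in one block: the struct itself, then both strings.
struct packed_data
{
    char* value1;
    char* value2;

    static packed_data* create(const char* v1, const char* v2)
    {
        std::size_t len1 = std::strlen(v1);
        std::size_t len2 = std::strlen(v2);
        char* block = new char[sizeof(packed_data) + len1 + len2 + 2];
        packed_data* d = new (block) packed_data;   // placement-new the header
        d->value1 = block + sizeof(packed_data);
        d->value2 = d->value1 + len1 + 1;
        std::strcpy(d->value1, v1);
        std::strcpy(d->value2, v2);
        return d;
    }

    static void destroy(packed_data* d)
    {
        d->~packed_data();
        delete[] reinterpret_cast<char*>(d);
    }
};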

2) Allocating a large slab of data items and doling them out one at a time would also help. For this to work, we need an empty constructor and a "set_values" method:

struct data
{
    ...
    data() {}
    ...
    void set_values(const char* _value1, const char* _value2)
    {
         int len1 = strlen(_value1);
         int len2 = strlen(_value2);
         // Same trick as in 1): both strings share one allocation.
         value1 = new char[len1 + len2 + 2];
         strcpy(value1, _value1);

         value2 = value1 + len1 + 1;
         strcpy(value2, _value2);
    }
};

std::string v1[100], v2[100], key[100];

int i;
for(i = 0; i < 100; i++)
{
    if (!read_line_from_file(key[i], v1[i], v2[i]))
    {
        break;
    }
}

// One allocation gives us a whole slab of data objects.
data* data_block = new data[i];

for(int j = 0; j < i; j++)
{
    data_block[j].set_values(v1[j].c_str(), v2[j].c_str());
    // assumes the multimap's key type accepts const char*
    m.insert(std::make_pair(key[j].c_str(), &data_block[j]));
}

Again, this wouldn't save a HUGE amount of memory, but each 16 byte region saves SOME memory. The above is of course not complete code, and more of an "illustration of how it could be done".

3) I'm still not sure where the "Key" comes from in the multimap, but if the key is one of the value1 and value2 entries, then you could reuse one of those, rather than storing another copy [assuming that's how it's done currently].

I'm sorry if this isn't a true answer, but I do believe that it is an answer in the sense that "somewhere, something is unaccounted for in your explanation of what you are doing".

Understanding what allocations are made in your program would definitely help.
