C++ insane memory consumption on large file

I am loading a 10 GB file into memory, and I find that even if I strip away any extra overhead and store the data in nothing but arrays, it still takes up 53 GB of RAM. This seems crazy to me, since I am converting some of the text data to long longs, which should take up less room, and converting the rest to char *, which should take up the same amount of room as the text file. I have about 150M rows of data in the file I am trying to load. Is there any reason why this should take up so much RAM when I load it the way I do below?

There are three files here: a fileLoader class, its header file, and a main that simply runs them. To answer some questions: the OS is Ubuntu 12.04 64-bit. This is on a machine with 64 GB of RAM and an SSD on which I have provided 64 GB of swap space. I am loading all of the data at once because of the need for speed; it is critical for the application. All sorting, indexing, and much of the data-intensive work runs on the GPU. The other reason is that loading all of the data at once made the code much simpler to write: I don't have to worry about indexed files or mappings to locations in another file, for example.

Here is the header file:

#ifndef FILELOADER_H_
#define FILELOADER_H_
#include <iostream>
#include <fstream>

#include <fcntl.h>
#include <unistd.h>  // read(), used by countLines
#include <stdlib.h>
#include <string.h>
#include <string>

class fileLoader {
public:
    fileLoader();
    virtual ~fileLoader();
    void loadFile();
private:
    long long ** longs;
    char *** chars;
    long count;
    long countLines(std::string inFile);
};


#endif /* FILELOADER_H_ */

Here is the CPP file:

#include "fileLoader.h"



fileLoader::fileLoader() {
    // TODO Auto-generated constructor stub
    this->longs = NULL;
    this->chars = NULL;
}

char ** split(char * line,const char * delim,int size){
    // Value-initialize so unused slots are NULL rather than garbage.
    char ** val = new char * [size]();

    int i = 0;
    char * curVal = strsep(&line,delim);
    // Stop after 'size' fields so a line with extra delimiters cannot overrun val.
    while(curVal != NULL && i < size){
        val[i] = curVal;
        i++;
        curVal = strsep(&line,delim);
    }

    return val;
}

void fileLoader::loadFile(){
    const char * fileName = "/blazing/final/tasteslikevictory";

    std::string fileString(fileName);
    //-1 since theres a header row and we are skipinig it
    this->count = countLines(fileString) -1;

    this->longs = new long long*[this->count];
    this->chars = new char **[this->count];
    std::ifstream inFile;

    inFile.open(fileName);
    if(inFile.is_open()){
        std::string line;
        int i =0;
        getline(inFile,line);
        while(getline(inFile,line)){
            this->longs[i] = new long long[6];
            this->chars[i] = new char *[7];
            char * copy = strdup(line.c_str());
            char ** splitValues = split(copy,"|",13);

            this->longs[i][0] = atoll(splitValues[4]);
            this->longs[i][1] = atoll(splitValues[5]);
            this->longs[i][2] = atoll(splitValues[6]);
            this->longs[i][3] = atoll(splitValues[7]);
            this->longs[i][4] = atoll(splitValues[11]);
            this->longs[i][5] = atoll(splitValues[12]);

            this->chars[i][0] = strdup(splitValues[0]);
            this->chars[i][1] = strdup(splitValues[1]);
            this->chars[i][2] = strdup(splitValues[2]);
            this->chars[i][3] = strdup(splitValues[3]);
            this->chars[i][4] = strdup(splitValues[8]);
            this->chars[i][5] = strdup(splitValues[9]);
            this->chars[i][6] = strdup(splitValues[10]);
            i++;
            delete[] splitValues;
            free(copy);
        }
    }
}

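fileLoader::~fileLoader() {
    // Assumes loadFile() filled all 'count' rows.
    // Row arrays came from new[], so they are released with delete[].
    if(this->longs != NULL){
        for(long i = 0; i < this->count; i++){
            delete[] this->longs[i];
        }
        delete[] this->longs;
    }

    if(this->chars != NULL){
        for(long i = 0; i < this->count; i++){
            // The seven strings in each row came from strdup, so they are freed.
            for(int j = 0; j < 7; j++){
                free(this->chars[i][j]);
            }
            delete[] this->chars[i];
        }
        delete[] this->chars;
    }

}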

long fileLoader::countLines(std::string inFile){
    const int BUFFER_SIZE = 16*1024;
    int fd = open(inFile.c_str(), O_RDONLY);
    if(fd == -1)
        return 0;

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    long lines = 0;

    // read() returns ssize_t: -1 on error, 0 at end of file.
    ssize_t bytes_read;
    while((bytes_read = read(fd, buf, BUFFER_SIZE)) > 0)
    {
        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }
    close(fd);

    // As before, report 0 lines if a read failed.
    if(bytes_read == -1)
        return 0;

    return lines;

}

Here is the file with my main function:

#include "fileLoader.h"

int main()
{

fileLoader loader;
loader.loadFile();
return 0;
}

Here is an example of the data that I am loading:

13|0|1|1997|113|1|4|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
14|0|1|1997|113|1|5|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
15|0|1|1997|113|1|6|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
16|0|1|1997|113|1|7|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
17|0|1|1997|113|1|8|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
18|0|1|1997|113|1|9|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
19|0|1|1997|113|1|10|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
20|0|1|1997|113|1|11|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
21|0|1|1997|113|1|12|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
9|0|1|1997|113|1|13|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
27|0|1|1992|125|1|1|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
28|0|1|1992|125|1|2|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
29|0|1|1992|125|1|3|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
30|0|1|1992|125|1|4|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
31|0|1|1992|125|1|5|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
32|0|1|1992|125|1|6|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
33|0|1|1992|125|1|7|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
34|0|1|1992|125|1|8|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
35|0|1|1992|125|1|9|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
36|0|1|1992|125|1|10|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
37|0|1|1992|125|1|11|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
38|0|1|1992|125|1|12|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
39|0|1|1992|125|1|13|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
40|0|1|1992|125|1|14|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
41|0|1|1992|125|1|15|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
10|0|1|1996|126|1|1||||||

You are allocating nine chunks of memory for each line, so you are allocating a total of 1,350 million pieces of memory. Each allocation carries a certain overhead, usually at least twice the size of a pointer, possibly even more. On a 64-bit machine that is already 16 bytes per allocation, so you get 21.6 GB of overhead.

In addition to that, you get the overhead of heap fragmentation and alignment: even if you only ever store a string in it, the allocator has to align each allocation so that you could store the largest possible values in it without triggering misalignment. The required alignment may depend on the vector unit of your CPU, which can demand quite large alignments; 16-byte alignment is not uncommon.
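As a rough illustration of the alignment effect, the sketch below rounds a request up to the next multiple of the alignment. The helper name `round_up_to_alignment` and the 16-byte alignment are assumptions for illustration only. Real allocators also add their own bookkeeping and enforce minimum chunk sizes, so this is a simplified model rather than a description of any particular malloc.

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of `alignment` that can hold
// `bytes`. Assumes `alignment` is non-zero.
constexpr std::size_t round_up_to_alignment(std::size_t bytes,
                                            std::size_t alignment) {
    return (bytes + alignment - 1) / alignment * alignment;
}

// A one-character field such as "0" needs 2 bytes including its NUL.
constexpr std::size_t kTinyField = round_up_to_alignment(2, 16);   // 16
// A 32-character hash field needs 33 bytes including its NUL.
constexpr std::size_t kHashField = round_up_to_alignment(33, 16);  // 48
```

Under this model the shortest fields waste the most: a 2-byte string still occupies a full 16-byte slot before any allocator header is counted.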

Doing the calculation with 16 bytes of allocation overhead and 16 bytes of alignment per allocation, we get 43.2 GB of allocations without the original data. Add the original data and this calculation is already very close to your measurement.
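The arithmetic behind these figures can be written out as compile-time constants. The row count and the nine allocations per row come from the question's code (two row arrays plus seven strdup copies). The two 16-byte figures are the assumptions made in this estimate, not measured values, and the constant names are made up for this sketch.

```cpp
#include <cstdint>

constexpr std::uint64_t kRows           = 150000000ULL; // about 150M lines
constexpr std::uint64_t kAllocsPerRow   = 9;   // 2 row arrays + 7 strdup copies
constexpr std::uint64_t kHeaderBytes    = 16;  // assumed bookkeeping per allocation
constexpr std::uint64_t kAlignmentBytes = 16;  // assumed alignment cost per allocation

// 1,350,000,000 live allocations in total.
constexpr std::uint64_t kAllocations = kRows * kAllocsPerRow;

// 21,600,000,000 bytes (21.6 GB) of bookkeeping alone.
constexpr std::uint64_t kHeaderTotal = kAllocations * kHeaderBytes;

// 43,200,000,000 bytes (43.2 GB) once alignment is included.
constexpr std::uint64_t kTotalOverhead =
    kAllocations * (kHeaderBytes + kAlignmentBytes);
```

Adding the roughly 10 GB of actual data to `kTotalOverhead` gives about 53 GB, which is the figure reported in the question.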

Each of those objects and strings you create has its own memory-management overhead. So when you load the string "0" from column 2, depending on your memory manager, it probably takes between two and four full words (it could be more). Call it 16 to 32 bytes of storage to hold a one-byte string. Then you load the "1" from column 3, and so on.
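One way to see this on a glibc system, such as the Ubuntu machine in the question, is to ask the allocator how much room it actually set aside for a tiny string. The sketch below assumes glibc's `malloc_usable_size` extension from `<malloc.h>`, and the helper name `usable_bytes_for` is made up for this example. The result depends on the allocator and platform, so treat it as an experiment rather than a fixed number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string.h>   // strdup (POSIX)
#include <malloc.h>   // malloc_usable_size: glibc extension, not standard C++

// Hypothetical helper: copies `text` with strdup, as the loader does, and
// returns how many usable bytes the allocator set aside for that copy.
// The allocator's own header comes on top and is not included here.
std::size_t usable_bytes_for(const char* text) {
    assert(text != nullptr);
    char* copy = strdup(text);
    if (copy == nullptr) {
        return 0;  // allocation failed
    }
    std::size_t usable = malloc_usable_size(copy);
    std::free(copy);
    return usable;
}
```

A call such as `usable_bytes_for("0")` requests only 2 bytes (the character plus its terminating NUL), but the value reported is normally larger, and the same rounding applies to each of the seven strings stored per row.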
