
Read in large CSV file performance issue in C++

I need to read in many big CSV files to process in C++ (ranging from a few MB to hundreds of MB). At first, I open the file with fstream, use getline to read each line, and use the following function to split each row:

template <class ContainerT>
void split(ContainerT& tokens, const std::string& str,
           const std::string& delimiters = " ", bool trimEmpty = false)
{
    std::string::size_type pos, lastPos = 0, length = str.length();

    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;

    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos)
        {
            pos = length;
        }

        if (pos != lastPos || !trimEmpty)
            tokens.push_back(value_type(str.data() + lastPos,
                                        (size_type)pos - lastPos));

        lastPos = pos + 1;
    }
}

I tried boost::split, boost::tokenizer and boost::sprint and found that the above gives the best performance so far. After that, I considered reading the whole file into memory for processing rather than keeping the file open, and I use the following function to read in the whole file:

void ReadinFile(string const& filename, stringstream& result)
{
    ifstream ifs(filename, ios::binary | ios::ate);
    ifstream::pos_type pos = ifs.tellg();

    //result.resize(pos);
    char * buf = new char[pos];
    ifs.seekg(0, ios::beg);
    ifs.read(buf, pos);
    result.write(buf, pos);
    delete[] buf;
}

Both functions were copied from somewhere on the net. However, I find that there is not much difference in performance between keeping the file open and reading in the whole file. The performance figures are as follows:

Process 2100 files with boost::split (without read in whole file) 832 sec
Process 2100 files with custom split (without read in whole file) 311 sec
Process 2100 files with custom split (read in whole file) 342 sec

Below please find the sample content of one type of file. I have 6 types to handle, but all are similar.

a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0
a1,10,2,5,5,1,1,2,0,0,12,0,50,18,106,33,100,29,45,9,8,0,1,1,0,0,0
a1,19,3,5,5,1,1,3,0,0,18,0,12,12,52,40,82,49,63,41,23,16,8,2,0,0,0
a1,28,4,5.5,5,1,1,4,0,0,24,0,2,3,17,16,53,53,63,62,43,44,18,22,4,0,4
a1,37,5,3,5,1,1,5,0,0,6,0,157,22,129,18,57,11,6,0,0,0,0,0,0,0,0
a1,46,6,4.5,5,1,1,6,0,0,12,0,41,19,121,31,90,34,37,15,6,4,0,2,0,0,0
a1,55,7,5.5,5,1,1,7,0,0,18,0,10,9,52,36,86,43,67,38,31,15,5,7,1,0,1
a1,64,8,5.5,5,1,1,8,0,0,24,0,0,3,18,23,44,55,72,57,55,43,8,19,1,2,3
a1,73,9,3.5,5,1,1,9,1,0,6,0,149,17,145,21,51,8,8,1,0,0,0,0,0,0,0
a1,82,10,4.5,5,1,1,10,1,0,12,0,47,17,115,35,96,36,32,10,8,3,1,0,0,0,0

My questions are:

1. Why does reading in the whole file perform worse than not reading in the whole file?

2. Is there any better string split function?

3. The ReadinFile function needs to read into a buffer and then write to a stringstream for processing; is there any method to avoid this, i.e. read directly into the stringstream?

4. I need to use getline to parse each line (on '\n') and use split to tokenize each row; is there any function similar to getline that works on a string (e.g. getline_str), so that I can read from a string directly?

5. How about reading the whole file into a string, splitting the whole string into a vector on '\n', and then splitting each string in the vector on ',' to process? Will this perform better? And what is the limit (max size) of a string?

6. Or should I define a struct like this (based on the format)

struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

and read directly into a vector? How can I do this?

Thanks a lot.

Regds

LAM Chi-fung

Whenever you have to care about performance, it's good to try alternatives and measure their performance. Below is some help implementing one option you ask about in your question....

Given each structure you want to read, such as your example...

struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

...you can read and parse the fields using fscanf. Unfortunately, it's a C library function that doesn't support std::strings, so you'll need to create a character array buffer for each string field, then copy from there to your structure's field. All up, something like:

char Item1[4096];
MyStruct m;
std::vector<MyStruct> myStructs;
FILE* stream = fopen(filename, "r");
assert(stream);
while (fscanf(stream, "%[^,],%d,%d,%f,%d,%d,%d,%d...",
              Item1, &m.It2_3[0], &m.It2_3[1], &m.It4,
              &m.ItRemain[0], &m.ItRemain[1], &m.ItRemain[2], ...) == 27)
{
    myStructs.push_back(m);
    myStructs.back().Item1 = Item1;  // fix the std::strings
}
fclose(stream);

(just put the right number of %ds in the format string and complete the other ItRemain indices).


Separately, I'm reluctant to recommend it as it's more advanced programming you may struggle with, but memory mapping the file and writing your own parsing has a good chance of being several times faster than the fscanf approach above (but again, you won't know until it's measured on your hardware). If you're a scientist trying to do something serious, maybe pair with a professional programmer to get this done for you.
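
For reference, here is a minimal sketch of the memory-mapping idea on a POSIX system (on Windows the equivalent calls are CreateFileMapping/MapViewOfFile). The function name and the field-parsing step are placeholders, not a tested implementation, and empty files are not handled:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Sketch only: map the file read-only and walk it line by line.
int ParseMappedFile(const char* filename)   /* hypothetical helper */
{
    int fd = open(filename, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { perror("fstat"); close(fd); return 1; }

    char* data = static_cast<char*>(
        mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const char* p   = data;
    const char* end = data + sb.st_size;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        if (!nl) nl = end;
        // ... parse the record in [p, nl): scan for ',' and convert fields ...
        p = nl + 1;
    }

    munmap(data, sb.st_size);
    close(fd);
    return 0;
}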

One basic consideration when trying to craft a fast input routine is to avoid reading and handling each character from the file more than once. Granted, this is not possible when converting to a numeric value, as the conversion routine will rescan the characters, but on balance that is the goal. You should also try to limit the number of function calls and as much other overhead as possible. When manipulating fields greater than 16-32 chars, the string and conversion function optimizations will almost always outperform what you write on your own, but for smaller fields that's not always true.

As far as buffer size goes, the C/C++ library will provide a default read buffer derived from IO_BUFSIZ in the gcc source. The constant is available as BUFSIZ in C/C++ (with gcc it is 8192 bytes, with VS cl.exe it is 512 bytes). So when reading from the file, the I/O functions will have BUFSIZ chars available for use without going back to the disk. You can use this to your advantage. So whether you are processing a character at a time, or reading from the file into a 100k sized buffer, the number of disk I/O calls will be the same. (This is a bit counter-intuitive.)
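
If you want to experiment with a read buffer larger than BUFSIZ, setvbuf lets you supply your own before the first read. A minimal sketch follows; the 64 KB size, the helper name and the static buffer (which means only one stream at a time should use this helper) are all arbitrary choices, not part of the original answer:

#include <cstdio>

FILE* open_buffered(const char* filename)   /* hypothetical helper */
{
    static char iobuf[1 << 16];             /* 64 KB, fully buffered */
    FILE* fp = std::fopen(filename, "r");
    if (fp)
        std::setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);
    return fp;
}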

Reading into a buffer and then calling strtok or sscanf is efficient, but when trying to eke every bit of speed out of your read, both involve traversing the characters you have already read at least a second time, and with the conditionals and checks both provide, you may be able to do a bit better.

I agree with Tony's answer whole-heartedly; you will just have to try different approaches, timing each, to determine what combination will work best for your data.

Looking at your data, with a short char label and then mixed float and int values (of 1000 or less) to the end of each record, one optimization that comes to mind is to simply handle the label and then treat the remaining values as float. The float representation of integers will be exact over the range of your values, so you can essentially handle the read and conversion (and storage) in a simplified form.

Assuming you do not know the number of records you have, nor the number of fields following each label, you need to start with a fairly generic read that will dynamically allocate storage for records as required, and also, for the first record, allocate storage for as many fields as may be required until you have determined the number of fields in each record. From that point on, you can allocate an exact number of fields for each record, and validate that each record has the same number of fields.

Since you are looking for speed, a simple C routine to read and allocate storage may provide advantages over a C++ implementation (it will certainly minimize the allocation for storage).

As a first attempt, I would approach the reading of the file with a character-oriented function like fgetc, relying on the underlying BUFSIZ read-buffer to efficiently handle the disk I/O, and then simply write a state loop to parse the values from each record into a struct for storage.

A short example for you to test and compare with your other routines would be similar to the following. If you are on a Unix/Linux box, you can use clock_gettime for nanosecond timing; on Windows, you will need QueryPerformanceCounter for microsecond timing. The read routine itself could be:

#include <stdio.h>
#include <stdlib.h>     /* for calloc, strtof */
#include <string.h>     /* for memset */
#include <errno.h>      /* strtof validation */

#define LABEL      3    /* label length (+1 for nul-character) */
#define NRECS      8    /* initial number of records to allocate */
#define NFLDS  NRECS    /* initial number of fields to allocate */
#define FLDSZ     32    /* max chars per-field (to size buf) */

typedef struct {
    char label[LABEL];  /* label storage */
    float *values;      /* storage for remaining values */
} record_t;

/* realloc function doubling size allocated */
void *xrealloc (void *ptr, size_t psz, size_t *nelem);

int main (int argc, char **argv) {

    int lblflag = 1, n = 0; /* label flag, index for buf */
    size_t col = 0,         /* column index */
           idx = 0,         /* record index */
           ncol = 0,        /* fixed number of cols - 1st rec determines */
           nflds = NFLDS,   /* tracks no. of fields allocated per-rec */
           nrec = NRECS;    /* tracks no. of structs (recs) allocated */
    char buf[FLDSZ] = "";   /* fixed buffer for field parsing */
    record_t *rec = NULL;   /* pointer to record_t structs */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin; /* file or stdin */

    if (!fp) {  /* validate file open for reading */
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    /* allocate/validate initial storage for nrec record_t */
    if (!(rec = calloc (nrec, sizeof *rec))) {
        perror ("calloc-rec");
        return 1;
    }

    /* allocate/validate initial storage for nflds values */
    if (!(rec[idx].values = calloc (nflds, sizeof *rec[idx].values))) {
        perror ("calloc-rec[idx].values");
        return 1;
    }

    for (;;) {                          /* loop continually until EOF */
        int c = fgetc (fp);             /* read char */
        if (c == EOF)                   /* check EOF */
            break;
        if (c == ',' || c == '\n') {    /* field separator or \n reached */
            char *p = buf;              /* ptr for strtof validation */
            buf[n] = 0;                 /* nul-terminate buf */
            n = 0;                      /* reset buf index zero */
            if (!lblflag) {             /* not lblflag (for branch prediction) */
                errno = 0;              /* zero errno */
                rec[idx].values[col++] = strtof (buf, &p);  /* convert buf */
                if (p == buf) {     /* p == buf - no chars converted */
                    fputs ("error: no characters converted.\n", stderr);
                    return 1;
                }
                if (errno) {        /* if errno - error during conversion */
                    perror ("strof-failed");
                    return 1;
                }
                if (col == nflds && !ncol)  /* realloc cols for 1st row as reqd */
                    rec[idx].values = xrealloc (rec[idx].values, 
                                            sizeof *rec[idx].values, &nflds);
            }
            else {                      /* lblflag set */
                int i = 0;
                do {    /* copy buf - less than 16 char, loop faster */
                    rec[idx].label[i] = buf[i];
                } while (buf[i++]);
                lblflag = 0;            /* zero lblflag */
            }
            if (c == '\n') {        /* if separator was \n */
                if (!ncol)          /* 1st record, set ncol from col */
                    ncol = col;
                if (col != ncol) {  /* validate remaining records against ncol */
                    fputs ("error: unequal columns in file.\n", stderr);
                    return 1;
                }
                col = 0;            /* reset col = 0 */
                lblflag = 1;        /* set lblflag 1 */
                idx++;              /* increment record index */
                if (idx == nrec)    /* check if realloc required */
                    rec = xrealloc (rec, sizeof *rec, &nrec);
                /* allocate values for next record based on now set ncol */
                if (!(rec[idx].values = calloc (ncol, sizeof *rec[idx].values))) {
                    perror ("calloc-rec[idx].values");
                    return 1;
                }
            }
        }
        else if (n + 1 < FLDSZ) /* normal char - check index will fit (leave room for nul) */
            buf[n++] = c;       /* add char to buf */
        else {  /* otherwise chars exceed FLDSZ, exit, fix */
            fputs ("error: chars exceed FLDSZ.\n", stderr);
            return 1;
        }
    }
    if (fp != stdin) fclose (fp);   /* close file if not stdin */
    /* add code to handle last field on non-POSIX EOF here */
    if (!*rec[idx].label) free (rec[idx].values);  /* free unused last alloc */

    printf ("records: %zu   cols: %zu\n\n", idx, ncol); /* print stats */

    for (size_t i = 0; i < idx; i++) {      /* output values (remove) */
        fputs (rec[i].label, stdout);
        for (size_t j = 0; j < ncol; j++)
            printf (" %3g", rec[i].values[j]);
        free (rec[i].values);               /* free values */
        putchar ('\n');
    }
    free (rec);     /* free structs */

    return 0;
}

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}

Example Use/Output

$ ./bin/readlargecsvbuf dat/large.csv
records: 10   cols: 26

a1   1   1 3.5   5   1   1   1   0   0   6   0 155  21 142  22  49   1   9   1   0   0   0   0   0   0   0
a1  10   2   5   5   1   1   2   0   0  12   0  50  18 106  33 100  29  45   9   8   0   1   1   0   0   0
a1  19   3   5   5   1   1   3   0   0  18   0  12  12  52  40  82  49  63  41  23  16   8   2   0   0   0
a1  28   4 5.5   5   1   1   4   0   0  24   0   2   3  17  16  53  53  63  62  43  44  18  22   4   0   4
a1  37   5   3   5   1   1   5   0   0   6   0 157  22 129  18  57  11   6   0   0   0   0   0   0   0   0
a1  46   6 4.5   5   1   1   6   0   0  12   0  41  19 121  31  90  34  37  15   6   4   0   2   0   0   0
a1  55   7 5.5   5   1   1   7   0   0  18   0  10   9  52  36  86  43  67  38  31  15   5   7   1   0   1
a1  64   8 5.5   5   1   1   8   0   0  24   0   0   3  18  23  44  55  72  57  55  43   8  19   1   2   3
a1  73   9 3.5   5   1   1   9   1   0   6   0 149  17 145  21  51   8   8   1   0   0   0   0   0   0   0
a1  82  10 4.5   5   1   1  10   1   0  12   0  47  17 115  35  96  36  32  10   8   3   1   0   0   0   0

This may or may not be significantly faster than what you are using, but it would be worth a comparison -- as I suspect it may provide a bit of improvement.

Eventually I used a memory mapped file to solve my problem; performance is much better than with fscanf. Since I work on MS Windows, I use Stephan Brumme's "Portable Memory Mapping C++ Class": http://create.stephan-brumme.com/portable-memory-mapping/ . Since I don't need to deal with file(s) > 2 GB, my implementation is simpler. For files over 2 GB, visit the web to see how to handle them.

Below please find my piece of code:

// may tried RandomAccess/SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

// point to start of memory file
char* start = (char*)MemFile.getData();
// dummy in my case
char* tmpBuffer = start;

// looping counter
uint64_t i = 0;

// pre-allocate result vector
MyVector.resize(300000);

// Line counter
int LnCnt = 0;

//no. of fields
const int NumOfField = 43;
//delimiter count, num of fields + 1 since the leading and trailing delimiters are virtual
const int DelimCnt = NumOfField + 1;
//Delimiter positions. May use new to allocate at run time
// or even use a vector of integers.
// This is to store the delimiter positions in each line;
// since the positions are relative to the start of the file, if the file is extremely
// large, may need to change from int to unsigned, long or even unsigned long long
static  int DelimPos[DelimCnt];

// Max number of fields to read; usually equal to NumOfField, but can be smaller, e.g. in my case I only need 4 fields
// from the first 15 fields, in which case 15 can be assigned to MaxFieldNeed
int MaxFieldNeed=NumOfField;
// keep track how many comma read each line
int DelimCounter=0;
// define field and line separator
char FieldDelim=',';
char LineSep='\n';

// 1st field, "virtual delimiter" position
DelimPos[DelimCounter]=-1;
DelimCounter++;

// loop through the whole memory field, 1 and only once
for (i = 0; i < MemFile.size();i++)
{
  // grab all position of delimiter in each line
  if ((MemFile[i] == FieldDelim) && (DelimCounter<=MaxFieldNeed)){
    DelimPos[DelimCounter] = i;
    DelimCounter++;
  };

  // grab all values when end of line hit
  if (MemFile[i] == LineSep) {
    // no need to use if (DelimCounter==NumOfField) just assign anyway, waste a little bit
    // memory in integer array but gain performance 
    DelimPos[DelimCounter] = i;
    // I know exactly what the format is and what field(s) I want
    // a more general approach (as a CSV reader) may put all fields
    // into vector of vector of string
    // With *EFFORT* one may modify this piece of code so that it can parse
    // different format at run time eg similar to:
    // fscanf(fstream,"%d,%f....
    // also, this piece of code cannot handle complex CSV e.g.
    // Peter,28,157CM
    // John,26,167CM
    // "Mary,Brown",25,150CM
    MyVector[LnCnt].StrField = string(start+DelimPos[0] + 1, start+DelimPos[1] - 1);
    MyVector[LnCnt].IntField = strtol(start+DelimPos[3] + 1,&tmpBuffer,10);
    MyVector[LnCnt].IntField2 = strtol(start+DelimPos[8] + 1,&tmpBuffer,10);
    MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1,&tmpBuffer);
    // reset Delim counter each line
    DelimCounter=0;
    // previous line separator treated as first delimiter of next line
    DelimPos[DelimCounter] = i;
    DelimCounter++;
    LnCnt++;
  }
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();
};

With this piece of code, I handle 2100 files (6.3 GB) in 57 seconds!!! (I hard-code the CSV format in it and only grab 4 values from each line.) Thanks for everyone's help; you all inspired me in solving this problem.

Mainly you want to avoid copying.

If you can afford the memory to load the whole file into an array, then use that array directly; don't convert it back to a stringstream, as that makes another copy. Just process the big buffer!
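
As a rough sketch of what processing the big buffer directly can look like (this assumes C++17 for std::string_view; the function names are just illustrative):

#include <fstream>
#include <iterator>
#include <string>
#include <string_view>

// Sketch: slurp the file once, then walk the buffer with string_views
// instead of copying it into a stringstream.
std::string LoadFile(const std::string& filename)
{
    std::ifstream ifs(filename, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(ifs),
                       std::istreambuf_iterator<char>());
}

void ProcessBuffer(std::string_view buf)
{
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::size_t eol = buf.find('\n', pos);
        if (eol == std::string_view::npos) eol = buf.size();
        std::string_view line = buf.substr(pos, eol - pos);
        // ... split 'line' on ',' and convert the fields here ...
        pos = eol + 1;
    }
}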

On the other hand, that requires your machine to free up adequate RAM for your allocation, and possibly page some RAM to disk, which will be slow to process. The alternative is to load your file in large chunks, identify the lines in each chunk, and only copy down the partial line at the end of the chunk before loading the next portion of the file to concatenate to that partial line (a wrap and read).

Another option is that most operating systems provide a memory-mapped file view, which means the OS does the file copying for you. These are more constrained (you have to use fixed block sizes and offsets) but will be faster.

You can use methods like strtok_r to split your file into lines, and lines into fields, though you need to deal with escaped field markers - you need to do that anyway. It is possible to write a tokeniser that works like strtok but returns string_view-like ranges instead of actually inserting null bytes.
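
A sketch of such a tokeniser (assuming C++17; split_view is a made-up name, and it does not handle quoted or escaped fields):

#include <string_view>
#include <vector>

// Sketch: strtok-like splitting that returns views into the original text
// instead of modifying it with null bytes.
std::vector<std::string_view> split_view(std::string_view line, char delim = ',')
{
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    for (;;) {
        std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}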

Finally you may need to convert some of the field strings to numeric forms or otherwise interpret them. Ideally don't use istringstream, as that makes another copy of the string. If you must, perhaps craft your own streambuf that uses the string_view directly, and attach it to an istream?
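
A minimal sketch of that streambuf idea (the class name is made up; it only supports reading, and it assumes the viewed buffer outlives the stream):

#include <istream>
#include <streambuf>
#include <string_view>

// Sketch: a read-only streambuf over an existing buffer, so an istream can
// parse straight out of it without another copy.
class ViewBuf : public std::streambuf {
public:
    explicit ViewBuf(std::string_view v) {
        char* p = const_cast<char*>(v.data());  // never written through; read-only use
        setg(p, p, p + v.size());
    }
};

// usage sketch:
//   ViewBuf vb(line_view);
//   std::istream in(&vb);
//   float f; in >> f;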

So this should significantly reduce the amount of data copying going on, and you should see a speed-up.

Be aware that you can only directly access the fields and lines that are within your current file read window. When you wrap and read, any references you hold into that data become useless rubbish.

1. Why does reading in the whole file perform worse than not reading in the whole file?

Three words: locality of reference.

On-chip operations of modern CPUs are ridiculously fast, to the point where in many situations the number of CPU-cycles a program requires to execute has only a very small effect on the overall performance of a program. Instead, often the time it takes to complete a task is mostly or totally determined by the speed at which the RAM subsystem can supply data to the CPU, or (even worse) the speed at which the hard disk can supply data to the RAM subsystem.

Computer designers try to hide the giant discrepancy between CPU-speed and RAM-speed (and the further giant discrepancy between RAM-speed and disk-speed) through caching; for example, when a CPU first wants to access data on a particular 4kB page of RAM, it's going to have to sit and twiddle its thumbs for (what seems to the CPU) a very long time before that data is delivered from RAM to the CPU. But after that first painful wait, the second CPU-access to nearby data within that same page of RAM will be quite fast, because at that point the page is cached within the CPU's on-chip cache and the CPU no longer has to wait for it to be delivered.

But the CPU's on-chip caches are (relatively) small -- nowhere near large enough to fit an entire 100+MB file. So when you load a giant file into RAM, you're forcing the CPU to do two passes across a large area of memory -- the first pass to read all the data in, and then a second pass when you go back to parse all the data.

Assuming your program is RAM-bandwidth limited (and for this simple parsing task it definitely should be), that means two scans across the data will take roughly twice as long as doing everything within a single scan.

2. Is there any better string split function?

I've always kind of liked strtok(), since you can be pretty confident it's not going to do anything inefficient (like call malloc()/free()) behind your back. Or if you wanted to get really crazy, you could write your own mini-parser using a char * pointer and a for-loop, although I doubt it would end up being noticeably faster than a strtok()-based loop anyway.
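
For illustration, here is a sketch of that hand-rolled mini-parser idea (the ParseInts name is made up; it only handles plain non-negative integer fields, so the label and the x.5 values in the sample data would need extra cases):

#include <vector>

// Sketch: walk the text with a char pointer and build integers in place,
// with no per-field library calls.
std::vector<int> ParseInts(const char* p)
{
    std::vector<int> out;
    while (*p) {
        int value = 0;
        while (*p >= '0' && *p <= '9')
            value = value * 10 + (*p++ - '0');
        out.push_back(value);
        if (*p == ',')
            ++p;        // skip the delimiter, continue with the next field
        else
            break;      // '\n', '\0' or an unexpected character ends the row
    }
    return out;
}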

3. The ReadinFile function needs to read into a buffer and then write to a stringstream for processing; is there any method to avoid this, i.e. read directly into the stringstream?

I'd say just put a while()-loop around fgets(), and then after each call to fgets() has read in a line of CSV text, have an inner while()-loop around strtok() to parse out the fields within that line. For maximum efficiency, it's hard to go wrong with good old-fashioned C-style I/O.

5. How about reading the whole file into a string, splitting the whole string into a vector on '\n', and then splitting each string in the vector on ',' to process? Will this perform better? And what is the limit (max size) of a string?

I seriously doubt you would get better performance doing that. The string class is not really designed to operate efficiently on multi-megabyte strings.

6. Or should I define a struct like this (based on the format) [...] and read directly into a vector? How can I do this?

Yes, that's a good idea -- if you can do everything in a single pass you will come out ahead, efficiency-wise. You should just be able to declare (e.g.) a vector<struct MyStruct>, and for each line you parse in the file, write the parsed values into a MyStruct object as you are parsing them (e.g. with atoi()), and then after the MyStruct object is fully populated/written-to, push_back(myStruct) onto the end of the vector.

(The only thing faster than that would be to get rid of the vector<struct MyStruct> as well, and just do whatever it is you need to do with the data right there inside your parsing loop, without bothering to store the entire data set in a big vector at all. That could be an option, e.g. if you just needed to calculate the sum of all the items in each field, but OTOH it may not be possible for your use-case.)
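
A sketch combining the fgets()/strtok() loop with the single-pass fill of vector<MyStruct> described above. The field order follows the 27 columns of the sample data; strtok collapses empty fields, which is fine for this data; the ReadRecords name and the 4096-byte line buffer are assumptions, not part of the answer:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

struct MyStruct {
    std::string Item1;
    int   It2_3[2];
    float It4;
    int   ItRemain[23];
};

// Sketch: one pass over the file, filling one MyStruct per line and
// pushing it onto the vector.
std::vector<MyStruct> ReadRecords(const char* filename)
{
    std::vector<MyStruct> records;
    FILE* fp = std::fopen(filename, "r");
    if (!fp) return records;

    char line[4096];
    while (std::fgets(line, sizeof line, fp)) {
        MyStruct m{};
        char* tok = std::strtok(line, ",\n");
        if (!tok) continue;                 // skip empty lines
        m.Item1 = tok;                      // first field is the label

        int field = 0;                      // index of the field after the label
        while ((tok = std::strtok(nullptr, ",\n")) != nullptr && field < 26) {
            if (field < 2)       m.It2_3[field]        = std::atoi(tok);
            else if (field == 2) m.It4                 = static_cast<float>(std::atof(tok));
            else                 m.ItRemain[field - 3] = std::atoi(tok);
            ++field;
        }
        records.push_back(std::move(m));
    }
    std::fclose(fp);
    return records;
}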

What you need is memory-mapping.

You can find more here.
