Fast, Simple CSV Parsing in C++

I am trying to parse a simple CSV file, with data in a format such as:

20.5,20.5,20.5,0.794145,4.05286,0.792519,1
20.5,30.5,20.5,0.753669,3.91888,0.749897,1
20.5,40.5,20.5,0.701055,3.80348,0.695326,1

So, a very simple, fixed-format file. I am storing each column of this data in an STL vector. As such, I've tried to stay the C++ way using the standard library, and my implementation within a loop looks something like:

string field;
getline(file,line);
stringstream ssline(line);

getline( ssline, field, ',' );
stringstream fs1(field);
fs1 >> cent_x.at(n);

getline( ssline, field, ',' );
stringstream fs2(field);
fs2 >> cent_y.at(n);

getline( ssline, field, ',' );
stringstream fs3(field);
fs3 >> cent_z.at(n);

getline( ssline, field, ',' );
stringstream fs4(field);
fs4 >> u.at(n);

getline( ssline, field, ',' );
stringstream fs5(field);
fs5 >> v.at(n);

getline( ssline, field, ',' );
stringstream fs6(field);
fs6 >> w.at(n);

The problem is that this is extremely slow (there are over 1 million rows per data file), and it seems to me a bit inelegant. Is there a faster approach using the standard library, or should I just use stdio functions? It seems to me this entire code block would reduce to a single fscanf call.

Thanks in advance!

Using 7 string streams when you can do it with just one sure doesn't help performance. Try this instead:

string line;
getline(file, line);

istringstream ss(line);  // note we use istringstream, we don't need the o part of stringstream

char c1, c2, c3, c4, c5;  // to eat the commas

ss >> cent_x.at(n) >> c1 >>
      cent_y.at(n) >> c2 >>
      cent_z.at(n) >> c3 >>
      u.at(n) >> c4 >>
      v.at(n) >> c5 >>
      w.at(n);

If you know the number of lines in the file, you can resize the vectors prior to reading and then use operator[] instead of at(). This way you avoid bounds checking and thus gain a little performance.
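To make the suggestion above concrete, here is a minimal sketch of the whole read loop with pre-sized vectors and operator[]. The Columns struct and the read_columns function are names made up for this illustration, and the row count is assumed to be known in advance. A single istringstream is reused for every line.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the six columns named in the question.
struct Columns {
    std::vector<double> cent_x, cent_y, cent_z, u, v, w;
};

// Reads at most n_rows lines from `in`; returns the number of rows parsed.
std::size_t read_columns(std::istream& in, std::size_t n_rows, Columns& c) {
    // Size every vector once, so operator[] below never goes out of range.
    c.cent_x.resize(n_rows); c.cent_y.resize(n_rows); c.cent_z.resize(n_rows);
    c.u.resize(n_rows);      c.v.resize(n_rows);      c.w.resize(n_rows);

    std::string line;
    std::istringstream ss;   // one stream object, reused for every line
    char comma;              // eats the delimiters; not checked to be ','
    std::size_t n = 0;

    while (n < n_rows && std::getline(in, line)) {
        ss.clear();          // reset the state flags left by the previous line
        ss.str(line);        // point the stream at the new line
        if (ss >> c.cent_x[n] >> comma >> c.cent_y[n] >> comma
               >> c.cent_z[n] >> comma >> c.u[n] >> comma
               >> c.v[n] >> comma >> c.w[n]) {
            ++n;             // keep the row only if all six values parsed
        }
    }
    return n;
}
```

Called as read_columns(file, row_count, cols), it returns the number of rows that parsed cleanly, which may be smaller than row_count if some lines are malformed. The trailing seventh column is simply left unread.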

I believe the major bottleneck (setting aside the getline()-based I/O) is the string parsing. Since you have "," as the delimiter, you can perform a single linear scan over the string and replace every "," with "\0" (the end-of-string marker, or zero terminator), so that each field can be passed directly to atof().

Something like this:

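// tmp array for the values parsed from one line
const int MAX_PARTS = 7;  // assumption: one slot per column in the sample data
double parts[MAX_PARTS];

while(getline(file, line))
{
    if(line.empty()) { continue; }

    size_t len = line.length();
    const char* last_start = &line[0];
    int num_parts = 0;

    for(size_t j = 0; j < len && num_parts < MAX_PARTS; j++)
    {
        if(line[j] == ',')
        {
            // terminate the current field in place so atof() stops here
            line[j] = '\0';
            parts[num_parts++] = atof(last_start);

            // the next field starts right after the comma
            last_start = &line[j + 1];
        }
    }

    // the last field has no trailing comma, so parse it after the scan
    if(num_parts < MAX_PARTS) { parts[num_parts++] = atof(last_start); }

    /// do whatever you need with the parts[] array
}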
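As a variation on the same single-pass idea, std::strtod can be used in place of atof. This is a swapped-in technique rather than the code above: strtod stops at the first character that is not part of the number and reports where it stopped, so the line does not have to be modified. parse_fields is a name invented for this sketch, and it assumes plain numeric fields with no quoting and no empty fields.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Parses up to max_parts comma-separated numbers from `line` into `parts`.
// Returns how many values were parsed.
std::size_t parse_fields(const std::string& line, double* parts,
                         std::size_t max_parts) {
    const char* p = line.c_str();
    std::size_t n = 0;

    while (n < max_parts && *p != '\0') {
        char* end = nullptr;
        double value = std::strtod(p, &end);
        if (end == p) { break; }   // no number at this position: stop
        parts[n++] = value;

        p = end;                   // first character after the number
        if (*p == ',') { ++p; }    // step over the delimiter
        else { break; }            // end of line or unexpected character
    }
    return n;
}
```

The return value tells the caller how many columns were actually present, so a short or malformed line can be detected by comparing it with the expected column count.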

I don't know if this will be quicker than the accepted answer, but I might as well post it anyway in case you wish to try it. You can load the entire contents of the file with a single read call, using some fseek magic to find the size of the file first. This will be much faster than multiple read calls.
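A rough sketch of the single-read idea, using only stdio calls. read_whole_file is a hypothetical helper name. The sketch assumes the file fits in memory and that seeking to the end of a binary stream reports its size, which holds on common desktop platforms but is not strictly guaranteed by the C standard.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: reads the whole file at `path` into `out`.
// Returns false if the file cannot be opened or read in full.
bool read_whole_file(const char* path, std::string& out) {
    // Binary mode, so the size from ftell matches the bytes fread returns.
    std::FILE* f = std::fopen(path, "rb");
    if (!f) { return false; }

    // Seek to the end to learn the size, then go back to the start.
    if (std::fseek(f, 0, SEEK_END) != 0) { std::fclose(f); return false; }
    long size = std::ftell(f);
    if (size < 0) { std::fclose(f); return false; }
    std::rewind(f);

    // One allocation and one fread call for the whole file.
    out.resize(static_cast<std::size_t>(size));
    std::size_t got = 0;
    if (size > 0) { got = std::fread(&out[0], 1, out.size(), f); }
    std::fclose(f);
    return got == out.size();
}
```

The resulting string can play the role of rawfiledata in the parsing code that follows.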

You could then do something like this to parse your string:

//Delimited string to vector
vector<string> dstov(const string& str, const string& delimiter)
{
  //Vector to populate
  vector<string> ret;
  //Current position in str
  size_t pos = 0;
  //Position of the next delimiter at or after pos
  size_t found;
  //While the string from point pos contains the delimiter
  //(searching from pos avoids copying the rest of the string on every pass)
  while((found = str.find(delimiter, pos)) != string::npos)
  {
    //Insert the substring from pos to the start of the found delimiter to the vector
    ret.push_back(str.substr(pos, found - pos));
    //Move the pos past this found section and the found delimiter so the search can continue
    pos = found + delimiter.size();
  }
  //Push back the final element in str when str contains no more delimiters
  ret.push_back(str.substr(pos));
  return ret;
}

string rawfiledata;

//This call will parse the raw data into a vector containing lines of
//20.5,30.5,20.5,0.753669,3.91888,0.749897,1 by treating the newline
//as the delimiter
vector<string> lines = dstov(rawfiledata, "\n");

//You can then iterate over the lines and parse them into variables and do whatever you need with them.
for(size_t itr = 0; itr < lines.size(); ++itr)
{
  vector<string> line_variables = dstov(lines[itr], ",");
  //line_variables now holds the fields of one line as strings
}
