Fastest way to read a CSV file in C++ that contains a large number of columns and rows

Question

I have a pipe-delimited data file with more than 13 columns. The total file size is above 100 MB. I read each row and split the string into a std::vector<std::string> so I can do calculations. I repeat this for every row in the file, like below:

    string filename = "file.dat";
    fstream infile(filename);
    string line;
    while (getline(infile, line)) {
        string item;
        stringstream ss(line);
        vector<string> splittedString;
        while (getline(ss, item, '|')) {
            splittedString.push_back(item);
        }
        int a = stoi(splittedString[0]); 
        // I do some processing like this before some manipulation and calculations with the data
    }

This is, however, very time-consuming, and I am pretty sure it is not the most efficient way to read a CSV-type file. How can this be improved?

Update

I tried using the boost::split function instead of the while loop, but it is actually even slower.

Answer 1

Strictly speaking, you don't have a CSV file: CSV stands for Comma-Separated Values, and your file is not comma-separated.
You have a delimited text file (apparently delimited by "|"). Parsing real CSV is more complicated than simply splitting on ",", because a quoted field may itself contain the separator.
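
To make that difference concrete, here is a minimal sketch of what goes wrong when a comma-separated record with a quoted field is split naively. The helper name naive_split and the sample record are made up for illustration; the splitting loop is the same getline-based one used in the question.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits on every occurrence of `delim`,
// with no awareness of CSV quoting.
std::vector<std::string> naive_split(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string item;
    while (std::getline(ss, item, delim)) {
        fields.push_back(item);
    }
    return fields;
}

// The record below has three logical fields, but the comma inside the
// quotes is treated as a separator, so the naive split produces four pieces.
void demo_quoted_field() {
    const std::vector<std::string> parts =
        naive_split("1,\"Smith, John\",42", ',');
    assert(parts.size() == 4);
    assert(parts[1] == "\"Smith");
}
```

A pipe-delimited file whose fields never contain "|" does not have this problem, which is why plain splitting is enough for the rest of this answer.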

Anyway, without changing your approach too dramatically, here are a few suggestions:

Use (more) buffering.
Move the vector out of the loop and clear() it in every iteration. That saves heap reallocations.
Use string::find() instead of stringstream to split the string.

Something like this:

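    #include <fstream>
    #include <string>
    #include <vector>

    using namespace std;

    int main() {
        string filename = "file.dat";
        char buffer[65536];
        ifstream infile;
        // The effect of pubsetbuf() is implementation-defined; setting the
        // buffer before opening the file is the most portable order.
        infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
        infile.open(filename);
        string line;
        vector<string> splittedString;  // reused for every row
        while (getline(infile, line)) {
            splittedString.clear();
            size_t last = 0, pos = 0;
            while ((pos = line.find('|', last)) != std::string::npos) {
                splittedString.emplace_back(line, last, pos - last);
                last = pos + 1;
            }
            // Always add the final field, so a line without any '|' still
            // yields one element and splittedString[0] is valid.
            splittedString.emplace_back(line, last);
            int a = stoi(splittedString[0]);
            // I do some processing like this before some manipulation and calculations with the data
        }
    }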
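
The find-based split with a reused vector can also be packaged as a small function, which makes the edge cases easier to check. This is only a sketch: split_into is a name invented here, not something from the original answer. It relies on the fact that vector::clear() leaves the vector's capacity unchanged, so after the first few rows the vector itself no longer reallocates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on `delim` into `out`, reusing the
// storage that `out` already owns. Returns the number of fields.
std::size_t split_into(const std::string& line, char delim,
                       std::vector<std::string>& out) {
    out.clear();  // destroys the old strings but keeps the vector's capacity
    std::size_t last = 0, pos = 0;
    while ((pos = line.find(delim, last)) != std::string::npos) {
        out.emplace_back(line, last, pos - last);
        last = pos + 1;
    }
    out.emplace_back(line, last);  // final field, possibly empty
    assert(!out.empty());          // every line yields at least one field
    return out.size();
}
```

Note that clear() still destroys the individual strings, so fields too long for the small-string optimization are allocated again on every row. The vector reuse removes only the reallocation of the vector's own storage.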

Answer 2

You can save another 50% by eliminating the vector<string> splittedString and parsing in place with strtok_s():

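    #include <chrono>
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>

    using namespace std;
    using namespace std::chrono;

    int main() {
        auto t1 = high_resolution_clock::now();
        long long a(0);

        string filename = "file.txt";
        char buffer[65536];
        ifstream infile;
        // Set the buffer before opening; the effect of pubsetbuf() is
        // implementation-defined.
        infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
        infile.open(filename);
        string line;
        while (getline(infile, line)) {
            // strtok_s overwrites each delimiter with '\0', so it needs a
            // writable buffer; &line[0] provides one without a const_cast.
            char* pch = &line[0];
            char* nextToken = NULL;
            // The three-argument strtok_s is the MSVC form; POSIX offers
            // strtok_r with the same argument order.
            // Note: consecutive delimiters are collapsed, so empty fields are skipped.
            pch = strtok_s(pch, "|", &nextToken);
            while (pch != NULL) {
                a += std::stoi(pch);
                pch = strtok_s(NULL, "|", &nextToken);
            }
        }

        auto t2 = high_resolution_clock::now();
        auto duration = duration_cast<microseconds>(t2 - t1).count();
        std::cout << duration << "\n";
        std::cout << a << "\n";
    }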
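
Since the three-argument strtok_s is specific to Microsoft's runtime, here is a hedged sketch of the same in-place idea using only standard C++17: std::string_view to walk the line and std::from_chars to convert each field. This swaps in a different technique from the answer's strtok_s; the function name sum_int_fields is made up for illustration, and no timing claim is made for it.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical helper: sums the integer fields of one delimited line
// without copying or modifying it. Fields that are empty or are not
// entirely an integer are skipped.
long long sum_int_fields(std::string_view line, char delim = '|') {
    long long total = 0;
    std::size_t last = 0;
    while (last <= line.size()) {
        std::size_t pos = line.find(delim, last);
        if (pos == std::string_view::npos) pos = line.size();
        assert(pos >= last);  // find() never returns a position before `last`
        const char* first = line.data() + last;
        const char* end = line.data() + pos;
        int value = 0;
        const std::from_chars_result res = std::from_chars(first, end, value);
        // Accept the field only if all of it was consumed as an integer.
        if (res.ec == std::errc() && res.ptr == end) total += value;
        last = pos + 1;
    }
    return total;
}
```

Inside the read loop this would be called as a += sum_int_fields(line). Unlike stoi, from_chars does not skip leading whitespace or accept a leading "+", and unlike strtok_s this loop sees empty fields instead of collapsing them, so column positions stay aligned.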

Answer 1: score 4, accepted, 2019-07-17 09:38:29

Answer 2: score 1, 2019-08-06 21:25:23
