The fastest way to read a CSV file in C++ that contains a large number of columns and rows
I have a pipe-delimited data file with more than 13 columns. The total file size is above 100 MB. I am reading each row and splitting the string into a std::vector<std::string> so I can do calculations. I repeat this process for every row in the file, as shown below:
    string filename = "file.dat";
    fstream infile(filename);
    string line;
    while (getline(infile, line)) {
        string item;
        stringstream ss(line);
        vector<string> splittedString;
        while (getline(ss, item, '|')) {
            splittedString.push_back(item);
        }
        int a = stoi(splittedString[0]);
        // I do some processing like this before some manipulation and calculations with the data
    }
However, this is very time consuming, and I am pretty sure it is not the most efficient way to read a CSV-type file. How can this be improved?
I tried using the boost::split function instead of a while loop, but it is actually even slower.
You don't have a CSV file: CSV stands for Comma-Separated Values, and your data is not comma-separated. You have a delimited text file (apparently delimited by "|"). Parsing CSV is more complicated than simply splitting on ",".
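To illustrate why real CSV needs more than a split on ",", here is a minimal sketch of a single-line splitter that honours double-quoted fields. The function name splitCsvLine is made up for this example, the quoting rules assumed are the common RFC 4180 style ones (a field may be wrapped in double quotes, and "" inside it means a literal quote), and records spanning several lines are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original posts): splits one CSV record
// into fields, treating commas inside double quotes as data and "" inside
// a quoted field as an escaped quote. Multi-line records are not supported.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';    // escaped quote inside a quoted field
                    ++i;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else {
                current += c;          // delimiters are plain data here
            }
        } else if (c == '"') {
            inQuotes = true;           // opening quote
        } else if (c == ',') {
            fields.push_back(current); // field boundary
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);         // last field (possibly empty)
    return fields;
}
```

A plain split on "," would cut a field such as "Smith, John" in two, while this version keeps it as one field. For a pipe-delimited file without quoting, none of this is needed.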
Anyway, without too many dramatic changes to your approach, here are a few suggestions:
- Move the vector out of the loop and clear() it in every iteration. That will save on heap reallocations.
- Use string::find() instead of stringstream to split the string.
Something like this...
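    #include <fstream>
    #include <string>
    #include <vector>
    using namespace std;

    int main() {
        string filename = "file.dat";
        ifstream infile(filename);
        char buffer[65536];
        // The effect of pubsetbuf() is implementation-defined; libstdc++, for
        // example, only honors it when called before the file is opened.
        infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
        string line;
        vector<string> splittedString;  // reused for every line to avoid reallocations
        while (getline(infile, line)) {
            splittedString.clear();
            size_t last = 0, pos = 0;
            while ((pos = line.find('|', last)) != std::string::npos) {
                splittedString.emplace_back(line, last, pos - last);
                last = pos + 1;
            }
            // Always add the final field, even when the line contains no '|' at all
            splittedString.emplace_back(line, last);
            int a = stoi(splittedString[0]);
            // I do some processing like this before some manipulation and calculations with the data
        }
    }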
You can save another 50% by eliminating "vector splittedString;" and using in-place parsing with strtok_s():
    #include <chrono>
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>
    using namespace std;
    using namespace std::chrono;

    int main() {
        auto t1 = high_resolution_clock::now();
        long long a(0);
        string filename = "file.txt";
        ifstream infile(filename);
        char buffer[65536];
        infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
        string line;
        while (getline(infile, line)) {
            // strtok_s() overwrites each delimiter with '\0', so it needs a mutable buffer.
            // This three-argument strtok_s() is the MSVC form; POSIX offers strtok_r().
            char *pch = &line[0];
            char *nextToken = NULL;
            pch = strtok_s(pch, "|", &nextToken);
            while (pch != NULL) {
                a += std::stoi(pch);
                pch = strtok_s(NULL, "|", &nextToken);
            }
        }
        auto t2 = high_resolution_clock::now();
        auto duration = duration_cast<microseconds>(t2 - t1).count();
        std::cout << duration << "\n";
        std::cout << a << "\n";
    }
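One caveat with the strtok family is that consecutive delimiters are treated as a single one, so empty fields are silently skipped and column positions can shift. As a portable alternative, here is a sketch of the same in-place idea using std::from_chars from C++17. The helper name sumFields is made up for this example, and skipping fields that are empty or not fully numeric is an assumption about what the processing should do.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical helper (not from the original answer): adds up every field of
// `line` that parses completely as an int. Empty or non-numeric fields are
// skipped instead of throwing, and no temporary strings are created.
long long sumFields(std::string_view line, char delim = '|') {
    long long total = 0;
    std::size_t start = 0;
    while (start <= line.size()) {
        std::size_t end = line.find(delim, start);
        if (end == std::string_view::npos) {
            end = line.size();  // last field runs to the end of the line
        }
        int value = 0;
        const char* first = line.data() + start;
        const char* last = line.data() + end;
        const auto result = std::from_chars(first, last, value);
        // Accept the field only if the whole of it was consumed as a number.
        if (result.ec == std::errc() && result.ptr == last) {
            total += value;
        }
        start = end + 1;  // step past the delimiter
    }
    return total;
}
```

In the loop above, a += sumFields(line) would replace the strtok_s() calls. Note that std::from_chars is stricter than std::stoi: it accepts a leading minus sign but not leading whitespace or a plus sign. Whether it is actually faster on a given file is something to measure rather than assume.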