
Parse huge CSV file with C++

In order to simulate my network I am using a trace file (CSV file) with a size between 5 and 30 GB. The CSV file is row based, where each row contains multiple fields delimited by a space, together forming the information for a network packet:

3     53      4    12    1     1  2  6

Since the file's size could reach several GB (millions of lines), is it better to divide it into small chunks (myfile00.csv, myfile01.csv, ...), or can I process the entire file from the hard drive without loading it into memory? I want to read the file line by line at a specific time, which is the clock cycle of the simulation, and get all the information in the line to create an OMNeT++ message.

packet MyTrace::getpacket() {
    int id;                   // first field
    int cycle;                // second field
    int source;               // third field
    int destination;          // fourth field
    int numberofDep;          // fifth field
    std::list<int> listofDep; // remaining fields

    if (traceFile.is_open()) {
        // get id
        // get cycle
        // ....
    }
}

Any suggestion would be helpful.

EDIT:

string line;
ifstream myfile("BlackSmall.csv");
if (myfile.is_open()) {
    while (getline(myfile, line)) {
        istringstream ss(line);
        string request;
        int id, cycle, source, dest, srcType, destType, packetSize, dependency;
        int listdep;
        std::list<int> dep;
        ss >> id;
        ss >> cycle;
        ss >> source;
        ss >> dest;
        ss >> request;
        ss >> srcType;
        ss >> destType;
        ss >> packetSize;
        ss >> dependency;
        while (ss >> listdep) dep.push_back(listdep);
        // Create my packet
    }
    myfile.close();
}
else cout << "Unable to open file";

With the above code, I can get all the information that I need from a line. The problem is that I need to use this code inside a class which, when called, returns just one line's information. Is there a way to point to a specific line when I call this class?

It seems like your application requires a single sequential pass through the input, so processing a file that is 1 GB or 100 GB is perhaps just a matter of patience and perhaps parallelism.

The approach should be to translate records line by line. You should avoid strategies that attempt to read the entire file into memory. The standard library offers the easy-to-use std::ifstream class together with the free function std::getline, which fills a std::string with the line to be converted.
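One way to give each call the next line's information is to keep the stream open as a class member and advance it on every call. A minimal sketch, assuming a simplified Packet with only a subset of the question's fields (the class and method names here are made up):

```cpp
#include <fstream>
#include <list>
#include <sstream>
#include <string>

// Hypothetical packet record; field names follow the question's code,
// but only a subset of the fields is shown.
struct Packet {
    int id = 0, cycle = 0, source = 0, dest = 0;
    std::list<int> dependencies;
};

class TraceReader {
public:
    explicit TraceReader(const std::string &path) : in_(path) {}

    // Reads the next line and parses it into `p`.
    // Returns false at end of file or on a malformed line.
    bool next(Packet &p) {
        std::string line;
        if (!std::getline(in_, line)) return false;
        std::istringstream ss(line);
        if (!(ss >> p.id >> p.cycle >> p.source >> p.dest)) return false;
        p.dependencies.clear();
        for (int dep; ss >> dep;) p.dependencies.push_back(dep);
        return true;
    }

private:
    std::ifstream in_; // stays open between calls, remembering the position
};
```

Each call to next() leaves the stream positioned after the line just read, so successive calls walk the file one record at a time without ever holding more than one line in memory.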

If you are feeling more ambitious and want to control the amount of data read or buffered more carefully, then you would not be the first developer to roll your own buffered reader. This is a fairly instructive exercise and will help you think through some corner cases, such as reading partial lines. But in the end, it probably will not give you a significant boost toward your goal. I suspect the ifstream approach will get you up and running without the hassle and will not ultimately be the bottleneck in processing these files.

If you were really concerned about optimizing execution time, then having multiple files might help you launch parallel processing tasks.
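As a sketch of that idea, each chunk file can be handed to its own thread; the per-line work here is a placeholder (a line count) standing in for the real packet processing:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Counts the lines of one chunk file; stands in for real per-record work.
std::size_t process_chunk(const std::string &path) {
    std::ifstream in(path);
    std::size_t lines = 0;
    for (std::string line; std::getline(in, line);) ++lines;
    return lines;
}

// Processes every chunk file in its own thread and sums the results.
std::size_t process_all(const std::vector<std::string> &paths) {
    std::vector<std::size_t> counts(paths.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < paths.size(); ++i)
        workers.emplace_back([&, i] { counts[i] = process_chunk(paths[i]); });
    for (auto &w : workers) w.join();
    std::size_t total = 0;
    for (auto c : counts) total += c;
    return total;
}
```

Each thread writes only its own slot in `counts`, so no locking is needed; the join loop is the only synchronization point.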

// define a class to hold your custom record
class Record {
public:
    int id = 0;
    int cycle = 0;
    int source = 0;
    int destination = 0;
};

// create a parser function to convert a line of text into the record
bool parse(std::string const &line, Record &record) {
    std::istringstream ss(line);
    return static_cast<bool>(
        ss >> record.id >> record.cycle >> record.source >> record.destination);
}

// create a translator method to convert a record into the desired output
bool write(Record const &record, std::ofstream &os) {
    os << record.id << ' ' << record.cycle << ' '
       << record.source << ' ' << record.destination << '\n';
    return static_cast<bool>(os);
}

// actually open the input and output streams (file names are examples)
std::ifstream is("BlackSmall.csv");
std::ofstream os("packets.out");
std::string line;

while (std::getline(is, line)) {
    Record record;
    if (!parse(line, record)) break;
    if (!write(record, os)) break;
}

You can re-use the Record instance by moving it outside the while loop, so long as you are careful to reset the variable so that information from preceding records does not taint the current record. You can also dive head first into the C++ ecosystem by writing stream input and output operators ("<<", ">>"), but I personally find this approach to be more confusion than it is worth.

Perhaps the best approach for you would be to import your CSV file into an SQLite database.

Once you import it and add some indexes, you can easily and very efficiently query the necessary rows from that database. SQLite has lots of ready-to-use C/C++ client libraries available; you can start with the default one at https://www.sqlite.org/cintro.html .
