
Reading a binary file line by line in C++

I'm a beginner in C++, so I hope you'll bear with me.

I'm trying to read a file whose lines, in text format, look either like this (the first few lines, called header lines):

@HD VN:1.5  SO:queryname

or like this:

read.1  4   *   0   0   *   *   0   0   CAACCNNTACCACAGCCCGANGCATTAACAACTTAANNNCNNNTNNANNNNNNNNNNNNTTGAAAAAAAAAAAAAAAAAA    A<.AA##F..<F)<)FF))<#A<7<F.)FA.FAA.)###.###F##)############)FF)A<..A..7A....<F.A    XC:Z:CAACCNNTACCA   RG:Z:A  XQ:i:2

Both are tab-delimited.

The file is very large and therefore is in binary format. I'm wondering whether it is possible to read each line from the binary-format file, do some processing on that line, and then write it to a binary-format output file.

I started with this code:

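#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(int argc, char* argv[])
{
  // argv[1] and argv[2] are only valid if both paths were passed
  if(argc < 3){
    cerr<<"usage: "<<argv[0]<<" <input_file> <output_file>"<<endl;
    return 1;
  }
  string input_file = argv[1];
  string output_file = argv[2];
  string line;
  ifstream istream;
  istream.open(input_file.c_str(),ios::binary|ios::in);
  ofstream ostream;
  ostream.open(output_file.c_str(),ios::binary|ios::out);
  // stop early if either file could not be opened
  if(!istream || !ostream){
    cerr<<"could not open input or output file"<<endl;
    return 1;
  }
  while(getline(istream,line,'\n')){
    if(line.empty()) continue; // skip blank lines
    //process line assuming it is read as a string
    ostream<<line<<endl;
  }
  istream.close();
  ostream.close();
}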

But it crashes with Segmentation fault (core dumped) in the part where I try to parse line into a string vector.

Is there a way to read the binary format and split it by lines, do string processing on each such line, and then write them to a binary output?

BTW, I'm running this on Linux.

Is it possible to read a binary file line by line?

Every file is, in principle, binary, because that's just how computers work. Now, saying "I'm trying to read it line by line" clearly means you're treating it as a text file: "line" is a text concept.

The file is very large and therefore is in binary format.

That's top-notch bullshit. Size doesn't change the format of your file.

How do I get each line as a string? Does the ostream<<line<<endl; work for writing a string to a binary file?

Yes and no: if your file is not a text file, why does it matter where these '\n' characters are? To a non-text file, these are just normal bytes like 'a' or 0x00 or 0xFF. So basically, you're looking at ingrain wallpaper and trying to spot letters in there.

However, judging by your illustration of the files we're talking about, they are in fact files that only contain text.

So your problem seems to lie in the fact that a single line might exceed the storage available to a std::string. That's a rare case, but it seems it can happen for genetic strings. Well.

Get familiar with the non-text-oriented file I/O that C++ has. Basically, there's ifstream::read(), and you should use it to get a (limited) number of bytes, do your processing, write to the output, and repeat. Look out for the newline character in your input, and "rewind" your file (seekg on an ifstream; fseek is the C stdio equivalent) if you've read past it.

Also, I really wonder how long your lines would have to become to break std::string. I guess you might be running on some very limited OS (32 bit?) or computer (very little RAM + swap?).

If your file is structured into lines, and each line is terminated with a \n, then it is a text file. Every file is binary underneath, and text files are just a special kind of binary file.

So, given that, the code you've shown is likely to work fine for files of any size.

You should really remove the ios::binary, but I don't expect it to make any difference in this case.

But if you're getting a crash while "processing" a line of the file, that's where the bug is most likely to be: in the code you haven't disclosed yet!
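Since the parsing code was not posted, the following is only a guess at a likely failure mode. In the sample, a header line has 3 tab-separated fields while a read line has 14, so code that indexes, say, field 9 of every line would run off the end of the vector on header lines. A minimal sketch of a tab splitter with bounds-checked access (split_tabs and field_or_null are hypothetical names):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on '\t'. Empty fields are kept, so an
// empty line yields a single empty field.
std::vector<std::string> split_tabs(const std::string& line)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));  // last (or only) field
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}

// Bounds-checked access: returns nullptr instead of reading past the end,
// which is what fields[index] would do on a line with too few fields.
const std::string* field_or_null(const std::vector<std::string>& fields,
                                 std::size_t index)
{
    return index < fields.size() ? &fields[index] : nullptr;
}
```

Checking the field count (or whether the line starts with '@') before indexing is usually enough to tell header lines from read lines.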

It looks like your file has different line endings than you expect. It could have \r where you expect \n. If that is the case, then std::getline tries to read the whole 30GB file into the line std::string.

I suggest you check which line endings your file has, to verify the above. If that is the case, you can use the line-reading function from this SO question: "Getting std::ifstream to handle LF, CR, and CRLF?", which should read lines even if their endings are not native to your platform (or, rather, endings you do not expect).

Also, you should be fine using non-binary file mode. The sample lines shown in the question do not look very binary to me.
