
C++ program gives up when reading large binary file

I'm using a file from the MNIST website as an example; specifically, t10k-images-idx3-ubyte.gz. To reproduce my problem, download that file and unzip it, and you should get a file named t10k-images.idx3-ubyte, which is the file we're reading from.

My problem is that when I try to read bytes from this file in one big block, it seems to read a bit of it and then sort of gives up. Below is a bit of C++ code that attempts to read (almost) all of the file at once, and then dumps it into a text file for debugging purposes. (Excuse the unnecessary #includes.) For context, the file is a binary file whose first 16 bytes are magic numbers, which is why I seek to byte 16 before reading it. Bytes 16 to the end are raw greyscale pixel values of 10,000 images of size 28 x 28.

#include <array>
#include <iostream>
#include <fstream>
#include <string>
#include <exception>
#include <vector>

int main() {
  try {
    std::string path = R"(Drive:\Path\To\t10k-images.idx3-ubyte)";
    std::ifstream inputStream {path};
    inputStream.seekg(16);  // Skip the magic numbers at the beginning.
    char* arrayBuffer = new char[28 * 28 * 10000];  // Allocate memory for 10,000 greyscale images of size 28 x 28.
    inputStream.read(arrayBuffer, 28 * 28 * 10000);
    std::ofstream output {R"(Drive:\Path\To\PixelBuffer.txt)"};  // Output file for debugging.
    for (size_t i = 0; i < 28 * 28 * 10000; i++) {
      output << static_cast<short>(arrayBuffer[i]);
      // Below prints a new line after every 28 pixels.
      if ((i + 1) % 28 == 0) {
        output << "\n";
      }
      else {
        output << " ";
      }
    }
    std::cout << inputStream.good() << std::endl;
    std::cout << "WTF?" << std::endl;  // LOL. I just use this to check that everything's actually been executed, because sometimes the program shits itself and quits silently.
    delete[] arrayBuffer;
  } catch (const std::exception& e) {
    std::cout << e.what() << std::endl;
  } catch (...) {
    std::cout << "WTF happened!?!?" << std::endl;
  }
  return 0;
}

When you run the code (after modifying the paths appropriately) and check the output text file, you will see that it initially contains legitimate byte values from the file (integers between -128 and 127, but mostly 0). As you scroll down, though, you will find that after some legitimate values, the printed values all become the same (in my case either all 0 or all -51 for some reason). What you see may be different on your computer, but in any case, they would be what I assume to be uninitialised bytes. So it seems that ifstream::read() works for a while, but gives up and stops reading very quickly. Am I missing something basic? Like, is there a buffer limit on the number of bytes I can read at once that I don't know about?

EDIT: Oh, by the way, I'm on Windows.

To read binary files you need to open the stream with std::ifstream inputStream(path, std::ios_base::binary); otherwise the application may not read the right data.

So, the correct code is

#include <array>
#include <iostream>
#include <fstream>
#include <string>
#include <exception>
#include <vector>

int main() {
  try {
    std::string path = R"(Drive:\Path\To\t10k-images.idx3-ubyte)";
    std::ifstream inputStream (path, std::ios_base::binary);
    inputStream.seekg(16);  // Skip the magic numbers at the beginning.
    char* arrayBuffer = new char[28 * 28 * 10000];  // Allocate memory for 10,000 greyscale images of size 28 x 28.
    inputStream.read(arrayBuffer, 28 * 28 * 10000);
    std::ofstream output {R"(Drive:\Path\To\PixelBuffer.txt)"};  // Output file for debugging.
    for (size_t i = 0; i < 28 * 28 * 10000; i++) {
      output << static_cast<short>(arrayBuffer[i]);
      // Below prints a new line after every 28 pixels.
      if ((i + 1) % 28 == 0) {
        output << "\n";
      }
      else {
        output << " ";
      }
    }
    std::cout << inputStream.good() << std::endl;
    std::cout << "WTF?" << std::endl;  // LOL. I just use this to check that everything's actually been executed, because sometimes the program shits itself and quits silently.
    delete[] arrayBuffer;
  } catch (const std::exception& e) {
    std::cout << e.what() << std::endl;
  } catch (...) {
    std::cout << "WTF happened!?!?" << std::endl;
  }
  return 0;
}

and this does not depend on the platform (e.g. Windows, Linux, etc.).
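A short read like the one in the question is easy to miss because read() does not throw by default. As a minimal sketch (the helper name readBlock and its parameters are hypothetical, not part of the original code), the byte count from gcount() can be compared against the requested size so that a truncated read fails loudly instead of leaving stale bytes in the buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads exactly `count`
// bytes starting at byte `offset`, and throws if fewer bytes are delivered.
std::vector<char> readBlock(const std::string& path, std::streamoff offset, std::size_t count) {
  assert(offset >= 0);  // Precondition: offsets are measured from the start of the file.
  std::ifstream input(path, std::ios::binary);  // Binary mode: bytes are passed through untranslated.
  if (!input) {
    throw std::runtime_error("cannot open " + path);
  }
  input.seekg(offset);
  std::vector<char> buffer(count);  // Zero-filled, and released automatically on scope exit.
  input.read(buffer.data(), static_cast<std::streamsize>(count));
  // gcount() reports how many bytes the last unformatted read actually extracted.
  const std::streamsize got = input.gcount();
  if (got != static_cast<std::streamsize>(count)) {
    throw std::runtime_error("short read: got " + std::to_string(got) +
                             " of " + std::to_string(count) + " bytes");
  }
  return buffer;
}
```

For the file in the question, a call along the lines of readBlock(path, 16, 28 * 28 * 10000) would either return all the pixel bytes or report how many were actually read.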

Concerning the OP's code to open the binary file:

std::ifstream inputStream {path};

It should be:

std::ifstream inputStream(path, std::ios::binary);

It's a common trap on Windows:

A file stream should be opened with std::ios::binary to read or write binary files.

cppreference.com has a nice explanation concerning this topic:

Binary and text modes

A text stream is an ordered sequence of characters that can be composed into lines; a line can be decomposed into zero or more characters plus a terminating '\n' (“newline”) character. Whether the last line requires a terminating '\n' is implementation-defined. Furthermore, characters may have to be added, altered, or deleted on input and output to conform to the conventions for representing text in the OS (in particular, C streams on Windows OS convert '\n' to '\r\n' on output, and convert '\r\n' to '\n' on input).

Data read in from a text stream is guaranteed to compare equal to the data that were earlier written out to that stream only if each of the following is true:

  • The data consist of only printing characters and/or the control characters '\t' and '\n' (in particular, on Windows OS, the character '\0x1A' terminates input).
  • No '\n' character is immediately preceded by space characters (such space characters may disappear when such output is later read as input).
  • The last character is '\n'.

A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream always equal the data that were earlier written out to that stream, except that an implementation is allowed to append an indeterminate number of null characters to the end of the stream. A wide binary stream doesn't need to end in the initial shift state.

It's a good idea to use std::ios::binary for stream I/O of binary files on any platform. It has no effect on platforms where text and binary modes behave the same (e.g. Linux).
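To see the difference on a given platform, one option is to write a small file containing the byte 0x1A and count how many bytes each open mode delivers. The following is only a sketch; the helper names writeSample and countReadableBytes are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes `size` bytes to `path` in binary mode, all 'A'
// except for a single 0x1A byte at index `eofAt`.
void writeSample(const std::string& path, std::size_t size, std::size_t eofAt) {
  assert(eofAt < size);
  std::ofstream out(path, std::ios::binary);
  for (std::size_t i = 0; i < size; ++i) {
    out.put(i == eofAt ? '\x1A' : 'A');
  }
}

// Hypothetical helper: counts how many bytes can be read from `path` when the
// stream is opened with the given mode (std::ios::in is implied by ifstream).
std::size_t countReadableBytes(const std::string& path, std::ios::openmode mode) {
  std::ifstream input(path, mode);
  std::size_t total = 0;
  char chunk[4096];
  // A final partial chunk sets failbit but still reports its size via gcount().
  while (input.read(chunk, sizeof chunk) || input.gcount() > 0) {
    total += static_cast<std::size_t>(input.gcount());
  }
  return total;
}
```

After writeSample("sample.bin", 1000, 100), countReadableBytes("sample.bin", std::ios::binary) should report all 1000 bytes on any platform. With the default text mode (std::ios::in), the count is expected to be the same on Linux, while on Windows it may stop near byte 100, in line with the cppreference note above that 0x1A terminates input.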
