C++ binary files not read correctly

I am reading a file written in big-endian byte order on a little-endian Intel processor in C++. The file is a generic binary file. I have tried reading it with both open() and fopen(), but both seem to get the same thing wrong. The file is the binary training-images file from the MNIST dataset. It contains 4 headers, each 32 bits in size and stored in big-endian order. My code works, except that it does not give the right value for the 2nd header. It works for the rest of the headers. I even opened the file in a hex editor to check whether the value itself might be wrong, but it is right. For some weird reason, the program reads only the second header's value incorrectly. Here is the code that deals with reading the headers only:

void DataHandler::readInputData(std::string path){
    uint32_t headers[4];
    char bytes[4];
    std::ifstream file;
    //I tried both open() and fopen() as seen below
    file.open(path.c_str(), std::ios::binary | std::ios::in);
    //FILE* f = fopen(path.c_str(), "rb");
    if (file)
    {
        int i = 0;
        while (i < 4)//4 headers
        {
            //if (fread(bytes, sizeof(bytes), 1, f))
            //{
            //    headers[i] = format(bytes);
            //    ++i;
            //}
            file.read(bytes, sizeof(bytes));
            headers[i++] = format(bytes);
        }
        printf("Done getting images file header.\n");
        printf("magic: 0x%08x\n", headers[0]);
        printf("nImages: 0x%08x\n", headers[1]);//THIS IS THE ONE THAT IS GETTING READ WRONG
        printf("rows: 0x%08x\n", headers[2]);
        printf("cols: 0x%08x\n", headers[3]);
        exit(1);
        //reading rest of the file code here
    }
    else
    {
        printf("Invalid Input File Path\n");
        exit(1);
    }
}

//converts big endian to little endian (required for Intel processors)
uint32_t DataHandler::format(const char * bytes) const
{
    return (uint32_t)((bytes[0] << 24) |
        (bytes[1] << 16) |
        (bytes[2] << 8) |
        (bytes[3]));
}

The output I am getting is:

Done getting images file header.
magic: 0x00000803
nImages: 0xffffea60
rows: 0x0000001c
cols: 0x0000001c

nImages should be 60,000, or 0x0000ea60 in hex, but for some reason it is read as 0xffffea60. Here is the file opened in a hex editor:

[Image: the file in a hex editor]

As we can see, the 2nd 32-bit number is 0000ea60, but it is read incorrectly.

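It seems that char is signed in your environment, so the byte 0xEA in the data is sign-extended to 0xFFFFFFEA when it is promoted to int. After the shift and OR in format(), those extra 1 bits overwrite the higher bytes of the result, which is why the header comes out as 0xffffea60 instead of 0x0000ea60.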
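The effect can be reproduced in isolation. The sketch below is only an illustration: widenSigned, widenUnsigned and combineLow are hypothetical helpers that are not part of the question's code. It relies on the fact that the bit pattern 0xEA, read as an 8-bit two's-complement value, is -22.

```cpp
#include <cstdint>

// Hypothetical helpers (not from the question's code) that mimic what
// integer promotion does to a single byte before it is shifted.

// A byte held in a signed char is sign-extended when widened.
constexpr std::uint32_t widenSigned(signed char c) {
    return static_cast<std::uint32_t>(static_cast<int>(c));
}

// A byte held in an unsigned char is zero-extended when widened.
constexpr std::uint32_t widenUnsigned(unsigned char c) {
    return static_cast<std::uint32_t>(c);
}

// Combines the two low-order header bytes the way format() does.
// The two high-order bytes of nImages are zero, so they are omitted here.
constexpr std::uint32_t combineLow(std::uint32_t b2, std::uint32_t b3) {
    return (b2 << 8) | b3;
}

// 0xEA as a signed 8-bit value is -22:
//   widenSigned(-22)                                      == 0xFFFFFFEA
//   combineLow(widenSigned(-22), widenSigned(0x60))       == 0xFFFFEA60
//   combineLow(widenUnsigned(0xEA), widenUnsigned(0x60))  == 0x0000EA60
```

The second value matches the wrong nImages printed in the question, and the third is the expected 60,000.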

To prevent this, you should use unsigned char instead of char, both for the element type of bytes and for the parameter of format(). Because file.read() takes a char*, the unsigned buffer then has to be passed with a reinterpret_cast<char*> at the call site.
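A minimal sketch of that change follows, written as free functions rather than DataHandler members so that it stands alone. The names formatBigEndian and readHeader are assumptions made for this illustration, and the extra casts to std::uint32_t are an addition beyond the suggestion above: they keep the shift by 24 from reaching the sign bit of int.

```cpp
#include <cstdint>
#include <istream>

// Assembles four big-endian bytes into a host-order 32-bit value.
// unsigned char keeps every byte in the range 0..255, so nothing is
// sign-extended when the bytes are widened.
constexpr std::uint32_t formatBigEndian(const unsigned char* bytes) {
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
}

// Hypothetical helper (not in the question's code): reads one 32-bit
// big-endian header from a stream. Returns false if 4 bytes could not be read.
bool readHeader(std::istream& in, std::uint32_t& out) {
    unsigned char bytes[4];
    // istream::read expects char*, so the unsigned buffer needs a cast.
    if (!in.read(reinterpret_cast<char*>(bytes), sizeof(bytes))) {
        return false;
    }
    out = formatBigEndian(bytes);
    return true;
}
```

With the file opened as in the question, calling readHeader(file, headers[i]) four times in the loop would fill the header array, and the bytes 00 00 ea 60 then yield 60,000. Checking the return value also catches a truncated file, which the original loop does not.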


 