How to read a binary file into a vector of unsigned chars

Lately I've been asked to write a function that reads a binary file into a std::vector<BYTE>, where BYTE is an unsigned char. I quickly came up with something like this:

#include <fstream>
#include <vector>
typedef unsigned char BYTE;

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::streampos fileSize;
    std::ifstream file(filename, std::ios::binary);

    // get its size:
    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // read the data:
    std::vector<BYTE> fileData(fileSize);
    file.read((char*) &fileData[0], fileSize);
    return fileData;
}

which seems unnecessarily complicated, and the explicit cast to char* that I was forced to use when calling file.read doesn't make me feel any better about it.


Another option is to use std::istreambuf_iterator:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<char>(file)),
                              std::istreambuf_iterator<char>());
}

which is pretty simple and short, but I still have to use std::istreambuf_iterator<char> even though I'm reading into a std::vector<unsigned char>.


The last option, which seems perfectly straightforward, is to use std::basic_ifstream<BYTE>, which expresses explicitly that "I want an input file stream and I want to use it to read BYTEs":

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::basic_ifstream<BYTE> file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<BYTE>(file)),
                              std::istreambuf_iterator<BYTE>());
}

but I'm not sure whether basic_ifstream is an appropriate choice in this case.

What is the best way of reading a binary file into the vector? I'd also like to know what's happening "behind the scenes" and what possible problems I might encounter (apart from the stream not being opened properly, which can be avoided by a simple is_open check).

Is there any good reason why one would prefer to use std::istreambuf_iterator here?
(The only advantage that I can see is simplicity.)

When testing for performance, I would include a test case for:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // Stop eating new lines in binary mode!!!
    file.unsetf(std::ios::skipws);

    // get its size:
    std::streampos fileSize;

    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // reserve capacity
    std::vector<BYTE> vec;
    vec.reserve(fileSize);

    // read the data:
    vec.insert(vec.begin(),
               std::istream_iterator<BYTE>(file),
               std::istream_iterator<BYTE>());

    return vec;
}

My thinking is that the constructor in Method 1 touches the elements in the vector, and then the read touches each element again.

Method 2 and Method 3 look most promising, but could suffer one or more resizes (reallocations) as the vector grows. Hence the reserve before reading or inserting.

I would also test with std::copy:

...
std::vector<BYTE> vec;
vec.reserve(fileSize);

std::copy(std::istream_iterator<BYTE>(file),
          std::istream_iterator<BYTE>(),
          std::back_inserter(vec));

In the end, I think the best solution will avoid operator >> from istream_iterator (and all the overhead and goodness of operator >> trying to interpret binary data). But I don't know what to use that allows you to directly copy the data into the vector.
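One candidate for copying the bytes straight into the vector without operator >> is an unformatted read() into the vector's storage. The following is only a sketch under stated assumptions: the function name readFileDirect and the choice to throw std::runtime_error on failure are illustrative and do not come from the posts above.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

typedef unsigned char BYTE;

// Hypothetical helper (not from the original post): reads the whole file
// with a single unformatted read(), so operator>> is never involved.
std::vector<BYTE> readFileDirect(const std::string& filename)
{
    std::ifstream file(filename, std::ios::binary);
    if (!file)
        throw std::runtime_error("cannot open " + filename);

    // Determine the size; tellg() reports -1 on failure.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    if (size < 0)
        throw std::runtime_error("cannot determine size of " + filename);
    file.seekg(0, std::ios::beg);

    // The vector value-initializes (zeroes) its elements here.
    std::vector<BYTE> data(static_cast<std::size_t>(size));

    // read() wants char*; the cast only reinterprets the pointer type.
    // An empty file is skipped so read() is never called with no storage.
    if (!data.empty() &&
        !file.read(reinterpret_cast<char*>(data.data()), size))
        throw std::runtime_error("short read on " + filename);

    return data;
}
```

A call such as readFileDirect("mona-lisa.raw") returns the bytes unchanged, whitespace included. The trade-off is the one described above: the elements are zeroed once by the constructor and then overwritten by the read.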

Finally, my testing with binary data shows that ios::binary alone does not stop operator >> from skipping whitespace bytes. Hence the need for noskipws (declared in <ios>); the code above uses the equivalent file.unsetf(std::ios::skipws).
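The whitespace behavior can be reproduced without a file. The sketch below uses an in-memory std::istringstream as a stand-in for the file stream, which is an assumption made purely for illustration; the function names are hypothetical as well.

```cpp
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Formatted extraction: istream_iterator uses operator>>, which skips
// whitespace bytes unless the skipws flag is cleared.
std::size_t countFormatted(const std::string& bytes, bool skipWhitespace)
{
    std::istringstream in(bytes, std::ios::in | std::ios::binary);
    if (!skipWhitespace)
        in.unsetf(std::ios::skipws);
    std::vector<unsigned char> v((std::istream_iterator<unsigned char>(in)),
                                 std::istream_iterator<unsigned char>());
    return v.size();
}

// Unformatted extraction: istreambuf_iterator reads the stream buffer
// directly and never consults the skipws flag.
std::size_t countUnformatted(const std::string& bytes)
{
    std::istringstream in(bytes, std::ios::in | std::ios::binary);
    std::vector<unsigned char> v((std::istreambuf_iterator<char>(in)),
                                 std::istreambuf_iterator<char>());
    return v.size();
}
```

For the five-byte input "a b\nc", countFormatted(input, true) yields 3 because the space and the newline are dropped, while countFormatted(input, false) and countUnformatted(input) both yield 5.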

std::ifstream stream("mona-lisa.raw", std::ios::in | std::ios::binary);
std::vector<uint8_t> contents((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());

for(auto i: contents) {
    int value = i;
    std::cout << "data: " << value << std::endl;
}

std::cout << "file size: " << contents.size() << std::endl;

Since you are loading the entire file into memory, the most efficient approach is to map the file into memory. The kernel loads the file into its page cache anyway, and by mapping the file you simply expose those cached pages to your process. This is also known as zero-copy.

When you use std::vector<>, the data is copied from the kernel page cache into the std::vector<>, which is unnecessary when you just want to read the file.

Also, when two input iterators are passed to std::vector<>, it grows its buffer while reading because it does not know the file size. When std::vector<> is resized to the file size first, it needlessly zeroes out its contents, which are going to be overwritten with file data anyway. Both methods are sub-optimal in terms of space and time.

I would have thought that the first method, using the size and stream::read(), would be the most efficient. The "cost" of casting to char * is most likely zero. Casts of this kind simply tell the compiler "Hey, I know you think this is a different type, but I really want this type here...", and do not add any extra instructions. If you wish to confirm this, try reading the file into a char array and compare the actual assembler code. Aside from a little extra work to figure out the address of the buffer inside the vector, there shouldn't be any difference.

As always, the only way to tell for sure what is most efficient IN YOUR CASE is to measure it. "Asking on the internet" is not proof.
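A measurement along those lines could look roughly like the sketch below. The two readers mirror the first two methods from the question; the helper averageMicroseconds and all function names are hypothetical, and a serious benchmark would also need to account for warm-up and the state of the page cache.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

typedef unsigned char BYTE;

// Method 1 from the question: size the vector, then one read().
std::vector<BYTE> readWithRead(const std::string& filename)
{
    std::ifstream file(filename, std::ios::binary);
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();  // -1 if the stream failed
    file.seekg(0, std::ios::beg);
    std::vector<BYTE> data(size > 0 ? static_cast<std::size_t>(size) : 0);
    if (!data.empty())
        file.read(reinterpret_cast<char*>(data.data()), size);
    return data;
}

// Method 2 from the question: istreambuf_iterator range constructor.
std::vector<BYTE> readWithIterators(const std::string& filename)
{
    std::ifstream file(filename, std::ios::binary);
    return std::vector<BYTE>((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());
}

// Hypothetical timing helper: calls reader `runs` times and returns the
// average wall-clock time per call in microseconds.
template <typename Reader>
double averageMicroseconds(Reader reader, const std::string& filename, int runs)
{
    std::size_t total = 0;  // accumulate sizes so every result is used
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        total += reader(filename).size();
    const auto stop = std::chrono::steady_clock::now();
    (void)total;
    return std::chrono::duration<double, std::micro>(stop - start).count() / runs;
}
```

Calling averageMicroseconds(readWithRead, path, 100) and averageMicroseconds(readWithIterators, path, 100) on a representative file gives two numbers to compare. The results depend on the platform, the standard library, and the file size, so no particular outcome is claimed here.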

The class below extends vector with binary file load and save. I have returned to this question multiple times already, so this is the code for my next return, and for everyone else who comes looking for a binary file save method. :)

#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

class FileVector : public std::vector<uint8_t>
{
    public:

        using std::vector<uint8_t>::vector;

        void loadFromFile(const char *filename)
        {
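            std::ifstream file(filename, std::ios::in | std::ios::binary);
            // Read raw bytes: istream_iterator<uint8_t> would go through
            // operator>> and silently skip whitespace bytes.
            insert(begin(),
                std::istreambuf_iterator<char>(file),
                std::istreambuf_iterator<char>());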
        }

        void saveTofile(const char *filename) const
        {
            std::ofstream file(filename, std::ios::out | std::ios::binary);
            file.write((const char *) data(), size());
            file.close();
        }
};

NOTE: To optimize loading, consider determining the file size and pre-allocating the required space, as mentioned in other answers here.
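The pre-allocation mentioned in the note could be sketched as follows. This is a free function rather than a member of the class above, and the name loadPreallocated and its bool return value are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

// Hypothetical free function (not part of the class above): reserves the
// file size up front, then appends the raw bytes.
bool loadPreallocated(std::vector<std::uint8_t>& out, const char* filename)
{
    std::ifstream file(filename, std::ios::in | std::ios::binary);
    if (!file)
        return false;

    // Determine the size; tellg() reports -1 on failure.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    file.seekg(0, std::ios::beg);

    out.clear();
    if (size > 0)
        out.reserve(static_cast<std::size_t>(size));  // one allocation up front

    // reserve() does not zero-fill, so each byte is written only once.
    out.insert(out.end(),
               std::istreambuf_iterator<char>(file),
               std::istreambuf_iterator<char>());
    return true;
}
```

Using reserve() instead of sizing the vector avoids the zero-fill discussed in the earlier answers, at the cost of going through the iterator interface one byte at a time.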
