
Parsing binary file too slow in C++ using memory-mapped files

I'm trying to parse a binary file integer by integer to check whether each value fulfills a certain condition, but the loop is very slow.

Furthermore, I found that memory-mapped files are the fastest way to read a file into memory, hence I'm using the following Boost-based code:

unsigned long long int get_file_size(const char *file_path) {
    const filesystem::path file{file_path};
    const auto generic_path = file.generic_path();
    return filesystem::file_size(generic_path);
}

boost::iostreams::mapped_file_source read_bytes(const char *file_path,
                                         const unsigned long long int offset,
                                         const unsigned long long int length) {
    boost::iostreams::mapped_file_params parameters;
    parameters.path = file_path;
    parameters.length = static_cast<size_t>(length);
    parameters.flags = boost::iostreams::mapped_file::mapmode::readonly;
    parameters.offset = static_cast<boost::iostreams::stream_offset>(offset);

    boost::iostreams::mapped_file_source file;

    file.open(parameters);
    return file;
}

boost::iostreams::mapped_file_source read_bytes(const char *file_path) {
    const auto file_size = get_file_size(file_path);
    const auto mapped_file_source = read_bytes(file_path, 0, file_size);
    return mapped_file_source;
}

My test case roughly looks as follows:

inline auto test_parsing_binary_file_performance() {
    const auto start_time = get_time();
    const std::filesystem::path input_file_path = "...";
    const auto mapped_file_source = read_bytes(input_file_path.string().c_str());
    const auto file_buffer = mapped_file_source.data();
    const auto file_buffer_size = mapped_file_source.size();
    LOG_S(INFO) << "File buffer size: " << file_buffer_size;
    auto printed_lap = (long) (file_buffer_size / (double) 1000);
    printed_lap = round_to_nearest_multiple(printed_lap, sizeof(int));
    LOG_S(INFO) << "Printed lap: " << printed_lap;
    std::vector<int> values;
    values.reserve(file_buffer_size / sizeof(int)); // Pre-allocate a large enough vector
    // Iterate over every integer
    for (auto file_buffer_index = 0; file_buffer_index < file_buffer_size; file_buffer_index += sizeof(int)) {
        const auto value = *(int *) &file_buffer[file_buffer_index];
        if (value >= 0x30000000 && value < 0x49000000 - sizeof(int) + 1) {
            values.push_back(value);
        }

        if (file_buffer_index % printed_lap == 0) {
            LOG_S(INFO) << std::setprecision(4) << file_buffer_index / (double) file_buffer_size * 100 << "%";
        }
    }

    LOG_S(INFO) << "Values found count: " << values.size();

    print_time_taken(start_time, false, "Parsing binary file");
}

The memory-mapped file reading finishes almost instantly as expected, but iterating over it integer-wise is far too slow on my machine despite good hardware (SSD etc.):

2020-12-20 13:04:35.124 (   0.019s) [main thread     ]Tests.hpp:387   INFO| File buffer size: 419430400
2020-12-20 13:04:35.124 (   0.019s) [main thread     ]Tests.hpp:390   INFO| Printed lap: 419432
2020-12-20 13:04:35.135 (   0.029s) [main thread     ]Tests.hpp:405   INFO| 0%
2020-12-20 13:04:35.171 (   0.065s) [main thread     ]Tests.hpp:405   INFO| 0.1%
2020-12-20 13:04:35.196 (   0.091s) [main thread     ]Tests.hpp:405   INFO| 0.2%
2020-12-20 13:04:35.216 (   0.111s) [main thread     ]Tests.hpp:405   INFO| 0.3%
2020-12-20 13:04:35.241 (   0.136s) [main thread     ]Tests.hpp:405   INFO| 0.4%
2020-12-20 13:04:35.272 (   0.167s) [main thread     ]Tests.hpp:405   INFO| 0.5%
2020-12-20 13:04:35.293 (   0.188s) [main thread     ]Tests.hpp:405   INFO| 0.6%
2020-12-20 13:04:35.314 (   0.209s) [main thread     ]Tests.hpp:405   INFO| 0.7%
2020-12-20 13:04:35.343 (   0.237s) [main thread     ]Tests.hpp:405   INFO| 0.8%
2020-12-20 13:04:35.366 (   0.261s) [main thread     ]Tests.hpp:405   INFO| 0.9%
2020-12-20 13:04:35.399 (   0.293s) [main thread     ]Tests.hpp:405   INFO| 1%
2020-12-20 13:04:35.421 (   0.315s) [main thread     ]Tests.hpp:405   INFO| 1.1%
2020-12-20 13:04:35.447 (   0.341s) [main thread     ]Tests.hpp:405   INFO| 1.2%
2020-12-20 13:04:35.468 (   0.362s) [main thread     ]Tests.hpp:405   INFO| 1.3%
2020-12-20 13:04:35.487 (   0.382s) [main thread     ]Tests.hpp:405   INFO| 1.4%
2020-12-20 13:04:35.520 (   0.414s) [main thread     ]Tests.hpp:405   INFO| 1.5%
2020-12-20 13:04:35.540 (   0.435s) [main thread     ]Tests.hpp:405   INFO| 1.6%
2020-12-20 13:04:35.564 (   0.458s) [main thread     ]Tests.hpp:405   INFO| 1.7%
2020-12-20 13:04:35.586 (   0.480s) [main thread     ]Tests.hpp:405   INFO| 1.8%
2020-12-20 13:04:35.608 (   0.503s) [main thread     ]Tests.hpp:405   INFO| 1.9%
2020-12-20 13:04:35.636 (   0.531s) [main thread     ]Tests.hpp:405   INFO| 2%
2020-12-20 13:04:35.658 (   0.552s) [main thread     ]Tests.hpp:405   INFO| 2.1%
2020-12-20 13:04:35.679 (   0.574s) [main thread     ]Tests.hpp:405   INFO| 2.2%
2020-12-20 13:04:35.702 (   0.597s) [main thread     ]Tests.hpp:405   INFO| 2.3%
2020-12-20 13:04:35.727 (   0.622s) [main thread     ]Tests.hpp:405   INFO| 2.4%
2020-12-20 13:04:35.769 (   0.664s) [main thread     ]Tests.hpp:405   INFO| 2.5%
2020-12-20 13:04:35.802 (   0.697s) [main thread     ]Tests.hpp:405   INFO| 2.6%
2020-12-20 13:04:35.831 (   0.726s) [main thread     ]Tests.hpp:405   INFO| 2.7%
2020-12-20 13:04:35.860 (   0.754s) [main thread     ]Tests.hpp:405   INFO| 2.8%
2020-12-20 13:04:35.887 (   0.781s) [main thread     ]Tests.hpp:405   INFO| 2.9%
2020-12-20 13:04:35.924 (   0.818s) [main thread     ]Tests.hpp:405   INFO| 3%
2020-12-20 13:04:35.956 (   0.850s) [main thread     ]Tests.hpp:405   INFO| 3.1%
2020-12-20 13:04:35.998 (   0.893s) [main thread     ]Tests.hpp:405   INFO| 3.2%
2020-12-20 13:04:36.033 (   0.928s) [main thread     ]Tests.hpp:405   INFO| 3.3%
2020-12-20 13:04:36.060 (   0.955s) [main thread     ]Tests.hpp:405   INFO| 3.4%
2020-12-20 13:04:36.102 (   0.997s) [main thread     ]Tests.hpp:405   INFO| 3.5%
2020-12-20 13:04:36.132 (   1.026s) [main thread     ]Tests.hpp:405   INFO| 3.6%
...
2020-12-20 13:05:03.456 (  28.351s) [main thread     ]Tests.hpp:410   INFO| Values found count: 10650389
2020-12-20 13:05:03.456 (  28.351s) [main thread     ]          benchmark.cpp:31    INFO| Parsing binary file took 28.341 second(s)

Parsing those 419 MB always takes around 28-70 seconds. Even compiling in Release mode does not really help. Is there any way to cut this time down? It doesn't seem like the operation I'm performing should be that inefficient.

Note that I'm compiling for 64-bit Linux using GCC 10.

EDIT:
As suggested in the comments, using memory-mapped files with advise() also does not help performance:

boost::interprocess::file_mapping file_mapping(input_file_path.string().data(), boost::interprocess::read_only);
boost::interprocess::mapped_region mapped_region(file_mapping, boost::interprocess::read_only);
mapped_region.advise(boost::interprocess::mapped_region::advice_sequential);
const auto file_buffer = (char *) mapped_region.get_address();
const auto file_buffer_size = mapped_region.get_size();
...

Lessons learned so far, taking into account the comments/answers:

  • Using advise(boost::interprocess::mapped_region::advice_sequential) does not help
  • Not calling reserve(), or calling it with exactly the right size, can double the performance
  • Iterating directly on an int * is slightly slower than iterating on a char *
  • Using a std::set is slightly slower than a std::vector for collecting the results
  • The progress logging has no significant impact on performance

As hinted by xanatos, memory-mapped files are deceiving in terms of performance, since they don't actually read the entire file into memory in an instant. During processing, page misses cause multiple disk accesses, severely degrading performance.

In this case it is more efficient to read the entire file into memory first and then iterate through it:

inline std::vector<std::byte> load_file_into_memory(const std::filesystem::path &file_path) {
    std::ifstream input_stream(file_path, std::ios::binary | std::ios::ate);

    if (input_stream.fail()) {
        const auto error_message = "Opening " + file_path.string() + " failed";
        throw std::runtime_error(error_message);
    }

    auto current_read_position = input_stream.tellg();
    input_stream.seekg(0, std::ios::beg);

    auto file_size = std::size_t(current_read_position - input_stream.tellg());
    if (file_size == 0) {
        return {};
    }

    std::vector<std::byte> buffer(file_size);

    if (!input_stream.read((char *) buffer.data(), buffer.size())) {
        const auto error_message = "Reading from " + file_path.string() + " failed";
        throw std::runtime_error(error_message);
    }

    return buffer;
}
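A possible way to wire this into the test case, as a rough sketch: input_file_path, the range check, and the worst-case reserve() call mirror the original test, and the loop bound simply guards against a trailing partial integer.

```cpp
const auto file_buffer = load_file_into_memory(input_file_path);
const auto file_buffer_size = file_buffer.size();

std::vector<int> values;
values.reserve(file_buffer_size / sizeof(int)); // pre-allocate as in the original test

// Scan the in-memory copy integer by integer, exactly like the mapped-file loop.
for (std::size_t i = 0; i + sizeof(int) <= file_buffer_size; i += sizeof(int)) {
    const auto value = *(const int *) &file_buffer[i];
    if (value >= 0x30000000 && value < 0x49000000 - sizeof(int) + 1) {
        values.push_back(value);
    }
}
```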

Now the performance is much more acceptable, at roughly 3-15 seconds in total.

This reminds me of my first encounter with slowness, some 40 years ago, caused by a percentage bar measuring the progress. Comment that part out and measure again. Also measure the capacity reserve and check the actual capacity needed - if it is 1%, then you are wasting space and hence time.

  • unsigned long long might be costly. Isn't unsigned long sufficient?
  • Modulo and division might be extra costly.
  • The progress logging might be slow; ideally move it to a separate thread, and check whether flushing (counter-intuitively) might not be faster (a rough sketch of the separate-thread idea follows the revised loop below).

So:

const auto pct_factor = file_buffer_size == 0 ? 0.0 : 100 / (double) file_buffer_size;
values.reserve(file_buffer_size / sizeof(int));
// Count down instead of computing a modulo to decide when to log progress
long pct_countdown = 0;
for (std::size_t file_buffer_index = 0; file_buffer_index < file_buffer_size; file_buffer_index += sizeof(int)) {
    const auto value = *(int *) &file_buffer[file_buffer_index];
    if (value >= 0x30000000 && value < 0x49000000 - sizeof(int) + 1) {
        values.push_back(value);
    }

    if (pct_countdown-- < 0) {
        pct_countdown = printed_lap;
        const auto pct = file_buffer_index * pct_factor;
        LOG_S(INFO) << std::setprecision(4) << pct << "%";
    }
}
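The separate-thread suggestion from the list above could look roughly like the following sketch. The function name scan_with_progress_thread is hypothetical; file_buffer, file_buffer_size, values, and the range check are taken from the question, LOG_S is the same logging macro used there, and the 250 ms reporting interval is arbitrary.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper illustrating the separate-thread idea: the hot loop only updates a
// relaxed atomic counter, while a background thread converts it to an integer percentage
// at a fixed interval, so the scan itself does no logging, division, or modulo.
void scan_with_progress_thread(const char *file_buffer, std::size_t file_buffer_size,
                               std::vector<int> &values) {
    std::atomic<std::size_t> processed{0};
    std::atomic<bool> done{false};

    std::thread progress_thread([&] {
        while (!done.load(std::memory_order_relaxed)) {
            const auto pct = file_buffer_size == 0
                                 ? std::size_t{100}
                                 : processed.load(std::memory_order_relaxed) * 100 / file_buffer_size;
            LOG_S(INFO) << pct << "%"; // LOG_S as used in the question
            std::this_thread::sleep_for(std::chrono::milliseconds(250));
        }
    });

    for (std::size_t i = 0; i + sizeof(int) <= file_buffer_size; i += sizeof(int)) {
        const auto value = *(const int *) &file_buffer[i];
        if (value >= 0x30000000 && value < 0x49000000 - sizeof(int) + 1) {
            values.push_back(value);
        }
        // A relaxed store per iteration keeps the sketch simple; it could be batched further.
        processed.store(i, std::memory_order_relaxed);
    }

    done.store(true, std::memory_order_relaxed);
    progress_thread.join();
}
```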
  • Integer percentages would be even better, discarding precision somewhat.
  • The bulk data values - is it needed as such? A set could be sufficient.

I admit that I have my doubts about *(int *). Using an int * pointer and incrementing it seems more direct too.
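For illustration, a minimal sketch of that pointer-based variant, assuming file_buffer and file_buffer_size come from the question, the buffer is suitably aligned for int access, and its size is (rounded down to) a multiple of sizeof(int):

```cpp
const int *current = (const int *) file_buffer;
const int *const end = current + file_buffer_size / sizeof(int);

for (; current != end; ++current) {
    const int value = *current;
    if (value >= 0x30000000 && value < 0x49000000 - sizeof(int) + 1) {
        values.push_back(value);
    }
}
```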
