How to prevent the compiler from optimizing away memory accesses when benchmarking read() vs mmap() performance?

I would like to benchmark read() against mmap() in a C program that reads 10GB of data. Once I have read or mmap'ed the data into a buffer, what should I do to make sure the data was actually read?

At the moment I call the following function after every read() and once after the single mmap() call to make sure the data is actually in memory:

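#include <stddef.h>  /* size_t */

/* Touches every byte of the buffer by adding it to a volatile counter. */
void use_data(void *data, size_t length) {
    volatile int c = 0;
    for (size_t i = 0; i < length; i++) {
        /* volatile read-modify-write of c on every iteration */
        c += *((char *) data + i);
    }
}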

However, I suspect this introduces overhead of its own. Perhaps the read() and mmap() cases can even be treated differently:

In the read() case I think no explicit data access is needed, because read() copies the data into the buffer anyway. In the mmap() case, however, I think some kind of summing or counting needs to be performed to make the kernel load every page.

Any recommendations?

You don't need to access the volatile variable for each byte you process. Sum all bytes into a local variable, then write the sum to a volatile variable once at the end.
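A minimal sketch of that idea, assuming a file-scope sink named g_sink (a hypothetical name) and a use_data() that returns the sum so it can be checked; neither detail comes from the question:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink. A store to a volatile object is an observable
   side effect, so the compiler has to keep it. */
static volatile uint32_t g_sink;

/* Sums every byte in a plain local accumulator, then publishes the
   result with a single volatile store. */
uint32_t use_data(const void *data, size_t length) {
    const unsigned char *p = data;
    uint32_t sum = 0;  /* unsigned: wraparound is well defined */
    for (size_t i = 0; i < length; i++) {
        sum += p[i];
    }
    g_sink = sum;      /* the only volatile access */
    return sum;
}
```

Because the accumulator is an ordinary local, the compiler is free to keep it in a register and vectorize the loop, which should keep the checking overhead small compared to the volatile-per-byte version.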

In fact, you don't need a volatile variable at all. Any opaque sink that the compiler cannot prove to be unneeded will do. Writing the sum to a temp file, for example, is guaranteed to work as well.
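As a rough sketch of the I/O-sink variant: the helper names checksum_bytes() and sink_checksum() are made up for illustration, and the stream could be a temp file (for example one opened with tmpfile()) or simply stderr.

```c
#include <stddef.h>
#include <stdio.h>

/* Sums every byte of the buffer; no volatile involved. */
unsigned long checksum_bytes(const void *data, size_t length) {
    const unsigned char *p = data;
    unsigned long sum = 0;
    for (size_t i = 0; i < length; i++) {
        sum += p[i];
    }
    return sum;
}

/* Writes the checksum to a stream. Output is observable behavior,
   so the sum (and every byte feeding it) must really be computed.
   Returns 0 on success, -1 on failure. */
int sink_checksum(FILE *out, unsigned long sum) {
    if (fprintf(out, "%lu\n", sum) < 0) {
        return -1;
    }
    return fflush(out) == 0 ? 0 : -1;
}
```

Calling sink_checksum() once after the timed region, rather than after every read(), keeps the I/O cost of the sink itself out of the measurement.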

Note that this is not just a hack to make the compiler cooperate. Every byte is guaranteed to be touched, because each one can influence the result, and the result is needed for external I/O. The standard does not allow that to be optimized away.

If alignment allows, sum in bigger units such as 32 or 64 bits. Use unsigned types to avoid UB on overflow. You want to be memory/IO bound, not ALU bound. You can create instruction-level parallelism by summing multiple independent streams into multiple local accumulator variables.
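One way this could look, as a sketch only: sum_words() is a hypothetical name, four accumulators is an arbitrary choice, and the 64-bit loads go through memcpy() so the code does not depend on the buffer's alignment (compilers typically turn these fixed-size copies into plain loads).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sums the buffer as 64-bit words using four independent accumulators,
   then adds any leftover tail bytes one at a time. */
uint64_t sum_words(const void *data, size_t length) {
    const unsigned char *p = data;
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* unsigned: no UB on overflow */
    size_t i = 0;

    /* Main loop: 32 bytes per iteration, one word per accumulator. */
    for (; i + 32 <= length; i += 32) {
        uint64_t w0, w1, w2, w3;
        memcpy(&w0, p + i, 8);
        memcpy(&w1, p + i + 8, 8);
        memcpy(&w2, p + i + 16, 8);
        memcpy(&w3, p + i + 24, 8);
        s0 += w0;
        s1 += w1;
        s2 += w2;
        s3 += w3;
    }

    /* Tail: fewer than 32 bytes remain. */
    for (; i < length; i++) {
        s0 += p[i];
    }
    return s0 + s1 + s2 + s3;
}
```

The four additions in the main loop do not depend on each other, so the CPU can overlap them. The returned value still has to go to a sink such as a volatile store or an output call, otherwise the whole loop may be removed.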
