Why do arrays of different integer sizes have different performance?
I have the following issue: the write times to a std::array for int8, int16, int32 and int64 double with each size increase. I can understand such behavior for an 8-bit CPU, but not for a 32/64-bit one.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
Here is my test code:
#include <iostream>
#include <array>
#include <chrono>
std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;
void PutZero()
{
    auto point1 = std::chrono::high_resolution_clock::now();
    for (auto &v : int8Array) v = 0;
    auto point2 = std::chrono::high_resolution_clock::now();
    for (auto &v : int16Array) v = 0;
    auto point3 = std::chrono::high_resolution_clock::now();
    for (auto &v : int32Array) v = 0;
    auto point4 = std::chrono::high_resolution_clock::now();
    for (auto &v : int64Array) v = 0;
    auto point5 = std::chrono::high_resolution_clock::now();
    std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
    std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
    std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
    std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}
int main()
{
    PutZero();
    std::cout << std::endl << "Press enter to exit" << std::endl;
    std::cin.get();
    return 0;
}
I compile it under Linux with:
g++ -o array_issue_1 main.cpp -O3 -std=c++14
and my results are as follows:
Time of processing int8 array: 9922us.
Time of processing int16 array: 37717us.
Time of processing int32 array: 76064us.
Time of processing int64 array: 146803us.
If I compile with -O2, then the results are 5 times worse for int8! You can also compile this source on Windows; you will get a similar relation between the results.
Update #1
When I compile with -O2, my results are as follows:
Time of processing int8 array: 60182us.
Time of processing int16 array: 77807us.
Time of processing int32 array: 114204us.
Time of processing int64 array: 186664us.
I didn't analyze the assembler output. My main point is that I would like to write efficient code in C++, and results like these show that things like std::array can be challenging from a performance perspective and somewhat counter-intuitive.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
It doesn't. But there are 3 different issues with your benchmark that are giving you those results.
-O3 is completely defeating your benchmark by converting all your loops into memset().
Problem 1: The Test Data is not Prefaulted
Your arrays are declared, but not used before the benchmark. Because of the way the kernel and memory allocation work, they are not mapped into memory yet. This only happens when you first touch them. And when it does, it incurs a very large penalty from the kernel to map the page.
Prefaulting can be done by touching all the arrays before the benchmark.
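One way to do the prefaulting is to write a single byte into each page of every array before the timed section starts. The following is a minimal sketch; the `prefault` helper and the 4 KiB page-size constant are assumptions for illustration, not code from the original answer:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: touch one byte per page so the kernel maps the
// backing pages before the timed loops run. 4 KiB is assumed for the
// page size; one write per page is enough to trigger the fault.
template <typename Array>
void prefault(Array &a)
{
    constexpr std::size_t kPageSize = 4096;
    auto *bytes = reinterpret_cast<volatile unsigned char *>(a.data());
    const std::size_t nbytes = a.size() * sizeof(a[0]);
    for (std::size_t i = 0; i < nbytes; i += kPageSize)
        bytes[i] = 0;                 // first touch faults the page in
    if (nbytes != 0)
        bytes[nbytes - 1] = 0;        // also touch the final page
}
```

Calling `prefault(int8Array);` (and likewise for the other three arrays) at the start of main(), before PutZero(), moves the page-fault cost out of the measurement.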
No Pre-Faulting: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.
With Pre-Faulting: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.
The times drop by roughly a factor of 4. In other words, your original benchmark was measuring more of the kernel than of the actual code.
Problem 2: The Compiler is Defeating the Benchmark
The compiler recognizes your pattern of writing zeros and completely replaces all your loops with calls to memset(). So in effect, you're measuring calls to memset() with different sizes.
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 67108864
mov edi, OFFSET FLAT:int8Array
mov r14, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 134217728
mov edi, OFFSET FLAT:int16Array
mov r13, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 268435456
mov edi, OFFSET FLAT:int32Array
mov r12, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 536870912
mov edi, OFFSET FLAT:int64Array
mov rbp, rax
call memset
call std::chrono::_V2::system_clock::now()
The optimization that's doing this is -ftree-loop-distribute-patterns. Even if you turn that off, the vectorizer will give you a similar effect.
With -O2, vectorization and pattern recognition are both disabled, so the compiler gives you what you write.
.L4:
mov BYTE PTR [rax], 0 ;; <<------ 1 byte at a time
add rax, 1
cmp rdx, rax
jne .L4
call std::chrono::_V2::system_clock::now()
mov rbp, rax
mov eax, OFFSET FLAT:int16Array
lea rdx, [rax+134217728]
.L5:
xor ecx, ecx
add rax, 2
mov WORD PTR [rax-2], cx ;; <<------ 2 bytes at a time
cmp rdx, rax
jne .L5
call std::chrono::_V2::system_clock::now()
mov r12, rax
mov eax, OFFSET FLAT:int32Array
lea rdx, [rax+268435456]
.L6:
mov DWORD PTR [rax], 0 ;; <<------ 4 bytes at a time
add rax, 4
cmp rax, rdx
jne .L6
call std::chrono::_V2::system_clock::now()
mov r13, rax
mov eax, OFFSET FLAT:int64Array
lea rdx, [rax+536870912]
.L7:
mov QWORD PTR [rax], 0 ;; <<------ 8 bytes at a time
add rax, 8
cmp rdx, rax
jne .L7
call std::chrono::_V2::system_clock::now()
With -O2: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.
Now it's clear that the smaller word sizes are slower. But you would expect the times to be flat if all the word sizes were written at the same speed. The reason they aren't is memory bandwidth.
Problem 3: Memory Bandwidth
Because the benchmark (as written) is only writing zeros, it easily saturates the memory bandwidth of the core/system. So the benchmark is affected by how much memory is touched.
To fix that, we need to shrink the dataset so that it fits into cache. To compensate, we loop over the same data multiple times.
std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;
...
auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();
Now we see timings that are a lot more flat for the different word sizes:
http://coliru.stacked-crooked.com/a/f534f98f6d840c5c
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.
The reason why it isn't completely flat is probably that there are numerous other factors involved in the compiler optimizations. You might need to resort to loop unrolling to get any closer.
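For example, a manually unrolled variant of the inner loop might look like the sketch below; `zero_unrolled` is a hypothetical name, and the code assumes the array length is a multiple of 4 (as 512 is):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical unrolled variant: four stores per iteration reduce the
// loop-control overhead (increment, compare, branch) per byte written.
// Assumes the array length is a multiple of 4.
void zero_unrolled(std::array<std::int8_t, 512> &a)
{
    for (std::size_t i = 0; i < a.size(); i += 4) {
        a[i]     = 0;
        a[i + 1] = 0;
        a[i + 2] = 0;
        a[i + 3] = 0;
    }
}
```

Whether this helps depends on what the optimizer already does with the simple loop, so it is worth checking the generated assembly either way.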