Why do arrays of different integer sizes have different performance?

Question

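I have the following issue:

The write time to a std::array of int8, int16, int32, and int64 doubles with each increase in element size. I can understand such behavior on an 8-bit CPU, but not on a 32/64-bit one.

Why does a 32-bit system need 4 times as long to store 32-bit values as it does to store 8-bit values?

Here is my test code: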

#include <iostream>
#include <array>
#include <chrono>
#include <cstdint>  // std::int8_t ... std::int64_t

std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;

void PutZero()
{
    auto point1 = std::chrono::high_resolution_clock::now();
    for (auto &v : int8Array) v = 0;
    auto point2 = std::chrono::high_resolution_clock::now();
    for (auto &v : int16Array) v = 0;
    auto point3 = std::chrono::high_resolution_clock::now();
    for (auto &v : int32Array) v = 0;
    auto point4 = std::chrono::high_resolution_clock::now();
    for (auto &v : int64Array) v = 0;
    auto point5 = std::chrono::high_resolution_clock::now();
    std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
    std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
    std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
    std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}

int main()
{
    PutZero();
    std::cout << std::endl << "Press enter to exit" << std::endl;
    std::cin.get();
    return 0;
}

I compile it under Linux with: g++ -o array_issue_1 main.cpp -O3 -std=c++14

My results are as follows:

Time of processing int8 array:  9922us.   
Time of processing int16 array: 37717us.   
Time of processing int32 array: 76064us.   
Time of processing int64 array: 146803us.

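If I compile with -O2, the int8 result is 5 times worse!

You can also compile this source on Windows. You will get a similar relationship between the results.

Update #1

When I compile with -O2, my results are as follows: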

Time of processing int8 array:  60182us.  
Time of processing int16 array: 77807us.  
Time of processing int32 array: 114204us.  
Time of processing int64 array: 186664us.

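I have not analyzed the assembler output. My main point is that I want to write efficient code in C++, and results like these suggest that something like std::array can be challenging from a performance perspective and somewhat counter-intuitive.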

Answer 1

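Why does a 32-bit system need 4 times as long to store 32-bit values as it does to store 8-bit values?

It doesn't. But your benchmark has 3 different problems that are giving you those results.

1. You're not pre-faulting the memory, so you're page-faulting the arrays during the benchmark. These page faults, along with the OS kernel interaction, dominate the measured time.
2. The compiler with -O3 is completely defeating your benchmark by converting all your loops into memset().
3. Your benchmark is memory-bound, so you're measuring the speed of your memory instead of the core.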

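Problem 1: The test data is not pre-faulted

Your arrays are declared, but not used before the benchmark. Because of the way the kernel and memory allocation work, they are not mapped into memory yet. That only happens when you first touch them, and when it does, the kernel has to map each page, which is expensive.

This can be fixed by touching all the arrays before the benchmark.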
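One way to do that touching is sketched below. It is not the code behind the linked runs; PreFault and kAssumedPageSize are hypothetical names, and the 4096-byte page size is an assumption that does not hold on every system. The helper reads and writes back one byte per page, so every page is mapped but the array contents are left unchanged.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed page size; 4096 bytes is common but not guaranteed.
constexpr std::size_t kAssumedPageSize = 4096;

// Hypothetical helper: touch one byte in every page of the array so the
// kernel maps the pages before any timing starts.
template <typename T, std::size_t N>
void PreFault(std::array<T, N>& a) {
    const std::size_t size = sizeof(T) * N;
    if (size == 0) return;
    // volatile keeps the compiler from dropping the accesses.
    volatile unsigned char* bytes =
        reinterpret_cast<volatile unsigned char*>(a.data());
    for (std::size_t i = 0; i < size; i += kAssumedPageSize) {
        bytes[i] = bytes[i];  // read and write back: contents stay unchanged
    }
    // The array may not start on a page boundary, so also touch the last byte.
    bytes[size - 1] = bytes[size - 1];
}
```

In the benchmark this would be called once per array, for example PreFault(int8Array), before the first call to std::chrono::high_resolution_clock::now().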

Without pre-faulting: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18

g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array:  28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.

With pre-faulting: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128

g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array:  6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.

The times drop by a factor of about 4.5. In other words, your original benchmark was measuring the kernel more than the actual code.

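Problem 2: The compiler is defeating the benchmark

The compiler recognizes your pattern of writing zeros and replaces every loop with a call to memset(). So in effect, you're measuring calls to memset() with different sizes.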

  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 67108864
  mov edi, OFFSET FLAT:int8Array
  mov r14, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 134217728
  mov edi, OFFSET FLAT:int16Array
  mov r13, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 268435456
  mov edi, OFFSET FLAT:int32Array
  mov r12, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 536870912
  mov edi, OFFSET FLAT:int64Array
  mov rbp, rax
  call memset
  call std::chrono::_V2::system_clock::now()

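The optimization that does this is -ftree-loop-distribute-patterns. Even if you turn it off, the vectorizer will give you a similar effect.

With -O2, both vectorization and pattern recognition are disabled (at least with the GCC version used here), so the compiler gives you what you wrote.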
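If you want one store per element regardless of optimization level, a different technique from the -O2 approach used in this answer is to write through a volatile pointer. The sketch below assumes the compiler treats stores through a volatile-qualified pointer as volatile accesses, which GCC and Clang do; ZeroEachElement is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: zero the array with exactly one store per element.
// Because every store is a volatile access, the compiler may not merge the
// stores, vectorize them, or replace the loop with a call to memset().
template <typename T, std::size_t N>
void ZeroEachElement(std::array<T, N>& a) {
    volatile T* p = a.data();
    for (std::size_t i = 0; i < N; ++i) {
        p[i] = 0;
    }
}
```

This pins down what is being measured, but it also rules out optimizations that real code would benefit from, so the timings describe element-by-element stores rather than the fastest way to clear an array.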

.L4:
  mov BYTE PTR [rax], 0         ;; <<------ 1 byte at a time
  add rax, 1
  cmp rdx, rax
  jne .L4
  call std::chrono::_V2::system_clock::now()
  mov rbp, rax
  mov eax, OFFSET FLAT:int16Array
  lea rdx, [rax+134217728]
.L5:
  xor ecx, ecx
  add rax, 2
  mov WORD PTR [rax-2], cx      ;; <<------ 2 bytes at a time
  cmp rdx, rax
  jne .L5
  call std::chrono::_V2::system_clock::now()
  mov r12, rax
  mov eax, OFFSET FLAT:int32Array
  lea rdx, [rax+268435456]
.L6:
  mov DWORD PTR [rax], 0        ;; <<------ 4 bytes at a time
  add rax, 4
  cmp rax, rdx
  jne .L6
  call std::chrono::_V2::system_clock::now()
  mov r13, rax
  mov eax, OFFSET FLAT:int64Array
  lea rdx, [rax+536870912]
.L7:
  mov QWORD PTR [rax], 0        ;; <<------ 8 bytes at a time
  add rax, 8
  cmp rdx, rax
  jne .L7
  call std::chrono::_V2::system_clock::now()

With -O2: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e

g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array:  28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.

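Now it's clear that the smaller word sizes are slower per byte written. But if all the word sizes ran at the same speed, you would expect the times to be flat. They aren't, and the reason is memory bandwidth.

Problem 3: Memory bandwidth

Because the benchmark (as written) only writes zeros, it easily saturates the memory bandwidth of the core/system. So the timings end up depending on how much memory is touched.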
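A quick way to check this is to convert the timings into throughput. The sketch below does that arithmetic for the 64 * 1024 * 1024 element arrays from the question; BytesPerMicrosecond and kElements are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element count shared by all four arrays in the question's benchmark.
constexpr std::size_t kElements = 64 * 1024 * 1024;

// Hypothetical helper: bytes written per microsecond for an array of
// kElements values of type T that took `microseconds` to zero.
template <typename T>
double BytesPerMicrosecond(long long microseconds) {
    assert(microseconds > 0);
    return static_cast<double>(kElements * sizeof(T)) /
           static_cast<double>(microseconds);
}
```

Plugging in the pre-faulted -O3 timings from Problem 1 (6216, 12472, 24961, and 49886 us) gives between roughly 10,750 and 10,800 bytes per microsecond for all four element sizes, or about 10.8 GB/s. The time is proportional to the bytes written, not to the number of elements, which is what a memory-bound loop looks like.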

To get around this, we need to shrink the data set so that it fits in cache. To compensate, we loop over the same data many times.

std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;

...

auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();

Now the timings are much flatter across the different word sizes:

http://coliru.stacked-crooked.com/a/f534f98f6d840c5c

g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array:  20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.

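The reason they aren't completely flat is probably that numerous other factors are involved in the compiler's optimizations. You might need to resort to loop unrolling to get any closer.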
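For reference, manual unrolling could look like the sketch below. It is not taken from the linked runs, ZeroUnrolled4 is a hypothetical name, and the factor of 4 is arbitrary. The compiler is free to merge or rearrange these stores, so whether this flattens the timings has to be measured.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: zero the array four elements per loop iteration,
// which reduces the loop overhead (add, compare, branch) per store.
template <typename T, std::size_t N>
void ZeroUnrolled4(std::array<T, N>& a) {
    static_assert(N % 4 == 0, "this sketch assumes N is a multiple of 4");
    T* p = a.data();
    for (std::size_t i = 0; i < N; i += 4) {
        p[i] = 0;
        p[i + 1] = 0;
        p[i + 2] = 0;
        p[i + 3] = 0;
    }
}
```

With the 512-element arrays above, each call runs 128 iterations instead of 512.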

Solution 1
64 Accepted 2017-10-18 18:32:08