Why do arrays of different integer sizes have different performance?
I have the following issue: the write times to a std::array for int8, int16, int32 and int64 double with each size increase. I can understand such behavior for an 8-bit CPU, but not for a 32/64-bit one.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
Here is my test code:
#include <iostream>
#include <array>
#include <chrono>
std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;
void PutZero()
{
    auto point1 = std::chrono::high_resolution_clock::now();
    for (auto &v : int8Array) v = 0;
    auto point2 = std::chrono::high_resolution_clock::now();
    for (auto &v : int16Array) v = 0;
    auto point3 = std::chrono::high_resolution_clock::now();
    for (auto &v : int32Array) v = 0;
    auto point4 = std::chrono::high_resolution_clock::now();
    for (auto &v : int64Array) v = 0;
    auto point5 = std::chrono::high_resolution_clock::now();
    std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
    std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
    std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
    std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}
int main()
{
    PutZero();
    std::cout << std::endl << "Press enter to exit" << std::endl;
    std::cin.get();
    return 0;
}
I compile it under Linux with:
g++ -o array_issue_1 main.cpp -O3 -std=c++14
and my results are as follows:
Time of processing int8 array: 9922us.
Time of processing int16 array: 37717us.
Time of processing int32 array: 76064us.
Time of processing int64 array: 146803us.
If I compile with -O2, then the results are 5 times worse for int8! You can also compile this source on Windows; you will get a similar relation between the results.
Update #1
When I compile with -O2, my results are as follows:
Time of processing int8 array: 60182us.
Time of processing int16 array: 77807us.
Time of processing int32 array: 114204us.
Time of processing int64 array: 186664us.
I didn't analyze the assembler output. My main point is that I would like to write efficient code in C++, and results like these show that things like std::array can be challenging from a performance perspective and somewhat counter-intuitive.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
It doesn't. But there are 3 different issues with your benchmark that are giving you those results.
-O3 is completely defeating your benchmark by converting all your loops into memset().
Problem 1: The Test Data is not Prefaulted
Your arrays are declared, but not used before the benchmark. Because of the way the kernel and memory allocation work, they are not mapped into memory yet. This only happens when you first touch them. And when it does, it incurs a very large penalty from the kernel to map the page.
Prefaulting can be done by touching all the arrays before the benchmark.
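One way to do the prefaulting is to write a single byte into each page of every array before the timed section starts. The following is a minimal sketch; the `prefault` helper and the 4 KiB page-size constant are assumptions for illustration, not code from the original answer:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: touch one byte per page so the kernel maps the
// backing pages before the timed loops run. 4 KiB is assumed for the
// page size; one write per page is enough to trigger the fault.
template <typename Array>
void prefault(Array &a)
{
    constexpr std::size_t kPageSize = 4096;
    auto *bytes = reinterpret_cast<volatile unsigned char *>(a.data());
    const std::size_t nbytes = a.size() * sizeof(a[0]);
    for (std::size_t i = 0; i < nbytes; i += kPageSize)
        bytes[i] = 0;                 // first touch faults the page in
    if (nbytes != 0)
        bytes[nbytes - 1] = 0;        // also touch the final page
}
```

Calling `prefault(int8Array);` (and likewise for the other three arrays) at the start of main(), before PutZero(), moves the page-fault cost out of the measurement.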
No Pre-Faulting: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.
With Pre-Faulting: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.
The times drop by roughly a factor of 4. In other words, your original benchmark was measuring more of the kernel than of the actual code.
Problem 2: The Compiler is Defeating the Benchmark
The compiler recognizes your pattern of writing zeros and completely replaces all your loops with calls to memset(). So in effect, you're measuring calls to memset() with different sizes.
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 67108864
mov edi, OFFSET FLAT:int8Array
mov r14, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 134217728
mov edi, OFFSET FLAT:int16Array
mov r13, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 268435456
mov edi, OFFSET FLAT:int32Array
mov r12, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 536870912
mov edi, OFFSET FLAT:int64Array
mov rbp, rax
call memset
call std::chrono::_V2::system_clock::now()
The optimization that's doing this is -ftree-loop-distribute-patterns. Even if you turn that off, the vectorizer will give you a similar effect.
With -O2, vectorization and pattern recognition are both disabled, so the compiler gives you what you write.
.L4:
mov BYTE PTR [rax], 0 ;; <<------ 1 byte at a time
add rax, 1
cmp rdx, rax
jne .L4
call std::chrono::_V2::system_clock::now()
mov rbp, rax
mov eax, OFFSET FLAT:int16Array
lea rdx, [rax+134217728]
.L5:
xor ecx, ecx
add rax, 2
mov WORD PTR [rax-2], cx ;; <<------ 2 bytes at a time
cmp rdx, rax
jne .L5
call std::chrono::_V2::system_clock::now()
mov r12, rax
mov eax, OFFSET FLAT:int32Array
lea rdx, [rax+268435456]
.L6:
mov DWORD PTR [rax], 0 ;; <<------ 4 bytes at a time
add rax, 4
cmp rax, rdx
jne .L6
call std::chrono::_V2::system_clock::now()
mov r13, rax
mov eax, OFFSET FLAT:int64Array
lea rdx, [rax+536870912]
.L7:
mov QWORD PTR [rax], 0 ;; <<------ 8 bytes at a time
add rax, 8
cmp rdx, rax
jne .L7
call std::chrono::_V2::system_clock::now()
With -O2: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.
Now it's clear that the smaller word sizes are slower. But you would expect the times to be flat if all the word sizes were written at the same speed. The reason they aren't is memory bandwidth.
Problem 3: Memory Bandwidth
Because the benchmark (as written) is only writing zeros, it easily saturates the memory bandwidth of the core/system. So the benchmark is affected by how much memory is touched.
To fix that, we need to shrink the dataset so that it fits into cache. To compensate, we loop over the same data multiple times.
std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;
...
auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();
Now we see timings that are a lot more flat for the different word sizes:
http://coliru.stacked-crooked.com/a/f534f98f6d840c5c
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.
The reason why it isn't completely flat is probably that there are numerous other factors involved in the compiler optimizations. You might need to resort to loop unrolling to get any closer.
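For example, a manually unrolled variant of the inner loop might look like the sketch below; `zero_unrolled` is a hypothetical name, and the code assumes the array length is a multiple of 4 (as 512 is):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical unrolled variant: four stores per iteration reduce the
// loop-control overhead (increment, compare, branch) per byte written.
// Assumes the array length is a multiple of 4.
void zero_unrolled(std::array<std::int8_t, 512> &a)
{
    for (std::size_t i = 0; i < a.size(); i += 4) {
        a[i]     = 0;
        a[i + 1] = 0;
        a[i + 2] = 0;
        a[i + 3] = 0;
    }
}
```

Whether this helps depends on what the optimizer already does with the simple loop, so it is worth checking the generated assembly either way.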