[英]Significant performance discrepancy between two pieces of C++ code
I am having the trouble to understand why these two following routines do not have the expected difference in term of performance.我很难理解为什么以下两个例程在性能方面没有预期的差异。
void matmul1(const float *matrix, const float *vector, float *output, uint32_t input_height, uint32_t input_width) {
for (uint32_t y = 0; y < input_height; y++) {
for (uint32_t x = 0; x < input_width; x++) {
output[y] += matrix[y * input_width + x] * vector[x];
}
}
}
void matmul2(const float *matrix, const float *vector, float *output, uint32_t input_height, uint32_t input_width) {
for (uint32_t y = 0; y < input_height; y++) {
for (uint32_t x = 0; x < input_width; x++) {
output[y] += *matrix++ * vector[x];
}
}
}
I repeat the executions of two functions on random data for 100 times on the same machines.我在同一台机器上对随机数据重复执行两个函数 100 次。 The function
matmul1
has a mean running time of 21298μs
and the function matmul2
has a mean running time of 24034μs
. function
matmul1
的平均运行时间为21298μs
,function matmul2
的平均运行时间为24034μs
。 The standard deviation of the samples are 198
and 171
.样本的标准差为
198
和171
。
Disassembly https://godbolt.org/z/of3zM4 gives me this (I am not fit enough with assembly to interpret the results properly)拆卸https://godbolt.org/z/of3zM4给了我这个(我不适合组装来正确解释结果)
matmul1(float const*, float const*, float*, unsigned int, unsigned int):
mov w8, 0
mov x7, 0
cbz w3, .L1
.L2:
cbz w4, .L5
ldr s0, [x2, x7, lsl 2]
mov x5, 0
.L6:
add w6, w8, w5
ldr s1, [x1, x5, lsl 2]
add x5, x5, 1
cmp w4, w5
ldr s2, [x0, x6, lsl 2]
fmadd s0, s2, s1, s0
str s0, [x2, x7, lsl 2]
bhi .L6
.L5:
add x7, x7, 1
add w8, w8, w4
cmp w3, w7
bhi .L2
.L1:
ret
matmul2(float const*, float const*, float*, unsigned int, unsigned int):
cbz w3, .L10
sub w7, w4, #1
mov x6, 0
add x7, x7, 1
lsl x7, x7, 2
.L15:
cbz w4, .L12
ldr s0, [x2, x6, lsl 2]
mov x5, 0
.L13:
ldr s2, [x0, x5, lsl 2]
ldr s1, [x1, x5, lsl 2]
add x5, x5, 1
cmp w4, w5
fmadd s0, s2, s1, s0
str s0, [x2, x6, lsl 2]
bhi .L13
add x0, x0, x7
.L12:
add x6, x6, 1
cmp w3, w6
bhi .L15
.L10:
ret
I also ran the code on different optimization levels, different input sizes.我还在不同的优化级别、不同的输入大小上运行代码。 Every time the first function would beat the second function.
每次第一个 function 都会击败第二个 function。 Why is that so?
为什么呢? I expect the first function to be slower than the second, since it has one multiplication more in the inner loop, which is not the case.
我希望第一个 function 比第二个慢,因为它在内部循环中多了一个乘法,但事实并非如此。
I run the code on Raspberry Pi 4 with g++ 8.3.0我使用 g++ 8.3.0 在 Raspberry Pi 4 上运行代码
Update 1:更新1:
I took some timings and actually my suggestions runs slower than the 'matmul2'.我花了一些时间,实际上我的建议比'matmul2'运行得慢。 See below!
见下文!
Try to replace尝试更换
output[y] += *matrix++ * vector[x];
in the second loop with在第二个循环中
output[y] += *(++matrix) * vector[x];
Ie replace the post-increment of the pointer with a pre-increment.即用前增量替换指针的后增量。 If you use a post-increment a temporary copy of the pointer is created and its value is used.
如果您使用后增量,则会创建指针的临时副本并使用其值。 Because this happens each time the run-time increases.
因为每次运行时间增加时都会发生这种情况。 If you use the pre-increment this temporary copy is not necessary.
如果您使用预增量,则不需要此临时副本。
I am not sure, if the compiler can optimize this part.我不确定编译器是否可以优化这部分。 Because you use a pointer it might be not possible to avoid side-effects.
因为您使用指针,所以可能无法避免副作用。 Therefore, it stays untouched.
因此,它保持不变。
Double-check the results because the semantic slightly changes.仔细检查结果,因为语义略有变化。
Update 1:更新1:
I implemented the following functions and took some timings.我实现了以下功能并花了一些时间。
matmul1
is the slowest version. matmul1
是最慢的版本。 matmul2
is the fastest version on my machine. matmul2
是我机器上最快的版本。 I did not expect matmul3
to be slower.我没想到
matmul3
会更慢。 Timings for 1000 repeats and no optimization are as follows: 1000 次重复且无优化的时序如下:
matmul1 - 573.62 ms matmul1 - 573.62 毫秒
matmul2 - 512.58 ms matmul2 - 512.58 毫秒
matmul3 - 534.63 ms matmul3 - 534.63 毫秒
#include <chrono>
#include <iostream>
#include <vector>
using namespace std;
using std::chrono::duration;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;
using std::chrono::milliseconds;
void
long_operation(vector<int>& vec) {
/* Simulating a long, heavy operation. */
for (size_t i = 0; i < vec.size(); ++i)
vec[i] += i;
}
void
matmul1(const float* matrix,
const float* vector,
float* output,
uint32_t input_height,
uint32_t input_width) {
for (uint32_t y = 0; y < input_height; y++) {
for (uint32_t x = 0; x < input_width; x++) {
output[y] += matrix[y * input_width + x] * vector[x];
}
}
}
void
matmul2(const float* matrix,
const float* vector,
float* output,
uint32_t input_height,
uint32_t input_width) {
for (uint32_t y = 0; y < input_height; y++) {
for (uint32_t x = 0; x < input_width; x++) {
output[y] += *matrix++ * vector[x];
}
}
}
void
matmul3(const float* matrix,
const float* vector,
float* output,
uint32_t input_height,
uint32_t input_width) {
for (uint32_t y = 0; y < input_height; y++) {
for (uint32_t x = 0; x < input_width; x++) {
output[y] += *(++matrix) * vector[x];
}
}
}
int
main() {
//--- prepare some test data ---//
uint32_t height = 100;
uint32_t width = 200;
const uint32_t size = height * width;
float matrix[size];
float vector[size];
float output[size];
for (uint32_t i = 0; i < size; ++i) {
matrix[i] = i;
vector[i] = i * i;
output[i] = 0.0;
}
//--- test timings ---//
double time1 = 0.0;
double time2 = 0.0;
double time3 = 0.0;
int repeat = 0;
for (repeat = 0; repeat < 10000; ++repeat) {
//--- version 1
auto t1 = high_resolution_clock::now();
matmul1(matrix, vector, output, height, width);
auto t2 = high_resolution_clock::now();
duration<double, std::milli> ms_double = t2 - t1;
time1 += ms_double.count();
//--- version 2
t1 = high_resolution_clock::now();
matmul2(matrix, vector, output, height, width);
t2 = high_resolution_clock::now();
ms_double = t2 - t1;
time2 += ms_double.count();
//--- version 3
t1 = high_resolution_clock::now();
matmul3(matrix, vector, output, height, width);
t2 = high_resolution_clock::now();
ms_double = t2 - t1;
time3 += ms_double.count();
}
std::cout << "total time 1: " << time1 << "ms\n";
std::cout << "total time 2: " << time2 << "ms\n";
std::cout << "total time 3: " << time3 << "ms\n" << endl;
time1 /= repeat;
time2 /= repeat;
time3 /= repeat;
cout << "average time 1: " << time1 << "ms\n";
cout << "average time 2: " << time2 << "ms\n";
cout << "average time 3: " << time3 << "ms" << endl;
return 0;
}
With full optimization (-O3) the timings are almost the same.使用完全优化(-O3),时间几乎相同。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.