![](/img/trans.png)
[英]Why does ctime clock() give different times when called twice by same program for same operation?
[英]Why would identical copies of the same C loop in the same program take significantly but consistently different times to execute?
我希望我將問題簡化為一個簡單且可重復的測試用例。 源( 在此處 )包含10個相同的簡單循環副本。 每個循環的形式如下:
#define COUNT (1000 * 1000 * 1000)
volatile uint64_t counter = 0;
void loopN(void) {
for (int j = COUNT; j != 0; j--) {
uint64_t val = counter;
val = val + 1;
counter = val;
}
return;
}
變量的'volatile'很重要,因為它強制值在每次迭代時從內存中讀取和寫入。 使用'-falign-loops = 64'將每個循環對齊到64個字節,並生成相同的程序集,除了對全局的相對偏移量:
400880: 48 8b 15 c1 07 20 00 mov 0x2007c1(%rip),%rdx # 601048 <counter>
400887: 48 83 c2 01 add $0x1,%rdx
40088b: 83 e8 01 sub $0x1,%eax
40088e: 48 89 15 b3 07 20 00 mov %rdx,0x2007b3(%rip) # 601048 <counter>
400895: 75 e9 jne 400880 <loop8+0x20>
我在Intel Haswell i7-4470上運行Linux 3.11。 我正在用GCC 4.8.1和命令行編譯程序:
gcc -std=gnu99 -O3 -falign-loops=64 -Wall -Wextra same-function.c -o same-function
我也在源代碼中使用屬性 ((noinline))來使程序集更清晰,但這不是觀察問題所必需的。 我發現shell循環中最快和最慢的函數:
for n in 0 1 2 3 4 5 6 7 8 9;
do echo same-function ${n}:;
/usr/bin/time -f "%e seconds" same-function ${n};
/usr/bin/time -f "%e seconds" same-function ${n};
/usr/bin/time -f "%e seconds" same-function ${n};
done
它產生的結果在運行到運行時大約為1%,最快和最慢函數的確切數量根據確切的二進制布局而變化:
same-function 0:
2.08 seconds
2.04 seconds
2.06 seconds
same-function 1:
2.12 seconds
2.12 seconds
2.12 seconds
same-function 2:
2.10 seconds
2.14 seconds
2.11 seconds
same-function 3:
2.04 seconds
2.04 seconds
2.05 seconds
same-function 4:
2.05 seconds
2.00 seconds
2.03 seconds
same-function 5:
2.07 seconds
2.07 seconds
1.98 seconds
same-function 6:
1.83 seconds
1.83 seconds
1.83 seconds
same-function 7:
1.95 seconds
1.98 seconds
1.95 seconds
same-function 8:
1.86 seconds
1.88 seconds
1.86 seconds
same-function 9:
2.04 seconds
2.04 seconds
2.02 seconds
在這種情況下,我們看到loop2()是最慢的執行之一,而loop6()是最快的之一,差異只有10%以上。 我們通過使用不同的方法重復測試這兩種情況來重新確認這一點:
nate@haswell$ N=2; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
7,180,104,866 cycles # 3.391 GHz
7,169,930,711 cycles # 3.391 GHz
7,150,190,394 cycles # 3.391 GHz
7,188,959,096 cycles # 3.391 GHz
7,177,272,608 cycles # 3.391 GHz
7,093,246,955 cycles # 3.391 GHz
7,210,636,865 cycles # 3.391 GHz
7,239,838,211 cycles # 3.391 GHz
7,172,716,779 cycles # 3.391 GHz
7,223,252,964 cycles # 3.391 GHz
nate@haswell$ N=6; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
6,234,770,361 cycles # 3.391 GHz
6,199,096,296 cycles # 3.391 GHz
6,213,348,126 cycles # 3.391 GHz
6,217,971,263 cycles # 3.391 GHz
6,224,779,686 cycles # 3.391 GHz
6,194,117,897 cycles # 3.391 GHz
6,225,259,274 cycles # 3.391 GHz
6,244,391,509 cycles # 3.391 GHz
6,189,972,381 cycles # 3.391 GHz
6,205,556,306 cycles # 3.391 GHz
考慮到這一點,我們重新閱讀了所有英特爾建築手冊中的每一個字,篩選整個網頁上提到“計算機”或“編程”字樣的每一頁,並在山頂上獨立思考6年。 沒有實現任何啟示,我們歸結為文明,刮胡子,洗個澡,並問StackOverflow的專家:
這可能發生什么?
編輯:在本傑明的幫助下(見下面的答案)我想出了一個更簡潔的測試用例 。 這是一個獨立的20行裝配線。 即使結果保持不變並且執行相同數量的指令,從使用SUB變為SBB也會導致性能差異達到15%。 解釋嗎? 我想我越來越接近一個。
; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
; nasm -felf64 func.asm
; ld func.o
; Then run:
; perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies
; ~15% depending on whether sub or sbb is used in the loop
section .text
global _start
_start:
push qword 0h ; put counter variable on stack
jmp loop ; jump to function
align 64 ; function alignment.
loop:
mov rcx, 1000000000
align 64 ; loop alignment.
l:
mov rax, [rsp]
add rax, 1h
mov [rsp], rax
; sbb rcx, 1h ; which is faster: sbb or sub?
sub rcx, 1h ; switch, time it, and find out
jne l ; (rot13 spoiler: foo vf snfgre ol 15%)
fin: ; If that was too easy, explain why.
mov eax, 60
xor edi, edi ; End of program. Exit with code 0
syscall
查看完整的perf stat輸出,您將看到它不是指令的數量,而是停止的周期數。
看看反匯編,我發現了兩件事:
在我的機器上進行上述更改測試然后產生:
for f in $( seq 7); do perf stat -e cycles -r3 ./same-function $f 2>&1; done|grep cycles 6,070,933,420 cycles ( +- 0.11% ) 6,052,771,142 cycles ( +- 0.06% ) 6,099,676,333 cycles ( +- 0.07% ) 6,092,962,697 cycles ( +- 0.16% ) 6,151,861,993 cycles ( +- 0.69% ) 6,074,323,033 cycles ( +- 0.36% ) 6,174,434,653 cycles ( +- 0.65% )
不過,我不太了解你找到的攤位的性質。
編輯:我在每個函數中都使用了一個volatile成員,在我的I7-3537U上測試了不同的編譯,發現`-falign-loops = 64'實際上是最慢的:
$ gcc -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function
$ gcc -falign-loops=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-l64
$ gcc -falign-functions=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-f64
$ for prog in same-function{,-l64,-f64}; do echo $prog; for f in $(seq 7); do perf stat -e cycles -r10 ./$prog $f 2>&1; done|grep cycl; done
same-function
6,079,966,292 cycles ( +- 0.19% )
7,419,053,569 cycles ( +- 0.07% )
6,136,061,105 cycles ( +- 0.27% )
7,282,434,896 cycles ( +- 0.74% )
6,104,866,406 cycles ( +- 0.16% )
7,342,985,942 cycles ( +- 0.52% )
6,208,373,040 cycles ( +- 0.50% )
same-function-l64
7,336,838,175 cycles ( +- 0.46% )
7,358,913,923 cycles ( +- 0.52% )
7,412,570,515 cycles ( +- 0.38% )
7,435,048,756 cycles ( +- 0.10% )
7,404,834,458 cycles ( +- 0.34% )
7,291,095,582 cycles ( +- 0.99% )
7,312,052,598 cycles ( +- 0.95% )
same-function-f64
6,103,059,996 cycles ( +- 0.12% )
6,116,601,533 cycles ( +- 0.29% )
6,120,841,824 cycles ( +- 0.18% )
6,114,278,098 cycles ( +- 0.09% )
6,105,938,586 cycles ( +- 0.14% )
6,101,672,717 cycles ( +- 0.19% )
6,121,339,944 cycles ( +- 0.11% )
有關align-loops與align-functions的更多詳細信息:
$ for prog in same-function{-l64,-f64}; do sudo perf stat -d -r10 ./$prog 0; done
Performance counter stats for './same-function-l64 0' (10 runs):
2396.608194 task-clock:HG (msec) # 1.001 CPUs utilized ( +- 0.64% )
56 context-switches:HG # 0.024 K/sec ( +- 5.51% )
1 cpu-migrations:HG # 0.000 K/sec ( +- 74.78% )
46 page-faults:HG # 0.019 K/sec ( +- 0.63% )
7,331,450,530 cycles:HG # 3.059 GHz ( +- 0.51% ) [85.68%]
5,332,248,218 stalled-cycles-frontend:HG # 72.73% frontend cycles idle ( +- 0.71% ) [71.42%]
<not supported> stalled-cycles-backend:HG
5,000,800,933 instructions:HG # 0.68 insns per cycle
# 1.07 stalled cycles per insn ( +- 0.04% ) [85.73%]
1,000,446,303 branches:HG # 417.443 M/sec ( +- 0.04% ) [85.75%]
8,461 branch-misses:HG # 0.00% of all branches ( +- 6.05% ) [85.76%]
<not supported> L1-dcache-loads:HG
45,593 L1-dcache-load-misses:HG # 0.00% of all L1-dcache hits ( +- 3.61% ) [85.77%]
6,148 LLC-loads:HG # 0.003 M/sec ( +- 8.80% ) [71.36%]
<not supported> LLC-load-misses:HG
2.394456699 seconds time elapsed ( +- 0.64% )
Performance counter stats for './same-function-f64 0' (10 runs):
1998.936383 task-clock:HG (msec) # 1.001 CPUs utilized ( +- 0.61% )
60 context-switches:HG # 0.030 K/sec ( +- 17.77% )
1 cpu-migrations:HG # 0.001 K/sec ( +- 47.86% )
46 page-faults:HG # 0.023 K/sec ( +- 0.68% )
6,107,877,836 cycles:HG # 3.056 GHz ( +- 0.34% ) [85.63%]
4,112,602,649 stalled-cycles-frontend:HG # 67.33% frontend cycles idle ( +- 0.52% ) [71.41%]
<not supported> stalled-cycles-backend:HG
5,000,910,172 instructions:HG # 0.82 insns per cycle
# 0.82 stalled cycles per insn ( +- 0.01% ) [85.72%]
1,000,423,026 branches:HG # 500.478 M/sec ( +- 0.02% ) [85.77%]
10,660 branch-misses:HG # 0.00% of all branches ( +- 13.23% ) [85.80%]
<not supported> L1-dcache-loads:HG
47,492 L1-dcache-load-misses:HG # 0.00% of all L1-dcache hits ( +- 14.82% ) [85.80%]
11,719 LLC-loads:HG # 0.006 M/sec ( +- 42.44% ) [71.28%]
<not supported> LLC-load-misses:HG
1.997319759 seconds time elapsed ( +- 0.62% )
兩個可執行文件都有非常低的指令/循環計數,可能是由於循環的簡約性質和它造成的內存壓力,但我不知道為什么一個比另一個更差。
我也嘗試過類似的東西
$ for prog in same-function{-l64,-f64}; do sudo perf stat -eL1-{d,i}cache-load-misses,L1-dcache-store-misses,cs,cycles,instructions -r10 ./$prog 0; done
和
$ sudo perf record -F25000 -e'{cycles:pp,stalled-cycles-frontend}' ./same-function-l64 0
[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 6.771 MB perf.data (~295841 samples) ]
$ sudo perf report --group -Sloop0 -n --show-total-period --stdio
$ sudo perf annotate --group -sloop0 --stdio
但找不到罪魁禍首沒有任何成功。 盡管如此,我覺得無論如何都要在這里記下它可能會有所幫助......
編輯2:這是我對same-function.c的補丁:
$ git diff -u -U0
diff --git a/same-function.c b/same-function.c
index f78449e..78a5772 100644
--- a/same-function.c
+++ b/same-function.c
@@ -20 +20 @@ done
-volatile uint64_t counter = 0;
+//volatile uint64_t counter = 0;
@@ -22,0 +23 @@ COMPILER_NO_INLINE void loop0(void) {
+volatile uint64_t counter = 0;
@@ -31,0 +33 @@ COMPILER_NO_INLINE void loop1(void) {
+volatile uint64_t counter = 0;
@@ -40,0 +43 @@ COMPILER_NO_INLINE void loop2(void) {
+volatile uint64_t counter = 0;
@@ -49,0 +53 @@ COMPILER_NO_INLINE void loop3(void) {
+volatile uint64_t counter = 0;
@@ -58,0 +63 @@ COMPILER_NO_INLINE void loop4(void) {
+volatile uint64_t counter = 0;
@@ -67,0 +73 @@ COMPILER_NO_INLINE void loop5(void) {
+volatile uint64_t counter = 0;
@@ -76,0 +83 @@ COMPILER_NO_INLINE void loop6(void) {
+volatile uint64_t counter = 0;
@@ -85,0 +93 @@ COMPILER_NO_INLINE void loop7(void) {
+volatile uint64_t counter = 0;
@@ -94,0 +103 @@ COMPILER_NO_INLINE void loop8(void) {
+volatile uint64_t counter = 0;
@@ -103,0 +113 @@ COMPILER_NO_INLINE void loop9(void) {
+volatile uint64_t counter = 0;
@@ -135 +145 @@ int main(int argc, char** argv) {
-}
\ No newline at end of file
+}
編輯3 :更簡單的例子也是如此:
; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
; nasm -felf64 func.asm
; ld func.o
; Then run:
; perf stat -r10 ./a.out
; Runtime varies ~10% depending on whether
section .text
global _start
_start:
push qword 0h ; put counter variable on stack
jmp loop ; jump to function
;align 64 ; function alignment. Try commenting this
loop:
mov rcx, 1000000000
;align 64 ; loop alignment. Try commenting this
l:
mov rax, [rsp]
add rax, 1h
mov [rsp], rax
sub rcx, 1h
jne l
fin: ; End of program. Exit with code 0
mov eax, 60
xor edi, edi
syscall
這里效果相同。 有趣。
干杯,本傑明
幾年前,我會告訴你檢查CPU到達任何一個循環時內部狀態的差異; 因為眾所周知,這會對無序預測過程(或類似的東西)的容量產生深遠的影響。 例如,同一個循環的性能可能會改變15-20%,具體取決於CPU在進入循環之前所做的事情,只是從兩個不同點跳轉的事實可能足以改變執行速度。
在你的情況下,它是非常值得測試的。 您所要做的就是改變IF塊中指令的順序; 例如替換以下內容:
switch (firstLetter) {
case '0': loop0(); break;
case '1': loop1(); break;
case '2': loop2(); break;
case '3': loop3(); break;
case '4': loop4(); break;
case '5': loop5(); break;
case '6': loop6(); break;
case '7': loop7(); break;
case '8': loop8(); break;
case '9': loop9(); break;
default: goto die_usage;
}
有:
switch (firstLetter) {
case '9': loop9(); break;
case '8': loop8(); break;
case '7': loop7(); break;
case '6': loop6(); break;
case '5': loop5(); break;
case '4': loop4(); break;
case '3': loop3(); break;
case '2': loop2(); break;
case '1': loop1(); break;
case '0': loop0(); break;
default: goto die_usage;
}
或任何隨機順序。 當然,您應該檢查生成的匯編代碼,以確保編譯器沒有重新排序指令的順序。
此外,因為你的循環在個別函數內; 您還應該確保這些函數本身在64字節邊界上對齊。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.