
Slowing down CPU Frequency by imposing memory stress

I stressed my system using stress-ng to see how it affects a program I wrote.

The program itself is a neural network written in C++, mainly composed of nested loops doing multiplications, and it uses about 1G of RAM overall.

I imposed some memory stress on the system using:

stress-ng --vm 4 --vm-bytes 2G -t 100s

which creates 4 workers spinning on mmap, allocating 2G of RAM each. This slows down the execution of my program significantly (from about 150ms to 250ms). But the reason for the slowdown is not a lack of memory or memory bandwidth. Instead, the CPU clock frequency decreases from 3.4GHz (without stress-ng) to 2.8GHz (with stress-ng). The CPU utilization stays about the same (99%), as expected.

I measured the CPU frequency using

sudo perf stat -B ./my_program

Does anybody know why memory stress slows down the CPU?

My CPU is an Intel(R) Core(TM) i5-8250U and my OS is Ubuntu 18.04.
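
(In case it's useful: a quick way to watch the clock live while the program runs, rather than only seeing the average perf reports afterwards, is to poll the kernel's reported frequency. turbostat is only available if the linux-tools packages are installed.)

watch -n1 'grep "cpu MHz" /proc/cpuinfo'
sudo turbostat --interval 1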

kind regards lpolari

Skylake-derived CPUs do lower their core clock speed when bottlenecked on loads / stores, at energy vs. performance settings that favour more powersaving. Surprisingly, you can construct artificial cases where this downclocking happens even with stores that all hit in L1d cache, or loads from uninitialized memory (still CoW-mapped to the same zero pages).

Skylake introduced full hardware control of CPU frequency (hardware P-state = HWP); see https://unix.stackexchange.com/questions/439340/what-are-the-implications-of-setting-the-cpu-governor-to-performance for background. The frequency decision can take into account internal performance monitoring, which can notice things like spending most cycles stalled, or what it's stalled on. I don't know exactly which heuristic Skylake uses.
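
(A quick check of what's actually in control, a sketch assuming the intel_pstate driver in active/HWP mode; these sysfs paths are standard on recent kernels but may differ with other drivers:)

cat /sys/devices/system/cpu/intel_pstate/status          # "active" = intel_pstate chooses P-states itself (HWP on Skylake and later)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpufreq/policy0/energy_performance_available_preferences   # the EPP hints the hardware accepts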

You can repro this (see footnote 1) by looping over a large array without making any system calls. If it's large (or you stride through cache lines in an artificial test), perf stat ./a.out will show that the average clock speed is lower than for normal CPU-bound loops.
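
(A rough stand-in without writing any code: use stress-ng itself as the memory-bound loop and compare the GHz perf derives against a purely CPU-bound stressor. The exact flags depend on your stress-ng version; --vm-keep just keeps the mapping instead of re-mmapping, so the worker mostly loops over its buffer.)

perf stat stress-ng --vm 1 --vm-bytes 2G --vm-keep -t 10s
perf stat stress-ng --cpu 1 -t 10s      # should report a noticeably higher average GHz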


In theory, if memory is totally not keeping up with the CPU, lowering the core clock speed (and holding memory controller constant) shouldn't hurt performance much. In practice, lowering the clock speed also lowers the uncore clock speed (ring bus + L3 cache), somewhat worsening memory latency and bandwidth as well.

Part of the latency of a cache miss is getting the request from the CPU core to the memory controller, and single-core bandwidth is limited by max concurrency (outstanding requests one core can track) / latency. (See: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?)
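
(Rough Little's-law illustration with made-up but plausible numbers, not measurements: if one core can keep about 10 cache-line requests in flight and the load latency is about 80 ns, the single-core demand-bandwidth ceiling is roughly 10 * 64 bytes / 80 ns ≈ 8 GB/s. Anything that raises the latency, such as a slower uncore, lowers that ceiling directly.)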

e.g. my i7-6700k drops from 3.9GHz to 2.7GHz when running a microbenchmark that only bottlenecks on DRAM, at default bootup settings. (Also, it only goes up to 3.9GHz instead of the 4.0GHz all-core or 4.2GHz 1-or-2-core turbo configured in the BIOS, with the default balance_power EPP setting on boot or with balance_performance.)

This default doesn't seem very good; it's too conservative for "client" chips, where a single core can nearly saturate DRAM bandwidth, but only at full clock speed. Or it's too aggressive about powersaving, if you look at it from the other POV, especially for chips like my desktop with a high TDP (95W) that can sustain full clock speed indefinitely, even when running power-hungry stuff like x265 video encoding making heavy use of AVX2.

It might make more sense with a ULV 15W chip like your i5-8250U to try to leave more thermal / power headroom for when the CPU is doing something more interesting.


This is governed by their Energy / Performance Preference (EPP) setting. It happens fairly strongly at the default balance_power setting. It doesn't happen at all at full performance, and some quick benchmarks indicate that balance_performance also avoids this powersaving slowdown. I use balance_performance on my desktop.

"Client" (non-Xeon) chips before Ice Lake have all cores locked together so they run at the same clock speed (and will all run higher if even one of them is running something not memory bound, like a while(1) { _mm_pause(); } loop). But there's still an EPP setting for every logical core. I've always just changed the settings for all cores to keep them the same:

On Linux, reading the settings:

$ grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
/sys/devices/system/cpu/cpufreq/policy1/energy_performance_preference:balance_performance
...
/sys/devices/system/cpu/cpufreq/policy7/energy_performance_preference:balance_performance

Writing the settings:

sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;
 do echo balance_performance > "$i"; done'
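
(If the cpupower tool is installed, it's a convenient way to confirm the active driver and the frequency range the current policy allows; the sysfs files above remain the authoritative interface.)

cpupower frequency-info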


Footnote 1: experimental example:

Store 1 dword per cache line, advancing through contiguous cache lines until end of buffer, then wrapping the pointer back to the start. Repeat for a fixed number of stores, regardless of buffer size.

;; t=testloop; nasm -felf64 "$t.asm" && ld "$t.o" -o "$t" && taskset -c 3 perf stat -d -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread ./"$t"

;; nasm -felf64 testloop.asm
;; ld -o testloop testloop.o
;; taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop

; or idq.mite_uops 

default rel
%ifdef __YASM_VER__
;    CPU intelnop
;    CPU Conroe AMD
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:

    lea        rdi, [buf]
    lea        rsi, [endbuf]
;    mov        rsi, qword endbuf           ; large buffer.  NASM / YASM can't actually handle a huge BSS and hit a failed assert (NASM) or make a binary that doesn't reserve enough BSS space.

    mov     ebp, 1000000000

align 64
.loop:
%if 0
      mov  eax, [rdi]              ; LOAD
      mov  eax, [rdi+64]
%else
      mov  [rdi], eax              ; STORE
      mov  [rdi+64], eax
%endif
    add  rdi, 128
    cmp  rdi, rsi
    jae  .wrap_ptr        ; normally falls through, total loop = 4 fused-domain uops
 .back:

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

.wrap_ptr:
   lea  rdi, [buf]
   jmp  .back


section .bss
align 4096
;buf:    resb 2048*1024*1024 - 1024*1024     ; just under 2GiB so RIP-rel still works
buf:    resb 1024*1024 / 64     ; 16kiB = half of L1d

endbuf:
  resb 4096        ; spare space to allow overshoot

Test system: Arch GNU/Linux, kernel 5.7.6-arch1-1. (And NASM 2.14.02, ld from GNU Binutils 2.34.0).

  • CPU: i7-6700k Skylake
  • motherboard: Asus Z170 Pro Gaming, configured in the BIOS for 1 or 2 core turbo = 4.2GHz, 3 or 4 cores = 4.0GHz. But the default EPP setting on boot is balance_power, which only ever goes up to 3.9GHz. My boot script changes it to balance_performance, which still only goes to 3.9GHz so the fans stay quiet, but is less conservative.
  • DRAM: DDR4-2666 (irrelevant for this small test with no cache misses).

Hyperthreading is enabled, but the system is idle and the kernel won't schedule anything on the other logical core (the sibling of the one I pinned it to), so it has a physical core to itself.
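
(To double-check which logical CPU is the HT sibling of the pinned one, sysfs exposes the topology; the "3,7" shown here is just what a typical 4c/8t enumeration looks like, not guaranteed.)

cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list    # e.g. 3,7 -> keep CPU 7 idle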

However, this means perf can't use as many programmable counters for one thread, so perf stat -d (to also monitor L1d loads and replacement, and L3 hit / miss) would mean less accurate measurement of cycles and so on. The stray traffic is negligible anyway: something like 424k L1-dcache-loads (probably in kernel page-fault handlers, interrupt handlers, and other overhead, because the loop has no loads). L1-dcache-load-misses is actually L1D.REPLACEMENT and is even lower, like 48k.

I used a few perf events, including exe_activity.bound_on_stores ("Cycles where the Store Buffer was full and no outstanding load"). (See perf list for descriptions, and/or Intel's manuals for more.)

EPP: balance_power: 2.7GHz downclock out of a 3.9GHz max

EPP setting: balance_power with sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_power > "$i";done'

There is throttling based on what the code is doing: with a pause loop on another core keeping clocks high, this code would run faster. It would also run faster with different instructions in the loop.
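
(A crude shell-only version of that experiment: a plain busy loop rather than a real _mm_pause() loop, but it also keeps the requested clock high while the store loop runs.)

taskset -c 1 sh -c 'while :; do :; done' &
taskset -c 3 perf stat ./testloop
kill $!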

# sudo ... balance_power
$ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t" 

 Performance counter stats for './testloop':

            779.56 msec task-clock:u              #    1.000 CPUs utilized          
            779.56 msec task-clock                #    1.000 CPUs utilized          
                 3      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 6      page-faults               #    0.008 K/sec                  
     2,104,778,670      cycles                    #    2.700 GHz                    
     2,008,110,142      branches                  # 2575.962 M/sec                  
     7,017,137,958      instructions              #    3.33  insn per cycle         
     5,217,161,206      uops_issued.any           # 6692.465 M/sec                  
     7,191,265,987      uops_executed.thread      # 9224.805 M/sec                  
       613,076,394      exe_activity.bound_on_stores #  786.442 M/sec                  

       0.779907034 seconds time elapsed

       0.779451000 seconds user
       0.000000000 seconds sys

By chance, this happened to get exactly 2.7GHz. Usually there's some noise or startup overhead and it's a little lower. Note that 5,217,161,206 front-end uops / 2,104,778,670 cycles = ~2.48 average uops issued per cycle, out of a pipeline width of 4, so this is not low-throughput code. The instruction count is higher than the uop count because the compare/branch and dec/branch pairs macro-fuse. (I could have unrolled more so even more of the instructions were stores, with fewer add and branch instructions, but I didn't.)

(I re-ran the perf stat command a couple times so the CPU wasn't just waking from low-power sleep at the start of the timed interval. There are still page faults in the interval, but 6 page faults are negligible over a 3/4 second benchmark.)

balance_performance: full 3.9GHz, the top speed for this EPP

No throttling based on what the code is doing.

# sudo ... balance_performance
$ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t" 

 Performance counter stats for './testloop':

            539.83 msec task-clock:u              #    0.999 CPUs utilized          
            539.83 msec task-clock                #    0.999 CPUs utilized          
                 3      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 6      page-faults               #    0.011 K/sec                  
     2,105,328,671      cycles                    #    3.900 GHz                    
     2,008,030,096      branches                  # 3719.713 M/sec                  
     7,016,729,050      instructions              #    3.33  insn per cycle         
     5,217,686,004      uops_issued.any           # 9665.340 M/sec                  
     7,192,389,444      uops_executed.thread      # 13323.318 M/sec                 
       626,115,041      exe_activity.bound_on_stores # 1159.827 M/sec                  

       0.540108507 seconds time elapsed

       0.539877000 seconds user
       0.000000000 seconds sys

About the same on a clock-for-clock basis, although slightly more total cycles where the store buffer was full. (That's between the core and L1d cache, not off core, so we'd expect about the same for the loop itself. Using -r10 to repeat 10 times, that number is stable +- 0.01% across runs.)

performance: 4.2GHz, full turbo up to the highest configured frequency

No throttling based on what the code is doing.

# sudo ... performance
taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop

 Performance counter stats for './testloop':

            500.95 msec task-clock:u              #    1.000 CPUs utilized          
            500.95 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 7      page-faults               #    0.014 K/sec                  
     2,098,112,999      cycles                    #    4.188 GHz                    
     2,007,994,492      branches                  # 4008.380 M/sec                  
     7,016,551,461      instructions              #    3.34  insn per cycle         
     5,217,839,192      uops_issued.any           # 10415.906 M/sec                 
     7,192,116,174      uops_executed.thread      # 14356.978 M/sec                 
       624,662,664      exe_activity.bound_on_stores # 1246.958 M/sec                  

       0.501151045 seconds time elapsed

       0.501042000 seconds user
       0.000000000 seconds sys

Overall performance scales linearly with clock speed, so this is a ~1.5x speedup vs. balance_power . (1.44 for balance_performance which has the same 3.9GHz full clock speed.)
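
(Worked out from the runs above: 779.56 ms / 500.95 ms ≈ 1.56, essentially the clock ratio of 4.188 GHz / 2.700 GHz ≈ 1.55; and 779.56 ms / 539.83 ms ≈ 1.44, matching 3.9 GHz / 2.7 GHz.)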

With buffers large enough to cause L1d or L2 cache misses, there's still a difference in core clock cycles.

It's important to remember that modern CPUs, especially those made by Intel, have variable clock frequencies. The CPU will run slowly when lightly loaded to conserve power, which extends battery life, but can ramp up under load.

The limiting factor is thermals: the CPU will only be allowed to get so hot before the frequency is trimmed to reduce power consumption and, by extension, heat generation.

On a chip with more than one core, a single core can be run very quickly without hitting thermal throttling. Two cores must run slower, since they're producing effectively twice the heat, and when using all four cores each has to share a smaller slice of the overall thermal budget.

It's worth checking your CPU temperature while the tests are running, as it will likely be hitting some kind of cap.
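
(A couple of ways to watch it: sensors needs the lm-sensors package, and the thermal zone numbering varies between machines; the sysfs values are in millidegrees Celsius.)

watch -n1 sensors
cat /sys/class/thermal/thermal_zone*/temp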

The last time I looked at this, it was enabling the "energy-efficient Turbo" setting that allowed the processor to do this. Roughly speaking, the hardware monitors the Instructions Per Cycle and refrains from continuing to increase the Turbo frequency if increased frequency does not result in adequate increased throughput. For the STREAM benchmark, the frequency typically dropped a few bins, but the performance was within 1% of the asymptotic performance.

I don't know if Intel has documented how the "Energy Efficient Turbo" setting interacts with all of the various flavors of "Energy-Performance Preference". In our production systems "Energy Efficient Turbo" is disabled in the BIOS, but it is sometimes enabled by default....
