Why do AMD CPUs have such a silly PAUSE timing?

Question

I've developed a monitor object for C++, similar to Java's, with some improvements. The major improvement is that there is a spin loop not only for locking and unlocking but also for waiting on an event. In this case you don't have to lock the mutex yourself; instead you supply a predicate to a wait_poll function. The code repeatedly polls, trying to lock the mutex, and whenever it succeeds it calls the predicate, which returns (or moves) a pair of a bool and the result type.
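The polling wait described above might look roughly like the following. This is only a minimal sketch, not the actual monitor code: the signature of wait_poll, the max_spins parameter, the cpu_relax helper, and the use of std::mutex and std::optional (C++17) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <utility>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }            // PAUSE on x86
#else
#include <thread>
inline void cpu_relax() { std::this_thread::yield(); } // assumed fallback elsewhere
#endif

// Hypothetical sketch: poll up to max_spins times. Each round tries to take
// the mutex without blocking; on success the predicate runs under the lock
// and returns { ready, value }.
template <typename Result, typename Predicate>
std::optional<Result> wait_poll(std::mutex &mtx, std::size_t max_spins, Predicate pred)
{
    for (std::size_t i = 0; i < max_spins; ++i) {
        if (mtx.try_lock()) {
            // Adopt the lock we already hold; it is released at the end of this block.
            std::lock_guard<std::mutex> guard(mtx, std::adopt_lock);
            std::pair<bool, Result> r = pred();
            if (r.first)
                return std::move(r.second);
        }
        cpu_relax(); // executed without holding the mutex
    }
    return std::nullopt; // a real monitor would now fall back to a kernel wait
}
```

A caller would pass something like `wait_poll<int>(mtx, 1000, [&] { return std::make_pair(!queue.empty(), queue.front()); })` and fall back to a blocking wait when the result is empty.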

Waiting for a semaphore and/or an event object (Win32) in the kernel can easily take 1,000 to 10,000 clock cycles, even when the call returns immediately because the semaphore or event was already set. So the spin count has to bear a reasonable relationship to this waiting interval, for example spinning for one tenth of the minimum time spent in the kernel.
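As a rough illustration of that one-tenth rule, the spin count could be derived from two measured inputs. The function name spin_budget and the idea of expressing the cost of one spin iteration in cycles are assumptions for this sketch; it is not the glibc recalculation algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many spin iterations fit into one tenth of the
// minimum kernel wait. Both inputs are measurements supplied by the caller.
constexpr std::uint64_t spin_budget(std::uint64_t kernel_wait_cycles,
                                    std::uint64_t cycles_per_spin)
{
    std::uint64_t const budget_cycles = kernel_wait_cycles / 10; // one tenth of the kernel wait
    std::uint64_t const per_spin = std::max<std::uint64_t>(1, cycles_per_spin); // avoid division by zero
    return std::max<std::uint64_t>(1, budget_cycles / per_spin); // always spin at least once
}
```

With a 1,000-cycle kernel wait and a spin iteration that costs 4 cycles, this yields 25 iterations.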

For my monitor object I took the spin-count recalculation algorithm from glibc, and I'm also using the PAUSE instruction. But I found that on my CPU (TR 3900X) the PAUSE instruction is too fast: about 0.78 ns on average. On Intel CPUs it is much more reasonable at about 30 ns.

This is the code:

#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
    auto start = high_resolution_clock::now();
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
        _mm_pause();
    double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
    cout << ns << endl;
}

Why has AMD chosen such a silly PAUSE timing? PAUSE is for spin-wait loops and should closely match the time it takes for a cache line's content to bounce to a different core and back.

Answer 1

"But I found that on my CPU (TR 3900X) the PAUSE instruction is too fast: about 0.78 ns on average. On Intel CPUs it is much more reasonable at about 30 ns."

The pause instruction has never had anything to do with time and is not intended to be used as a time delay.

What pause is for is to prevent the CPU from wasting its resources by (speculatively) executing many iterations of a loop in parallel. This is especially useful with hyper-threading, where a different logical processor in the core can use those resources, but it also shortens the time it takes to exit the loop when the condition changes (because you don't have "N iterations" of instructions queued up from before the condition changed).
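To make the intended use concrete, here is a minimal sketch of a test-and-test-and-set spin lock with pause in the wait loop. The class name spin_lock and the cpu_relax helper are made up for illustration, and the non-x86 fallback is an assumption.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }            // PAUSE on x86
#else
#include <thread>
inline void cpu_relax() { std::this_thread::yield(); } // assumed fallback elsewhere
#endif

// Hypothetical test-and-test-and-set lock, shown only to illustrate where
// pause sits in a spin-wait loop.
class spin_lock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Try to take the lock with one atomic exchange.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Lock is held: spin on plain loads. pause keeps the core from
            // speculatively queuing up many iterations of this loop.
            while (locked_.load(std::memory_order_relaxed))
                cpu_relax();
        }
    }
    bool try_lock() {
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```

Nothing in this loop relies on pause lasting any particular amount of time; the loop exits as soon as the load observes that the lock is free.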

Given this, on an extremely complex CPU that might have 200 instructions in flight at the same time, pause itself might complete instantly but leave a "200 cycle long" pipeline bubble in its wake; and on an extremely simple CPU ("in order", with no speculative execution) pause may/should do literally nothing (be treated as a nop).

"PAUSE is for spin-wait loops and should closely match the time it takes for a cache line's content to bounce to a different core and back."

No. Assume the cache line is in the "modified" state in a different CPU's cache and the instruction after the pause is something like "cmp [lock],0", which causes the CPU to try to put the cache line into the "shared" state. How long should the CPU waste time doing nothing, for no reason, after the pause but before trying to put the cache line into the "shared" state?

Note: If you actually need a tiny time delay, then you'd want to look at the umwait instruction. You don't need a time delay though; you want a time-out (e.g. "spin with pause until rdtsc says a certain amount of time has passed"). For this I'd be tempted to break it into an inner loop that does "pause and check the condition N times" and an outer loop that does "retry the inner loop if the time limit has not expired yet".
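The inner/outer loop structure suggested above could be sketched as follows. This sketch substitutes std::chrono::steady_clock for rdtsc to stay portable; the function name spin_until, the inner_rounds default of 64, and the cpu_relax helper are assumptions, not part of the answer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }            // PAUSE on x86
#else
#include <thread>
inline void cpu_relax() { std::this_thread::yield(); } // assumed fallback elsewhere
#endif

// Hypothetical helper: spin until cond() is true or the time-out expires.
// Returns true if the condition was observed, false on time-out.
template <typename Condition>
bool spin_until(Condition cond, std::chrono::nanoseconds timeout,
                std::size_t inner_rounds = 64)
{
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        // Inner loop: check the condition and pause, N times, without
        // paying for a clock read on every iteration.
        for (std::size_t i = 0; i < inner_rounds; ++i) {
            if (cond())
                return true;
            cpu_relax();
        }
        // Outer loop: retry the inner loop unless the time limit has expired.
        if (std::chrono::steady_clock::now() >= deadline)
            return cond(); // one last check before giving up
    }
}
```

The time-out is enforced by the clock, so it does not matter whether a single pause takes 0.78 ns or 30 ns; inner_rounds only controls how often the clock is read.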

6 2021-09-26 12:17:48
