
Why have AMD-CPUs such a silly PAUSE-timing

I've developed a monitor object for C++, similar to Java's, with some improvements. The major improvement is that there is a spin loop not only for locking and unlocking but also for waiting on an event. In this case you don't have to lock the mutex yourself; instead you supply a predicate to a wait_poll function. The code repeatedly tries to lock the mutex by polling, and whenever it succeeds it calls the predicate, which returns (or moves) a pair of a bool and the result type.

Waiting for a semaphore and/or an event object (Win32) in the kernel can easily take 1,000 to 10,000 clock cycles, even when the call returns immediately because the semaphore or event was already set. So the spin count has to bear a reasonable relationship to this waiting interval, e.g. spinning for one tenth of the minimum time spent in the kernel.

For my monitor object I've taken the spin-count recalculation algorithm from glibc, and I'm also using the PAUSE instruction. But I found that on my CPU (TR 3900X) the PAUSE instruction is too fast: it takes about 0.78 ns on average. On Intel CPUs it's much more reasonable at about 30 ns.
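For readers unfamiliar with it, the spin-count recalculation in glibc's adaptive mutexes can be paraphrased roughly as below. This is a sketch written from memory of the glibc logic, not a copy of it and not the code of the monitor object described here: `AdaptiveSpin`, `spin_try_lock`, `cpu_relax` and the cap `kMaxSpins = 100` are hypothetical names and values chosen for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }   // PAUSE on x86
#else
inline void cpu_relax() {}                 // assumption: no-op elsewhere
#endif

// Hypothetical per-lock state: a running estimate of how many spins
// were needed in the past.
struct AdaptiveSpin {
    int spins = 0;
};

constexpr int kMaxSpins = 100;  // assumed upper bound on the spin budget

// Tries to take a simple flag lock by spinning. Returns true if the lock
// was acquired, false if the spin budget ran out (the caller would then
// block in the kernel).
inline bool spin_try_lock(std::atomic<bool>& locked, AdaptiveSpin& st) {
    assert(st.spins >= 0);
    // Budget for this attempt: twice the running estimate plus a small floor.
    int const max_cnt = std::min(kMaxSpins, st.spins * 2 + 10);
    int cnt = 0;
    bool acquired = false;
    for (;;) {
        // Test before test-and-set so waiters mostly read a shared cache line.
        if (!locked.load(std::memory_order_relaxed) &&
            !locked.exchange(true, std::memory_order_acquire)) {
            acquired = true;
            break;
        }
        if (cnt++ >= max_cnt)
            break;
        cpu_relax();
    }
    // Move the estimate one eighth of the way towards the observed count.
    st.spins += (cnt - st.spins) / 8;
    return acquired;
}
```

Note that the budget is counted in loop iterations, not in time, which is why the duration of one PAUSE matters for how long such a loop spins.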

This is the code I used to measure it:

#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
    auto start = high_resolution_clock::now();
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
        _mm_pause();
    double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
    cout << ns << endl;
}
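To make the consequence of the two measurements concrete, here is a small sketch that converts a spin budget in nanoseconds into a number of PAUSE iterations. The helper name `spin_rounds_for_budget` and the 100 ns budget used in the usage note are purely illustrative assumptions; only the per-PAUSE figures (0.78 ns and 30 ns) come from the measurements quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many PAUSE iterations fit into a spin budget,
// given the measured average time of one loop iteration.
inline std::uint64_t spin_rounds_for_budget(double budget_ns, double pause_ns) {
    assert(budget_ns > 0.0 && pause_ns > 0.0);
    double const rounds = budget_ns / pause_ns;
    // Always spin at least once; truncate towards zero otherwise.
    return rounds < 1.0 ? 1 : static_cast<std::uint64_t>(rounds);
}
```

With an assumed budget of 100 ns, `spin_rounds_for_budget(100.0, 0.78)` gives 128 iterations, while `spin_rounds_for_budget(100.0, 30.0)` gives 3. A fixed iteration count tuned on one CPU therefore spins for a very different time on the other.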

Why has AMD chosen such a silly PAUSE timing? PAUSE is for spin-wait loops and should closely match the time it takes for a cache line's content to move to a different core and back.

> But I found that on my CPU (TR 3900X) the pause instruction is too fast. It's about 0.78 ns on average. On Intel CPUs it's much more reasonable at about 30 ns.

The pause instruction has never had anything to do with time and is not intended to be used as a time delay.

What pause is for is to prevent the CPU from wasting its resources by (speculatively) executing many iterations of a loop in parallel. This is especially useful with hyper-threading, where a different logical processor in the same core can use those resources, but it also shortens the time it takes to exit the loop when the condition changes (because you don't have "N iterations" of instructions queued up from before the condition changed).

Given this, for an extremely complex CPU that might have 200 instructions in flight at the same time, pause itself might happen instantly but cause a "200 cycle long" pipeline bubble in its wake; and for an extremely simple CPU ("in order" with no speculative execution), pause may/should do literally nothing (treated as a nop).

> PAUSE is for spin-wait loops and should closely match the time it takes for a cache line's content to move to a different core and back.

No. Assume the cache line is in the "modified" state in a different CPU's cache and the instruction after the pause is something like "cmp [lock],0", which causes the CPU to try to put the cache line into the "shared" state. How long should the CPU waste time doing nothing, for no reason, after the pause but before trying to put the cache line into the "shared" state?

Note: If you actually need a tiny time delay, then you'd want to look at the umwait instruction. You don't need a time delay though; you want a time-out (e.g. "spin with pause until rdtsc says a certain amount of time has passed"). For this I'd be tempted to break it into an inner loop that does "pause and check for condition N times", then an outer loop that does "retry inner loop if time limit not expired yet".
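The inner/outer loop idea could be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: `spin_until`, `cpu_relax` and the default of 64 inner rounds are hypothetical, and `std::chrono::steady_clock` is swapped in for `rdtsc` to keep the sketch portable.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }   // PAUSE on x86
#else
inline void cpu_relax() {}                 // assumption: no-op elsewhere
#endif

// Hypothetical helper: spins until cond() returns true or the time limit
// expires. Returns true if the condition was observed, false on time-out.
template <class Pred>
bool spin_until(Pred&& cond, std::chrono::nanoseconds limit,
                unsigned inner_rounds = 64) {
    assert(inner_rounds > 0);
    auto const deadline = std::chrono::steady_clock::now() + limit;
    do {
        // Inner loop: check the condition and pause N times without
        // paying for a clock read on every iteration.
        for (unsigned i = 0; i < inner_rounds; ++i) {
            if (cond())
                return true;
            cpu_relax();
        }
        // Outer loop: retry the inner loop if the limit has not expired.
    } while (std::chrono::steady_clock::now() < deadline);
    return cond();  // one last check after the deadline
}
```

A caller might write `spin_until([&]{ return ready.load(std::memory_order_acquire); }, std::chrono::microseconds(5))` and fall back to a kernel wait when it returns false. The time-out then bounds the spin in real time regardless of how long one pause takes on a given CPU.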

