
Does lock xchg have the same behavior as mfence?

Original post: 2016-11-03 18:59:41. Tags: multithreading, assembly, x86, cpu-architecture, memory-barriers

What I'm wondering is whether lock xchg will have similar behavior to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. Does it guarantee I get the most up-to-date value? And does that also hold for the memory read/write instructions that follow it?

The reason for my confusion is:

8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”

-Intel 64 Developers Manual Vol. 3

Does this apply across threads?

mfence states:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).

-Intel 64 Developers Manual Vol 3A

This sounds like a stronger guarantee: it sounds like mfence is almost flushing the write buffer, or at least reaching out to the write buffer and other cores to ensure my future loads/stores are up to date.

When benchmarked, both instructions take on the order of ~100 cycles to complete, so I can't see that big of a difference either way.

Primarily I am just confused. I see instructions based around lock used in mutexes, but these contain no memory fences. Then I see lock-free programming that uses memory fences, but no locks. I understand AMD64 has a very strong memory model, but stale values can persist in cache. If lock doesn't have the same behavior as mfence, then how do mutexes help you see the most recent value?
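To make the mutex part of the question concrete, here is a minimal sketch of a test-and-set spinlock, written in C++ rather than assembly. The names `SpinLock` and `count_with_spinlock` are made up for illustration, and the sketch assumes a compiler that lowers `std::atomic<int>::exchange` to an implicitly locked `xchg` on x86, which is typical but not guaranteed. The point it illustrates is that the locked read-modify-write itself provides the ordering: no separate fence instruction appears, yet every thread that acquires the lock sees the counter value left by the previous owner.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock. On x86, exchange() on an atomic int
// is typically compiled to xchg with a memory operand (implicitly locked).
struct SpinLock {
    std::atomic<int> flag{0};

    void lock() {
        // Atomic read-modify-write; spin until the previous value was 0.
        while (flag.exchange(1, std::memory_order_acquire) != 0) {
        }
    }

    void unlock() {
        // A plain store with release ordering is enough to hand the lock over.
        flag.store(0, std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads under the
// lock and returns the final value.
long count_with_spinlock(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;  // only touched while holding the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `count_with_spinlock(4, 10000)` from `main` should return the full count with no lost updates, because each increment happens after the acquiring exchange and before the releasing store.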

1 Answer

I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer ¹ or additional guarantees in some cases.

My current best answer is that Intel intended, and the ISA documentation guarantees, that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.

We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions don't fence a subsequent non-temporal read from WC memory:

MOVNTDQA From WC Memory May Pass Earlier Locked Instructions

Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an earlier locked instruction that accesses a different cache line.

Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.

Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA should insert an MFENCE instruction between the locked instruction and subsequent (V)MOVNTDQA instruction.

From this, we can determine that (1) Intel probably intended locked instructions to fence NT loads from WC-type memory, or else this wouldn't be an erratum ^0.5, and (2) locked instructions don't actually do that, and Intel wasn't able to, or chose not to, fix this with a microcode update; mfence is recommended instead.

In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. This has pretty much the same text as the lock-instruction erratum, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".

This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with an apparently less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or expensive given their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.

However, this doesn't really explain why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum.

One might speculate on what Intel will do in the future. Since they weren't able or willing to change the lock instruction for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so that they do fence such reads, but as a practical matter you probably can't rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.

Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.

Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.

The remaining part of this answer is my original answer, written before I knew about the errata, and is left mostly for historical interest.

Old Answer

My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).

That said, this is just an informed guess and you'll find details of my investigation below.

Details

Documentation

It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).

I think it is safe to say that, solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.
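For ordinary write-back memory, that equivalence can be illustrated with the classic store-buffering litmus test. The sketch below uses C++ atomics instead of raw instructions, and the function names are invented for this example. It assumes a typical x86 compiler, where the seq_cst `exchange` becomes `xchg` and the seq_cst fence becomes either `mfence` or a dummy locked instruction, depending on the compiler. In both variants the outcome where both threads read 0 is forbidden by the C++ memory model, which is exactly the StoreLoad ordering that either kind of barrier is expected to provide.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One side of the store-buffering test: store 1 to `mine`, then load `other`.
// use_fence == true : plain store followed by a full fence.
// use_fence == false: the store is performed by an atomic exchange.
int store_then_load(std::atomic<int>& mine, std::atomic<int>& other,
                    bool use_fence) {
    if (use_fence) {
        mine.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } else {
        mine.exchange(1, std::memory_order_seq_cst);
    }
    return other.load(std::memory_order_seq_cst);
}

// Runs the litmus test `rounds` times and counts how often both threads
// observed 0, which would mean a store was reordered after the later load.
int count_both_zero(bool use_fence, int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread a([&] { r1 = store_then_load(x, y, use_fence); });
        std::thread b([&] { r2 = store_then_load(y, x, use_fence); });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Both `count_both_zero(true, n)` and `count_both_zero(false, n)` should return 0 for any `n`. Replacing the fence and the exchange with a plain relaxed store would make a nonzero count possible, although spawning threads per round makes the reordering hard to observe in practice.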

What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.

For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.

For example, quoting Dr. McCalpin in this thread (emphasis added):

The fence instruction is only needed to be absolutely sure that all of the non-temporal stores are visible before a subsequent "ordinary" store. The most obvious case where this matters is in a parallel code, where the "barrier" at the end of a parallel region may include an "ordinary" store. Without a fence, the processor might still have modified data in the Write-Combining buffers, but pass through the barrier and allow other processors to read "stale" copies of the write-combined data. This scenario might also apply to a single thread that is migrated by the OS from one core to another core (not sure about this case).

I can't remember the detailed reasoning (not enough coffee yet this morning), but the instruction you want to use after the non-temporal stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the SWDM, the MFENCE is the only fence instruction that prevents both subsequent loads and subsequent stores from being executed ahead of the completion of the fence. I am surprised that this is not mentioned in Section 11.3.1, which tells you how important it is to manually ensure coherence when using write-combining, but does not tell you how to do it!

Let's check out the referenced section 8.2.5 of the Intel SDM:

Strengthening or Weakening the Memory-Ordering Model

The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory-ordering model to handle special programming situations. These mechanisms include:

• The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.

• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 processor) provide memory-ordering and serialization capabilities for specific types of memory operations.

These mechanisms can be used as follows:

Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions can be used to (the IN and OUT instructions) impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions are typically used at critical procedure or task boundaries to force completion of all previous instructions before a jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data. The functions of these instructions are as follows:

• SFENCE - Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.

• LFENCE - Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.

• MFENCE - Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

Contrary to Dr. McCalpin's interpretation ², I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to I/O, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They don't make any exception for weakly ordered memory, and in the case of the I/O instructions, one would also assume they need to work in a consistent way with weakly ordered memory regions, since such regions are often used for I/O.

Then, in the section on the FENCE instructions, it explicitly mentions weakly ordered data: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."

Do we read between the lines and take this to mean that these are the only instructions that accomplish this, and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced ³ at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions, which deals specifically with weakly ordered instructions:

The degree to which a consumer of data knows that the data is weakly ordered can vary for these cases. As a result, the SFENCE or MFENCE instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume the data. SFENCE and MFENCE provide a performance-efficient way to ensure ordering by guaranteeing that every store instruction that precedes SFENCE/MFENCE in program order is globally visible before a store instruction that follows the fence.

Again, here the fence instructions are specifically mentioned as appropriate for fencing weakly ordered instructions.

We also find support for the idea that locked instructions might not provide a barrier between weakly ordered accesses in the last sentence already quoted above:

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

Here it basically implies that the FENCE instructions essentially replace a functionality previously offered by the serializing cpuid in terms of memory ordering. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested way, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly ordered scenarios) that lock-prefixed instructions didn't handle, where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.

However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient barriers.

Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid or lock-prefixed instructions, which are generally 20 cycles or more. On the other hand, mfence isn't generally faster than locked instructions ⁴, at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.

So I can't make a definite assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.

We can further look at the documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the destination memory locations.

The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would rather expect this to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints". Indeed, the actual memory type (e.g., as defined by the MTRRs) probably doesn't even come into play here: the ordering issues can arise solely in WB memory when using weakly ordered instructions.
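The producer/consumer pattern that the manual describes can be sketched as follows. This is only an illustration under stated assumptions: `publish_with_nt_stores` is a made-up name, the buffer lives in ordinary WB memory, and the non-temporal path is compiled only on x86-64 (elsewhere the sketch falls back to plain stores so that it still builds). The producer writes the payload with `_mm_stream_si32` (the intrinsic for movnti), issues `_mm_sfence`, and only then sets an ordinary flag. Without the fence, the weakly ordered NT stores could become globally visible after the flag store.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (movnti), _mm_sfence
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

// Fills a buffer with values 1..n using non-temporal stores, fences, then
// publishes a flag. Returns the sum computed by the consumer thread.
long publish_with_nt_stores(int n) {
    assert(n >= 0);
    std::vector<int> buf(static_cast<std::size_t>(n), 0);
    std::atomic<int> ready{0};
    long sum = 0;

    std::thread consumer([&] {
        // Wait for the ordinary flag store, then read the payload.
        while (ready.load(std::memory_order_acquire) == 0) {
        }
        for (int v : buf) sum += v;
    });

    for (int i = 0; i < n; ++i) {
#if HAVE_NT_STORES
        _mm_stream_si32(&buf[static_cast<std::size_t>(i)], i + 1);
#else
        buf[static_cast<std::size_t>(i)] = i + 1;  // fallback: plain store
#endif
    }
#if HAVE_NT_STORES
    _mm_sfence();  // order the NT stores before the flag store below
#endif
    ready.store(1, std::memory_order_release);

    consumer.join();
    return sum;
}
```

Here sfence is sufficient because only store-store ordering is needed between the payload and the flag. Whether a locked instruction in place of the fence would give the same guarantee for NT stores is precisely the question this answer is examining, so the sketch sticks to the fence the manual recommends.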

Performance

The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs based on Agner Fog's instruction timings, but a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.

If mfence provided barrier semantics no stronger than lock cmpxchg, then the latter is doing strictly more work and there is no apparent reason for mfence to take significantly longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. This argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the lock instructions, mfence would simply use the same one, as that's the simplest and easiest to validate.

So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.
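A rough way to check the relative cost on a given machine is a back-to-back timing loop like the sketch below. The names `full_fence` and `ns_per_op` are invented here, and the numbers it produces are wall-clock nanoseconds per operation rather than the cycle counts from Agner Fog's tables, so it is only a coarse sanity check. It calls `_mm_mfence` directly on x86-64 because some compilers implement a seq_cst `std::atomic_thread_fence` with a dummy locked instruction instead of mfence. On other targets it falls back to the standard fence.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_mfence
static inline void full_fence() { _mm_mfence(); }
#else
static inline void full_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // non-x86 fallback
}
#endif

// Target of the locked exchange; kept at namespace scope so the operation
// acts on real shared memory.
std::atomic<long> g_cell{0};

// Times `iterations` back-to-back operations and returns the average cost in
// nanoseconds. use_mfence selects the fence; otherwise an atomic exchange
// (xchg with a memory operand on x86) is measured.
double ns_per_op(bool use_mfence, long iterations) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        if (use_mfence) {
            full_fence();
        } else {
            g_cell.exchange(i, std::memory_order_seq_cst);
        }
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Comparing `ns_per_op(true, 10000000)` with `ns_per_op(false, 10000000)` on an otherwise idle core gives a first impression. Results vary with microarchitecture, microcode version and clock frequency, so they should be read as indicative only.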

^0.5 This isn't a water-tight argument. Some things that appear in errata are apparently "by design" and not a bug, such as the popcnt false dependency on the destination register, so some errata can be considered a form of documentation that updates expectations rather than always implying a hardware bug.

¹ Evidently, the lock-prefixed instructions also perform an atomic operation, which isn't possible to achieve solely with mfence, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios, or to perform better.

² It is also entirely possible that he was reading a different version of the manual where the prose was different.

³ sfence in SSE, lfence and mfence in SSE2.

⁴ And often it's slower: Agner has it listed at 33 cycles latency on recent hardware, while locked instructions are usually about 20 cycles.